## By Dr Gwinyai Nyakuengama

## (25 July 2018)

**INTRODUCTION**

Welcome to our Stata blog!

The point of this blog is to have fun and to showcase Stata's powerful capabilities for logistic regression and data visualization.

**KEY WORDS**

Stata; Logistic Regression; Modelling; Receiver Operating Characteristic (ROC) curve; Specificity; Sensitivity; Customer Churn; Model performance matrix; Cross-validation; Accuracy.

**Research questions**

In the last blog, we presented Survival Data Analysis models in Stata for studying time-to-event outcomes in telco customers, namely churning.

In this blog, we will continue to take advantage of Stata’s expansive data analysis and visualization capabilities to further study the customer characteristics and service history as determinants of churning.

We will attempt to answer the following operational business questions:

- What are the key determinants of service churning, from a customer’s perspective?
- Could the relative importance of those determinants be ranked?
- How reliably can these factors be estimated? How stable are they? What are the shortfalls of such approaches?

**METHOD**

In this blog, we used the same dataset described in the last blog on Survival Data Analysis in Stata.

We imported a csv file into Stata version 15, as described before. We built a logistic regression model with the response variable churning presented as a binary variable with a yes/no response, tested performance and reported the results. We also fitted a validated logistic regression model using half of the dataset to train and the other half to test the model.
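The import and response-coding steps described above could look roughly like the sketch below. The file name `telco_churn.csv` and the raw string column `churn` (holding "Yes"/"No") are assumptions made for illustration; only `b_churn` is a name used in this blog.

```stata
* Hypothetical file name; replace with the path to your own csv file
import delimited "telco_churn.csv", varnames(1) clear

* Assumes the raw column is a string variable called churn with values "Yes"/"No"
generate byte b_churn = (churn == "Yes") if !missing(churn)
label define yesno 0 "No" 1 "Yes"
label values b_churn yesno

* Check the coding of the binary response before modelling
tabulate b_churn, missing
```

Coding the response as 0/1 matters because `logistic` treats any non-zero, non-missing value as a positive outcome.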

**RESULTS**

## Fit a high level regression model

**Stata command**: logistic b_churn SEX SENIORCITIZEN PARTNERED DEPENDENT MULTIPLELINES CONTRACT PAPERLESS TENURE_GROUPS, nolog

**Interpretation**

Results suggest that:

- the fitted regression model was statistically significant, judging by Prob > chi2 = 0.000
- all predictor variables except sex and partnered were highly significant in determining the risk of churning
- statistically significant odds ratios greater than one suggest that:
  - customers on paperless billing had about twice the odds of churning compared with those on paper-based billing
  - customers with multiple lines had nearly twice the odds of churning compared with those with a single line
  - senior citizens had about 1.6 times the odds of churning compared with non-senior citizens
- statistically significant odds ratios less than one suggest that:
  - customers with longer tenures, with contracts and with dependents were less likely to churn

In the main, these results mirror those reported previously for this dataset by Li (2017) and Treselle Engineering (2018), who fitted a logistic regression model using the R programming language.
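For readers less familiar with odds ratios, the link between an odds ratio and a churn probability can be written out as follows. This is standard logistic regression algebra, not output from the model above, and the baseline probability of 0.20 in the example is an assumed, illustrative figure.

$$
\text{odds} = \frac{p}{1-p}, \qquad \mathrm{OR}_k = e^{\beta_k}, \qquad p = \frac{\text{odds}}{1+\text{odds}}
$$

$$
p_0 = 0.20 \;\Rightarrow\; \text{odds}_0 = \frac{0.20}{0.80} = 0.25, \qquad \mathrm{OR} = 2 \;\Rightarrow\; \text{odds}_1 = 0.50 \;\Rightarrow\; p_1 = \frac{0.50}{1.50} \approx 0.33
$$

In other words, an odds ratio of 2 doubles the odds but does not double the probability, which is why the predictive margins later in this blog are a more direct guide to churn probabilities.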

## Test goodness of fit of the model

**Stata command**: lfit, group(10) table

**Interpretation**

Results suggest that the fitted model was a good fit, judging by the non-significant Prob > chi2 statistic of the Hosmer-Lemeshow test.
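As far as we are aware, `lfit` is the older name for this test, and recent Stata versions document the same Hosmer-Lemeshow test under `estat gof`. A minimal sketch, assuming the logistic model has just been fitted:

```stata
* Run immediately after logistic (or logit), while the estimates are still in memory
* group(10) forms ten groups from the quantiles of the predicted probabilities
* table lists observed and expected counts for each group
estat gof, group(10) table
```

A non-significant p-value here means the test found no evidence of poor fit; it does not by itself prove that the model is well specified.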

## Fit a detailed regression model

**Stata command**: logistic b_churn i.SEX i.SENIORCITIZEN i.PARTNERED i.DEPENDENT i.MULTIPLELINES i.CONTRACT i.PAPERLESS i.TENURE_GROUPS , nolog

**Interpretation**

Results:

- confirm those described above; additionally, they show the relative magnitudes of the odds ratios for each level of each predictor variable.
- compared to:
  - month-to-month contracts, the risk of churning decreased the longer the contract
  - 0-12 month tenures, the tendency to churn decreased the longer the tenure.

## Test goodness of fit of the model

**Stata command**: lfit, group(10) table

**Interpretation**

Results suggest that the fitted model was a good fit, judging by the non-significant Prob > chi2 statistic of the Hosmer-Lemeshow test.

## Test multicollinearity

**Stata command**: collin b_churn SEX SENIORCITIZEN PARTNERED DEPENDENT MULTIPLELINES CONTRACT PAPERLESS TENURE_GROUPS

| Variable | VIF | SQRT VIF | Tolerance | R-Squared |
|---|---|---|---|---|
| b_churn | 1.27 | 1.13 | 0.7888 | 0.2112 |
| SEX | 1.00 | 1.00 | 0.9994 | 0.0006 |
| SENIORCITIZEN | 1.13 | 1.06 | 0.8887 | 0.1113 |
| PARTNERED | 1.45 | 1.20 | 0.6912 | 0.3088 |
| DEPENDENTS | 1.37 | 1.17 | 0.7292 | 0.2708 |
| MULTIPLELINES | 1.20 | 1.10 | 0.8334 | 0.1666 |
| CONTRACT | 2.02 | 1.42 | 0.4942 | 0.5058 |
| PAPERLESS | 1.11 | 1.06 | 0.8982 | 0.1018 |
| TENURE_GROUPS | 2.20 | 1.48 | 0.4542 | 0.5458 |
| Mean VIF | 1.42 | | | |

| | Eigenval | Cond Index |
|---|---|---|
| 1 | 7.2918 | 1 |
| 2 | 0.9687 | 2.7436 |
| 3 | 0.7112 | 3.2021 |
| 4 | 0.5048 | 3.8005 |
| 5 | 0.1789 | 6.3843 |
| 6 | 0.1157 | 7.9376 |
| 7 | 0.0915 | 8.9248 |
| 8 | 0.0672 | 10.4135 |
| 9 | 0.0493 | 12.1597 |
| 10 | 0.0207 | 18.773 |

Condition Number: 18.773

Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)

Det(correlation matrix): 0.2128

**Interpretation**

Results do not suggest serious multicollinearity issues, since the mean and individual Variance Inflation Factors (VIF) are well below 4.

Note:

A VIF of 1 means that there is no correlation between the $k$-th predictor and the remaining predictor variables, and hence the variance of its coefficient $b_k$ is not inflated at all.
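The columns of the collinearity output are tied together by the standard definitions below, where $R_k^2$ is the R-squared from regressing the $k$-th variable on all the others and $\lambda_j$ is the $j$-th eigenvalue. The worked numbers are taken from the TENURE_GROUPS row and the second eigenvalue row of the output above.

$$
\mathrm{Tolerance}_k = 1 - R_k^2, \qquad \mathrm{VIF}_k = \frac{1}{1 - R_k^2}, \qquad \text{Cond Index}_j = \sqrt{\frac{\lambda_{\max}}{\lambda_j}}
$$

$$
\text{TENURE\_GROUPS:}\quad 1 - 0.5458 = 0.4542, \qquad \frac{1}{0.4542} \approx 2.20, \qquad \sqrt{2.20} \approx 1.48
$$

$$
\text{Eigenvalue 2:}\quad \sqrt{\frac{7.2918}{0.9687}} \approx 2.7436
$$

These checks reproduce the reported Tolerance, VIF, SQRT VIF and Cond Index values, which is a quick way to confirm that the table has been read correctly.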

## Model performance assessment

### Receiver Operating Characteristic (ROC) curve

**Stata command**:

predict double xb, xb

roctab b_churn xb

**Interpretation**

Results suggest that the fitted logistic model correctly classified churning / non-churning cases with an overall accuracy of 78%. While statistical methods are usually not directly comparable between studies, the current result closely mirrors those previously reported for this dataset by Li (2017) and Treselle Engineering (2018). These authors used the R programming language to fit a logistic regression.
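An overall accuracy figure of this kind is usually read from a classification table. The sketch below shows one way to request such a table after the logistic fit; it is not necessarily the exact command used for this blog, and the alternative cut-off of 0.3 is an illustrative assumption.

```stata
* Classification table at the default probability cut-off of 0.5
* Reports sensitivity, specificity and the percentage correctly classified
estat classification

* Same table at a lower, purely illustrative cut-off
* A lower cut-off flags more customers as churners
estat classification, cutoff(0.3)
```

Lowering the cut-off typically raises sensitivity at the expense of specificity, so the "correctly classified" percentage depends on the cut-off chosen.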

Notes:

**True-positive** rate is also known as Sensitivity, recall or probability of detection.

**True-negative** rate is also known as Specificity. It measures the proportion of actual negatives that are correctly identified.

### Sensitivity / Specificity analysis vs Probability cut-off

**Stata command**: lsens

Notes:

The probability cut-off point determines the sensitivity (the fraction of actual churners that the model classifies as churners) and the specificity (the fraction of actual non-churners that the model classifies as non-churners).
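In terms of the counts in a classification table, these quantities can be written as below, where TP, FN, TN and FP are our shorthand for true positives, false negatives, true negatives and false positives at a given cut-off.

$$
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}
$$

Because every count depends on the cut-off, the `lsens` plot traces how sensitivity and specificity trade off as the cut-off moves from 0 to 1.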

### Receiver Operating Characteristic (ROC) curve analysis

**Stata command**:

roctab b_churn xb

roctab b_churn xb, graph // with graph

**Interpretation**

The above results suggest that our logistic regression model was good at picking out churners, judging by its area under the ROC curve of 81%.

Statistics around the ROC estimate are shown in the accompanying table, above.

Notes:

The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier, in our case the logistic regression, as its discrimination threshold is varied.
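The area under the ROC curve depends only on how the model ranks customers, so it should be the same whether the linear predictor or the predicted probability is used. A minimal sketch, assuming the logistic model is still in memory; `p_hat` is a hypothetical variable name.

```stata
* Predicted probability of churning (pr is the default statistic after logistic)
predict double p_hat, pr

* Area under the ROC curve from the predicted probabilities
roctab b_churn p_hat

* Built-in post-estimation alternative; nograph suppresses the plot
lroc, nograph
```

Both commands should report the same area as `roctab b_churn xb`, because the probability is a monotone transformation of the linear predictor.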

### Predictive margins

This section shows the predictive margin statistics and plots for predictor variables used in our logistic regression model. Most importantly, we use the margins to get the predicted probabilities of customers to churn on account of the predictor variables.

**Stata command**: margins SENIORCITIZEN

marginsplot

**Interpretation**

Results suggest that if every customer's other characteristics remained as observed, but no one were a senior citizen, we would expect about 25% to churn. If everyone were a senior citizen, we would expect about 33% to churn, which means that senior citizens were more likely to churn.
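To make the idea of a predictive margin concrete, the point estimates can be reproduced by hand. This is a sketch under two assumptions: the detailed model with `i.SENIORCITIZEN` has just been fitted, and `SENIORCITIZEN` is coded 0/1. The variables `p_nonsenior` and `p_senior` are hypothetical names.

```stata
* Work on a temporary copy of the data so the real values are restored afterwards
preserve

* Pretend nobody is a senior citizen, keeping all other covariates as observed
replace SENIORCITIZEN = 0
predict double p_nonsenior, pr

* Pretend everybody is a senior citizen
replace SENIORCITIZEN = 1
predict double p_senior, pr

* The two means correspond to the margins for SENIORCITIZEN = 0 and = 1
summarize p_nonsenior p_senior if e(sample)

restore
```

The `margins` command does the same averaging and, in addition, supplies standard errors and confidence intervals, which this manual version does not.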

**Stata command**: margins SEX

marginsplot, xdimension(SEX)

**Interpretation**

Results suggest that if every customer's other characteristics remained as observed, but everyone were female, we would expect about 27% to churn. If everyone were male, we would expect about 26% to churn, which effectively means no gender effect on the probability of churning.

**Stata command**: margins PARTNERED

marginsplot, xdimension(PARTNERED)

**Interpretation**

Results suggest that if every customer's other characteristics remained as observed, but no one were partnered, we would expect about 26% to churn. If everyone were partnered, we would expect about 27% to churn, which effectively means no partner effect on the probability of churning.

**Stata command**: margins PAPERLESS

marginsplot, xdimension(PAPERLESS)

**Interpretation**

Results suggest that if every customer's other characteristics remained as observed, but no one were on a paperless plan, we would expect about 20% to churn. If everyone were on a paperless plan, we would expect about 30% to churn, which means more would churn if on a paperless plan.

**Stata command**: margins DEPENDENTS

marginsplot, xdimension(DEPENDENTS)

**Interpretation**

Results suggest that if every customer's other characteristics remained as observed, but no one had dependents, we would expect about 28% to churn. If everyone had dependents, we would expect about 23% to churn, which means fewer would churn if they had dependents.

**Stata command**: margins MULTIPLELINES

marginsplot, xdimension(MULTIPLELINES)

**Interpretation**

Results suggest that if every customer's other characteristics remained as observed, but no one had multiple lines, we would expect about 23% to churn. If everyone had multiple lines, we would expect about 32% to churn, which means more would churn if they had multiple lines.

**Stata command**: margins TENURE_GROUPS

marginsplot, xdimension(TENURE_GROUPS)

**Interpretation**

Results suggest that if every customer's other characteristics remained as observed, but everyone were on a short tenure (0-12 months), we would expect about 38% to churn. With longer and longer tenures, the propensity to churn would progressively decrease, down to 15% in customers with tenures longer than 60 months.

### Cross-validated model

**Interpretation**

A cross-validated logistic regression model yielded similar results to the full model (area under the ROC curve = 81%). No further analysis was required.

Notes:

Cross validation was performed using a user-written Stata ado-file called *CrossVal* (see https://github.com/MIT-LCP/aline-mimic-ii/blob/master/Data_Analysis/STATA/crossval.ado ).
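For readers who prefer not to install a user-written program, the half-train / half-test idea described in the METHOD section can be sketched with built-in commands. This is a simple hold-out split, not the *CrossVal* routine itself; the seed value and the variables `train` and `p_cv` are assumptions made for illustration.

```stata
* Fix the random-number seed so the split is reproducible
set seed 12345

* Assign roughly half of the customers to the training set
generate byte train = runiform() < 0.5

* Fit the detailed model on the training half only
logistic b_churn i.SEX i.SENIORCITIZEN i.PARTNERED i.DEPENDENTS ///
    i.MULTIPLELINES i.CONTRACT i.PAPERLESS i.TENURE_GROUPS if train == 1, nolog

* predict fills in probabilities for all observations, including the test half
predict double p_cv, pr

* Area under the ROC curve on the held-out test half
roctab b_churn p_cv if train == 0
```

If the test-half area is close to the value from the full model, that suggests the model is not badly over-fitted, which is the conclusion reported above.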

**CONCLUSIONS**

In this blog, we conclude that:

- Overall the key determinants of customer service churning were tenure group, paperless, multiple-lines plans, contract type, senior citizen status and having dependents. Gender and partnership status had no influence on the likelihood to churn, in this study.
- To echo the words of Nyakuengama (2017), “*Stata is a state-of-the-art tool-of-choice which facilitates timely, efficient, accurate and trusted evidence-based decision making*”:
  - Stata syntax is consistent and simple to use for logistic regression;
  - Stata is methodologically rigorous and is backed up by model validation and post-estimation tests;
  - the current logistic regression results from Stata were reliable, with an accuracy of 78% and an area under the ROC curve of 81%.

- Results from this blog closely matched those reported by Li (2017) and Treselle Engineering (2018), who separately used R programming to study churning in the same dataset used here.

**FUTURE BLOGS**

In this short blog, we had fun and demonstrated the benefits of using Stata to undertake rigorous logistic regression and, more importantly, provided further insights into customer churning.

Nonetheless, further insights may be obtainable when the structure and order within the dataset are also considered. There seems to be a logical hierarchy and / or sub-grouping of personal customer characteristics, their access types, service types and payment types. There may even be interactions between these.

In our future blogs we will try to investigate these issues using more sophisticated and advanced regression techniques now available in Stata version 15.

**BIBLIOGRAPHY**

J.G. Nyakuengama (2017): Stata A Key Strategic Statistical tool-of-choice in major impact evaluations of socioeconomic programs. 2017 Oceania Stata Users Group Meeting *https://www.stata.com/meeting/oceania17/slides/oceania17_Nyakuengama.pdf*

L. Oldja (2018): Survival Analysis to Explore Customer Churn in Python *https://towardsdatascience.com/survival-analysis-in-python-a-model-for-customer-churn-e737c5242822*

Treselle Engineering (2018): Customer Churn – Logistic Regression with R *http://www.treselle.com/blog/customer-churn-logistic-regression-with-r/*

S. Li (2017): Predict Customer Churn with R *https://towardsdatascience.com/predict-customer-churn-with-r-9e62357d47b4*