PREDICTIVE DATA ANALYSIS AND VISUALIZATION IN STATA – PART 1: LOGISTIC REGRESSION

By Dr Gwinyai Nyakuengama

(25 July 2018)

 

INTRODUCTION

Welcome to our Stata blog!

The point of this blog is to have fun and to showcase Stata’s powerful capabilities for logistic regression and data visualization.

 

 

KEY WORDS

Stata; Logistic Regression; Modelling; Receiver Operating Characteristic (ROC) curve; Specificity; Sensitivity; Customer Churn; Model performance matrix; Cross-validation; Accuracy.

 

RESEARCH QUESTIONS

In the last blog, we presented Survival Data Analysis models in Stata for studying time-to-event outcomes in telco customers, namely churning.

In this blog, we will continue to take advantage of Stata’s expansive data analysis and visualization capabilities to further study customer characteristics and service history as determinants of churning.

We will attempt to answer the following operational business questions:

  • What are the key determinants of service churning, from a customer’s perspective?
  • Can the relative importance of those determinants be ranked?
  • How reliably can these factors be estimated? How stable are they? What are the shortfalls of such approaches?

 

METHOD

We used the same dataset described in our last blog on Survival Data Analysis in Stata.

We imported a csv file into Stata version 15, as described before. We built a logistic regression model with churning as the binary (yes/no) response variable, tested its performance and reported the results. We also fitted a validated logistic regression model, using half of the dataset to train the model and the other half to test it.

 

RESULTS

Fit a high-level regression model

Stata command: logistic b_churn SEX SENIORCITIZEN PARTNERED DEPENDENT MULTIPLELINES CONTRACT PAPERLESS TENURE_GROUPS, nolog

T1.jpg

 

Interpretation

Results suggest that:

  • the fitted regression model was statistically significant, judging by the overall test (Prob > chi2 = 0.000)
  • all predictor variables except sex and partnered were highly significant in determining the risk of churning
  • statistically significant odds ratios greater than one suggest that:
    • customers on paperless plans had about twice the odds of churning compared to those not on paperless plans
    • customers with multiple lines had nearly twice the odds of churning compared to those with a single line
    • senior citizens had about 1.6 times the odds of churning compared to non-senior citizens
  • statistically significant odds ratios less than one suggest that:
    • customers with longer tenures, with longer contracts and with dependents were less likely to churn

In the main, these results mirror those reported previously for this dataset by Li (2017) and Treselle Engineering (2018), who fitted a logistic regression model in the R programming language.
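As a reminder of how to read the table above, an odds ratio compares odds, not probabilities. The sketch below writes out the standard logistic model and then works a small example; the baseline probability of 0.20 is an illustrative assumption and is not taken from the fitted model.

```latex
\[
\text{odds}(p) = \frac{p}{1-p}, \qquad
\log\frac{p}{1-p} = \beta_0 + \sum_k \beta_k x_k, \qquad
\mathrm{OR}_k = e^{\beta_k}
\]
% Illustrative values only (not from the fitted model):
\[
p = 0.20 \;\Rightarrow\; \text{odds} = 0.25; \qquad
2 \times 0.25 = 0.50 \;\Rightarrow\; p = \frac{0.50}{1.50} \approx 0.33
\]
```

So an odds ratio of 2 moves a 20% baseline probability to about 33%, not to 40%.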

 

Test goodness of fit of the model

Stata command: lfit, group(10) table

T2.jpg

 

Interpretation

Results suggest that there was no evidence of lack of fit, judging by the non-significant Prob > chi2 statistic of the Hosmer-Lemeshow test.
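In current Stata releases the documented form of this test is estat gof, run straight after the model. A minimal sketch, assuming the churn dataset is already in memory with the variable names used above:

```stata
* Assumes the churn data are in memory (variable names as in this blog).
logistic b_churn SEX SENIORCITIZEN PARTNERED DEPENDENT MULTIPLELINES ///
    CONTRACT PAPERLESS TENURE_GROUPS, nolog

* Hosmer-Lemeshow test on 10 groups of predicted probabilities,
* with the table of observed and expected counts per group
estat gof, group(10) table
```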

 

Fit a detailed regression model

Stata command: logistic b_churn i.SEX i.SENIORCITIZEN i.PARTNERED i.DEPENDENT i.MULTIPLELINES i.CONTRACT i.PAPERLESS i.TENURE_GROUPS , nolog

T4.jpg

Interpretation

Results:

  • confirm those described above and, additionally, show the relative magnitudes of the odds ratios for the levels of each predictor variable
  • suggest that, compared to
    • month-to-month contracts, the risk of churning decreased the longer the contract
    • the 0-12 month tenure group, the tendency to churn decreased the longer the tenure.
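With factor variables, each level gets its own odds ratio, so it can help to test all levels of a predictor jointly. The sketch below is one way to do that; it assumes the detailed model above has just been fitted and is not part of the original analysis.

```stata
* Assumes the detailed model with i. factor variables has just been fitted.
* Joint Wald test that all CONTRACT level indicators are zero
testparm i.CONTRACT

* Same joint test for the tenure bands
testparm i.TENURE_GROUPS
```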

 

Test goodness of fit of the model

Stata command: lfit, group(10) table

T5.jpg

Interpretation

Results suggest that there was no evidence of lack of fit, judging by the non-significant Prob > chi2 statistic of the Hosmer-Lemeshow test.

 

Test multicollinearity

Stata command: collin b_churn SEX SENIORCITIZEN PARTNERED DEPENDENT MULTIPLELINES CONTRACT PAPERLESS TENURE_GROUPS

Variable          VIF     SQRT VIF   Tolerance   R-Squared
b_churn           1.27    1.13       0.7888      0.2112
SEX               1.00    1.00       0.9994      0.0006
SENIORCITIZEN     1.13    1.06       0.8887      0.1113
PARTNERED         1.45    1.20       0.6912      0.3088
DEPENDENTS        1.37    1.17       0.7292      0.2708
MULTIPLELINES     1.20    1.10       0.8334      0.1666
CONTRACT          2.02    1.42       0.4942      0.5058
PAPERLESS         1.11    1.06       0.8982      0.1018
TENURE_GROUPS     2.20    1.48       0.4542      0.5458
Mean VIF          1.42

      Eigenval   Cond Index
1     7.2918     1
2     0.9687     2.7436
3     0.7112     3.2021
4     0.5048     3.8005
5     0.1789     6.3843
6     0.1157     7.9376
7     0.0915     8.9248
8     0.0672     10.4135
9     0.0493     12.1597
10    0.0207     18.773

Condition Number 18.773
Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)
Det(correlation matrix)    0.2128

Interpretation

Results do not suggest serious multicollinearity issues, since the mean and individual Variance Inflation Factors (VIF) are well below 4.

 

Note:

A VIF of 1 means that there is no correlation between the kth predictor and the remaining predictor variables, and hence the variance of its coefficient estimate, bk, is not inflated at all.
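The columns of the table above are linked by the standard definitions below, where R-squared for the kth predictor comes from regressing it on the remaining variables. The worked line uses the TENURE_GROUPS row of the table.

```latex
\[
\mathrm{Tolerance}_k = 1 - R_k^2, \qquad
\mathrm{VIF}_k = \frac{1}{1 - R_k^2} = \frac{1}{\mathrm{Tolerance}_k}
\]
% TENURE_GROUPS row: R^2 = 0.5458
\[
\mathrm{Tolerance} = 1 - 0.5458 = 0.4542, \qquad
\mathrm{VIF} = \frac{1}{0.4542} \approx 2.20
\]
```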

 

Model performance assessment

 

Receiver Operating Characteristic (ROC) curve

Stata commands:
predict double xb, xb
roctab b_churn xb

T7

Interpretation

Results suggest that the fitted logistic model correctly classified churning / non-churning cases with an overall accuracy of 78%. While statistical methods are usually not directly comparable between studies, this result closely mirrors those previously reported for this dataset by Li (2017) and Treselle Engineering (2018), who fitted logistic regressions in the R programming language.
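An overall accuracy figure of this kind is what Stata's classification table reports. The sketch below shows one way to obtain it, assuming a logistic model for b_churn has just been fitted; the 0.3 cut-off is an arbitrary illustrative value, not one used in this blog.

```stata
* Assumes a logistic model for b_churn has just been fitted.
* Classification table at the default 0.5 probability cut-off;
* the "Correctly classified" line gives the overall accuracy
estat classification

* Same table at a lower cut-off (0.3 is an arbitrary illustrative value)
estat classification, cutoff(0.3)
```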

 

Notes:

True-positive rate is also known as Sensitivity, recall or probability of detection.

True-negative rate is also known as Specificity. It measures the proportion of actual negatives that are correctly identified.

 

Sensitivity / Specificity analysis vs Probability cut-off

Stata command: lsens

Slide8-1.jpg

 

Notes:

The probability cut-off point determines the sensitivity (true positives as a fraction of all churners) and the specificity (true negatives as a fraction of all non-churners).
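Written out, with a customer classified as a churner when the predicted probability is at least the cut-off c, the standard definitions are as follows (TP, FN, TN and FP are the usual true/false positive/negative counts at that cut-off, and N is the number of customers):

```latex
\[
\mathrm{Sensitivity}(c) = \frac{TP(c)}{TP(c) + FN(c)}, \qquad
\mathrm{Specificity}(c) = \frac{TN(c)}{TN(c) + FP(c)}
\]
\[
\mathrm{Accuracy}(c) = \frac{TP(c) + TN(c)}{N}
\]
```

Raising c lowers sensitivity and raises specificity, which is the trade-off the lsens plot displays.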

 

 

Receiver Operating Characteristic (ROC) curve analysis

Stata commands:

roctab b_churn xb
roctab b_churn xb, graph // with graph

T8

Slide9

Interpretation

The above results suggest that our logistic regression model was good at discriminating between churners and non-churners, judging by its area under the ROC curve of 0.81 (81%).

Statistics around the ROC estimate are shown in the accompanying table, above.

 

Notes:

The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier, in our case the logistic regression, as its discrimination threshold is varied.
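For readers who want to reproduce the curve, here are two routes that should give the same area, since the linear predictor and the predicted probability rank customers identically. This is a sketch assuming a logistic model for b_churn has just been fitted; p_hat is a variable name chosen here for illustration.

```stata
* Assumes a logistic model for b_churn has just been fitted.
* ROC curve and area under the curve straight from the fitted model
lroc

* Equivalent route via predicted probabilities (p_hat is a name chosen here)
predict double p_hat, pr
roctab b_churn p_hat, graph summary
```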

 

Predictive margins

This section shows the predictive margin statistics and plots for predictor variables used in our logistic regression model. Most importantly, we use the margins to get the predicted probabilities of customers to churn on account of the predictor variables.

Stata commands:
margins SENIORCITIZEN
marginsplot

T11

Slide16

Interpretation

Results suggest that if the distribution of the other predictors remained the same in the population, but no one was a senior citizen, we would expect about 25% to churn. If everyone were a senior citizen, we would expect about 33% to churn, which effectively means senior citizens were more likely to churn.
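The gap between the two margins can also be estimated directly, with its own standard error. A minimal sketch, assuming the detailed model with i.SENIORCITIZEN has just been fitted; the dydx() step is an addition for illustration and was not part of the original analysis.

```stata
* Assumes the detailed model with i.SENIORCITIZEN has just been fitted.
* Average predicted churn probability with everyone set to each level
margins SENIORCITIZEN

* Difference between the two levels (discrete change in probability),
* reported with a standard error and confidence interval
margins, dydx(SENIORCITIZEN)
```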

 

 

Stata commands:
margins SEX
marginsplot, xdimension(SEX)

T444

Slide14

Interpretation

Results suggest that if the distribution of the other predictors remained the same in the population, but everyone were female, we would expect about 27% to churn. If everyone were male, we would expect about 26% to churn, which effectively means no gender effect on the probability of churning.

 

 

Stata commands:
margins PARTNERED
marginsplot, xdimension(PARTNERED)

T12

Slide18

Interpretation

Results suggest that if the distribution of the other predictors remained the same in the population, but no one was partnered, we would expect about 26% to churn. If everyone were partnered, we would expect about 27% to churn, which effectively means no partner effect on the probability of churning.

 

Stata commands:
margins PAPERLESS
marginsplot, xdimension(PAPERLESS)

T16

Slide26.jpg

Interpretation

Results suggest that if the distribution of the other predictors remained the same in the population, but no one was on a paperless plan, we would expect about 20% to churn. If everyone were on a paperless plan, we would expect about 30% to churn, which effectively means more would churn if on a paperless plan.

 

Stata commands:
margins DEPENDENTS
marginsplot, xdimension(DEPENDENTS)

T13

Slide20

Interpretation

Results suggest that if the distribution of the other predictors remained the same in the population, but no one had dependents, we would expect about 28% to churn. If everyone had dependents, we would expect about 23% to churn, which effectively means fewer would churn if they had dependents.

 

Stata commands:
margins MULTIPLELINES
marginsplot, xdimension(MULTIPLELINES)

T14

Slide22

Interpretation

Results suggest that if the distribution of the other predictors remained the same in the population, but no one had multiple lines, we would expect about 23% to churn. If everyone had multiple lines, we would expect about 32% to churn, which effectively means more would churn if they had multiple lines.

 

Stata commands:
margins TENURE_GROUPS
marginsplot, xdimension(TENURE_GROUPS)

T17

Slide28

Interpretation

Results suggest that if the distribution of the other predictors remained the same in the population, but everyone were on short tenures (0-12 months), we would expect about 38% to churn. With progressively longer tenures, the propensity to churn would decrease, down to 15% for customers with tenures longer than 60 months.

 

Cross-validated model

crossvalided-ROC.png

Interpretation

Results from a cross-validated logistic regression model were similar to those from the full model (area under the ROC curve = 81%). No further analysis was required.

 

Notes:

Cross-validation was performed using a user-written Stata ado file called CrossVal (see https://github.com/MIT-LCP/aline-mimic-ii/blob/master/Data_Analysis/STATA/crossval.ado ).
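For readers without that ado file, the half-train / half-test idea from the Method section can be sketched with built-in commands only. This is a simple hold-out split, not the CrossVal routine itself; the seed and the variable names train and p_test are illustrative assumptions.

```stata
* Assumes the churn data are in memory (variable names as in this blog).
* The seed and the names train and p_test are illustrative choices.
set seed 12345
generate byte train = runiform() < 0.5

* Fit the detailed model on the training half only
logistic b_churn i.SEX i.SENIORCITIZEN i.PARTNERED i.DEPENDENT ///
    i.MULTIPLELINES i.CONTRACT i.PAPERLESS i.TENURE_GROUPS if train, nolog

* Predict for every customer, then assess the ROC area on the held-out half
predict double p_test, pr
roctab b_churn p_test if !train
```

If the held-out area under the ROC curve is close to the full-sample value, the model is not simply memorising the training data.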

 

CONCLUSIONS

In this blog, we conclude that:

  • Overall, the key determinants of customer service churning were tenure group, paperless and multiple-lines plans, contract type, senior citizen status and having dependents. Gender and partnership status had no influence on the likelihood of churning in this study.
  • To echo the words of Nyakuengama (2017), “Stata is a state-of-the-art tool-of-choice which facilitates timely, efficient, accurate and trusted evidence-based decision making”:
    • Stata syntax is consistent and simple to use for logistic regressions;
    • Stata is methodologically rigorous and is backed up by model validation and post-estimation tests;
    • the current logistic regression results from Stata were reliable, with an accuracy of 78% and an area under the ROC curve of 81%.
  • Results from this blog closely matched those reported by Li (2017) and Treselle Engineering (2018), who separately used R programming to study churning in the same dataset used here.

 

FUTURE BLOGS

In this short blog, we had fun and demonstrated the benefits of using Stata to undertake rigorous logistic regression and, more importantly, provided further insights into customer churning.

Nonetheless, further insights may be obtainable when the structure and order within the dataset are also considered. There seems to be a logical hierarchy and / or sub-grouping of personal customer characteristics, their access types, service types and payment types. There may even be interactions between these.

In our future blogs we will try to investigate these issues using more sophisticated and advanced regression techniques now available in Stata version 15.

 

BIBLIOGRAPHY
J.G. Nyakuengama (2017): Stata A Key Strategic Statistical tool-of-choice in major impact evaluations of socioeconomic programs. 2017 Oceania Stata Users Group Meeting  https://www.stata.com/meeting/oceania17/slides/oceania17_Nyakuengama.pdf 

L. Oldja (2018): Survival Analysis to Explore Customer Churn in Python https://towardsdatascience.com/survival-analysis-in-python-a-model-for-customer-churn-e737c5242822

Treselle Engineering (2018): Customer Churn – Logistic Regression with R http://www.treselle.com/blog/customer-churn-logistic-regression-with-r/

S. Li (2017): Predict Customer Churn with R https://towardsdatascience.com/predict-customer-churn-with-r-9e62357d47b4

 
