Time Series Prediction of Daily Total Female Births in California – January, 1960

By Dr Gwinyai Nyakuengama

(3 October 2018)

 

KEY WORDS

ARFIMA; Time Series; Daily female births in California; Stata; R package – Prophet

 

ACKNOWLEDGEMENT

We gratefully acknowledge:

  • StataCorp; Survey Design and Analysis Services (https://surveydesign.com.au/); Becketti (2013); and the authors of the R package Prophet, for their core routines in Stata and R;
  • datamarket.com for their data; and
  • some anonymous colleagues.

Collectively, these parties not only inspired but underpinned this blog.

 

OBJECTIVE

To predict the daily total female births in California over the 30 days of January 1960.

 

METHOD

In this study:

  • Daily total female births in California reported for 1959 were accessed from datamarket.com.
  • Stata was used to test for stationarity in this time series data.
  • Stata was used to fit an Auto-Regressive Fractionally Integrated Moving Average
    (ARFIMA) model and to predict the daily female births for the month of January, 1960.
  • Also, the R package Prophet was used to fit a time-series model with additive seasonality, meaning the effect of the seasonality is added to the trend to obtain the forecast.

 

RESULTS

Cali_daily_birth_Slide3

This Stata plot of the daily female births in California for 1959 showed that the series has very high volatility.

This was suggestive of:

  • a non-stationary time series, and most importantly, the existence of a long-memory volatility in the series; and
  • an ARFIMA modelling solution to predict the daily female births for the month of January, 1960.

 

Cali_daily_birth_Slide4

These Stata auto-correlation and partial auto-correlation plots also suggested the presence of serial correlation in the female daily birth time series.

 

Cali_daily_birth_Slide5

Based on these Stata Dickey-Fuller test results, we failed to reject the null hypothesis of a random walk with a possible drift in the female daily births.

 

Cali_daily_birth_Slide6

In Stata, the commonly used criteria for choosing appropriate time-series lags are Schwarz’s Bayesian information criterion (SBIC), Akaike’s information criterion (AIC), the final prediction error (FPE) and the Hannan-Quinn information criterion (HQIC). It turns out that the AIC works well on monthly data (Ivanov and Kilian, 2001).

The above results from Stata’s vector autoregression selection-order (varsoc) command indicate that the second lag (ar2) was picked by most decision criteria (i.e. FPE, AIC and HQIC). However, a lag of 1 (ar1) was selected by the SBIC criterion. A quick cross-check of the lag choice in R is sketched below.
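For readers following along in R, here is a minimal, hedged sketch of the same lag-selection exercise using the vars package; the vector name births is an assumption, not part of the original analysis:

# a hedged R cross-check of lag selection; numeric vector 'births' assumed
# install.packages("vars")
library(vars)
lag_pick <- VARselect(births, lag.max = 10, type = "const")
lag_pick$selection   # lags chosen by AIC, HQ (HQIC), SC (SBIC) and FPE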

 

Cali_daily_birth_Slide7

The dfgls command (the Stata module to compute the Dickey-Fuller GLS unit-root test):

  • calculated the optimal lag length using a sequential t-test (Ng and Perron, 1995), the Schwert criterion (SC) and the “modified AIC” (MAIC) statistical criteria as 6, 1 and 7, respectively; and
  • controls for a linear time trend by default, unlike the Stata dfuller or pperron commands.

Based on these results:

  • we failed to reject the null hypothesis of a random walk with drift in the daily female birth series; and
  • the daily female births were accurately estimated, judging by the relatively low root-mean-square error (rmse) of around seven daily female births, considering the high volatility of this time series. (Hedged R equivalents of these unit-root tests are sketched below.)
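The following sketch shows hedged R counterparts of the unit-root tests above; the vector name births is again an assumption:

# hedged R unit-root checks; numeric vector 'births' assumed
# install.packages(c("tseries", "urca"))
library(tseries)
adf.test(births)   # augmented Dickey-Fuller test
library(urca)
summary(ur.ers(births, type = "DF-GLS", model = "trend", lag.max = 7))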

 

Cali_daily_birth_Slide8

The above Stata ARFIMA regression results suggested:

  • a significant model fit and, more importantly,
  • that d, the fractional-differencing parameter of the predicted series, reflected a significant fractionally integrated (long-memory) process with 0 < d < ½; and
  • that the L1.ar and L1.ma terms were both significant.

A comparable fit in R is sketched below.
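For readers who want to reproduce the fit outside Stata, here is a minimal, hedged R sketch using the forecast package. The series name births is an assumption, and the package selects the AR/MA orders itself, so it may not match the exact ar1/ma1 specification above:

# a hedged R analogue of the ARFIMA fit; 'births' is an assumed numeric vector
# install.packages("forecast")
library(forecast)
fit <- arfima(births)            # estimates d plus AR/MA terms via fracdiff
summary(fit)                     # check that 0 < d < 0.5 (long memory)
fc <- forecast(fit, h = 30, level = 90)   # 30-day forecast with 90% bands
plot(fc)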

 

Cali_daily_birth_Slide9

This Stata plot shows:

  • that the dynamic forecast (xb prediction) obtained using the ARFIMA model faithfully tracked the observed daily female births throughout 1959; and
  • the 30-day, daily female birth prediction for January 1960, with the 90 per cent prediction intervals around the mean.

Cali_daily_birth_Slide10

Just focusing on the 30-day prediction from the Stata ARFIMA model:

  • the predicted daily female births in January 1960 averaged around 43 births, contained within the 90% CI bands (see next figure); and
  • on average, this prediction had a root-mean-square error (rmse) of 7 daily female births (see next figure).

 

Cali_daily_birth_Slide11

The Stata ARFIMA model’s 30-day predictions for January 1960 show:

  • around 43 daily female births, with a
  • root-mean-square error (rmse) of around 7 daily female births.
  • Note that this figure agrees with the estimate shown earlier, which was obtained using the Stata dfgls command.

 

We also predicted the births using the R package Prophet, tuning the prediction intervals to 90%, the same as in Stata. A minimal sketch of the Prophet workflow follows.
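This sketch assumes the 1959 series sits in a data frame df with Prophet’s required columns ds (date) and y (daily births); interval.width sets the 90% uncertainty band:

# a minimal Prophet sketch; data frame 'df' with columns ds and y is assumed
# install.packages("prophet")
library(prophet)
m <- prophet(df, interval.width = 0.90)           # 90% intervals, as in Stata
future <- make_future_dataframe(m, periods = 30)  # extend 30 days into 1960
fcst <- predict(m, future)
plot(m, fcst)                                     # history plus forecast
prophet_plot_components(m, fcst)                  # trend and weekly seasonality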

Cali_daily_birth_Slide14

This plot from the R package – Prophet shows:

  • periodicity in the daily female births; and
  • presence of outliers – dots outside the shaded blue area (90% CI).

The average root-mean-square error (rmse) from R was also around seven daily female births (7.2, to be exact), computed as sketched below.
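For reference, the rmse quoted here is just the square root of the mean squared difference between observed and predicted births; a one-line R computation (the vector names are assumptions):

# rmse of the predictions; 'actual' and 'predicted' are assumed vectors
rmse <- sqrt(mean((actual - predicted)^2))   # about 7.2 in this study
rmse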

 

Cali_daily_birth_Slide15

Just focusing on the 30-day prediction in January 1960, these two plots from the R package – Prophet show:

  • strong weekly periodicity in the daily female births; with
  • peaks every Tuesday and Wednesday and troughs every Sunday.

 

CONCLUSION

  • The Stata ARFIMA model was an excellent fit of the highly volatile, daily female births in California for 1959.
  • On average, the Stata ARFIMA model predicted 43 daily female births (+/- seven births) for the month of January, 1960.
  • Pleasingly, both Stata and the R package – Prophet gave consistent and complementary results.
  • Additionally, the R package – Prophet picked up some strong weekly periodicity in the data – with most births occurring on Tuesdays and Wednesdays and the least births occurring on Sundays.

 

BIBLIOGRAPHY

Becketti S. (2013): Introduction to Time Series Using Stata 1st Edition, Stata Press https://www.amazon.com/Introduction-Time-Using-Stata-Becketti/dp/1597181323

Ivanov, V. and Kilian, L. 2001. ‘A Practitioner’s Guide to Lag-Order Selection for Vector Autoregressions’. CEPR Discussion Paper no. 2685. London, Centre for Economic Policy Research. http://www.cepr.org/pubs/dps/DP2685.asp

Prophet: https://facebook.github.io/prophet/docs/quick_start.html

Prophet R package: June 15, 2018 https://cran.r-project.org/web/packages/prophet/prophet.pdf

StataCorp 2013: Stata Time-Series Reference Manual https://www.stata.com/manuals/ts.pdf.

 

Most frequent words in the e-book “The Murray River”

By Dr Gwinyai Nyakuengama

(1 October 2018)

 

TITLE: Most frequent words in the e-book “The Murray River – Being a Journal of the Voyage of the Lady Augusta Steamer from the Goolwa, in South Australia, to Gannewarra, above Swan Hill, Victoria. A distance from the sea mouth of 1400 miles”

E-book Author: Arthur Kinloch

Release Date: August 1, 2018 [EBook #57618]

Language: English

ACKNOWLEDGEMENTS: Text was gratefully sourced from the Gutenberg e-book Project

KEYWORDS

R; Word-cloud; The Murray River

 

INTRODUCTION

Welcome to our R blog!

In this short blog, we will use R to visualize the most frequent words in an e-book.

 

PROCEDURE

We executed the following R code to create the word-clouds, first with and then without information on the Gutenberg Project and copyright information:

########### start of program #####################
# turn warnings off
options(warn=-1)

# install required R packages

#install.packages("tm")           # for text mining
#install.packages("SnowballC")    # for text stemming
#install.packages("wordcloud")    # word-cloud generator
#install.packages("RColorBrewer") # color palettes
#install.packages("tidyverse")    # tidyverse

# load the required R packages
require("tm")
require("SnowballC")
require("wordcloud")
require("RColorBrewer")
require("tidyverse")

# load the text
filePath <- "http://www.gutenberg.org/files/57618/57618-0.txt"
text <- readLines(filePath)

# load text as a corpus
docs <- Corpus(VectorSource(text))

#inspect(docs)

# transform text: replace special characters with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# clean up text
# convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# remove numbers
docs <- tm_map(docs, removeNumbers)
# remove English common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# remove own stop-words
# specify own stop-words as a character vector
docs <- tm_map(docs, removeWords, c("project", "gutenbergtm", "gutenberg", "one"))
# remove punctuation
docs <- tm_map(docs, removePunctuation)
# eliminate extra white space
docs <- tm_map(docs, stripWhitespace)

# create a term-document matrix and sort words by frequency
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)

# create word-cloud
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

########### end of program #####################

 

RESULTS

The most frequent words in The Murray River e-book, original text

The Murray River - Being a Journal of the Voyage of the Lady Augusta Steamer

 

The most frequent words in The Murray River e-book, minus the Gutenberg Project and copyright information

modified text

Interpretation

The most frequent words in this e-book concern the Murray River in South Australia. Words about the excellence of steam navigation, the waters, the considerable travel distance (miles) and time, the countryside, stations, towns and the goods transported on the Lady Augusta steamer (e.g. sheep and wool) are also abundant in the e-book.

We could further remove more ‘filler’ words (such as even, still, also, ever and within) from the word-cloud, as sketched below.
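A one-line version of that clean-up, reusing the docs corpus from the program above before re-running the TermDocumentMatrix and wordcloud steps:

# remove additional filler words before regenerating the word-cloud
docs <- tm_map(docs, removeWords, c("even", "still", "also", "ever", "within"))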

 

 

 

USE OF RapidMiner – Auto Model TO PREDICT CUSTOMER CHURN

By Dr Gwinyai Nyakuengama

(28 July 2018)

 

KEY WORDS

Customer Churn; RapidMiner Auto Model; Stata; Machine Learning Models; Naive Bayes; Generalized Linear Model (GLM); Logistic Regression; Deep Learning; Random Forest; Gradient Boosted Trees (XGBoost); Model performance; Receiver Operator Curve (ROC);  Confusion Matrix; Accuracy; Specificity; Sensitivity.

 

INTRODUCTION

Previously, we studied customer churn using logistic and survival techniques in Stata (see Nyakuengama 2018 a, b).

The RapidMiner Auto Model is a state-of-the-art machine learning (ML) tool that:

  • is easy to use, via pull-down and point-and-click menus;
  • allows the user to simultaneously fit several ML models; and
  • allows immediate optimization and operationalization of the best ML models.

In this blog we seek to explore the business merits of the RapidMiner Auto Model for use as a fast and reliable tool-of-choice to predict customer churn.

 

METHOD

In this study we:

  • launched the RapidMiner Auto Model Studio (version 8.2);
  • loaded-up the same customer churn data from our previous blog on logistic regression (see Nyakuengama 2018 b);
  • selected churn as the target variable and the same explanatory variables, namely SEX, SENIORCITIZEN, PARTNERED, DEPENDENT, MULTIPLELINES, CONTRACT, PAPERLESS and TENURE_GROUPS;
  • ran several machine learning models: Naive Bayes (NB); Generalized Linear Model (GLM); Logistic Regression (LR); Deep Learning (DL); Decision Tree (DT); Random Forest (RF) and Gradient Boosted Trees (XGBoost); and
  • evaluated how well the ML models predicted customer churn.

 

MODEL PERFORMANCE PARAMETERS

 

Terms and definitions in the table below from Saito and Rehmsmeier (2015) are used in assessing machine learning performance.

matric.png 

RESULTS

 

RUN-TIMES (ms)

Model run-time represents the time taken to build, test and validate a model and to issue performance parameters.

RapidMiner_6sruntime.png

INTERPRETATION

The above run-times from our RapidMiner Auto Model experiments suggest that the XGBoost model took far longer than the other models. The NB, GLM, LR and DT models had comparable and much shorter run-times. Those of the DL and RF models were slightly longer than for these last four models.

 

ROC COMPARISONS

The Receiver Operating Characteristic (ROC) is a plot with the false positive rate (FPR), or 1 − Specificity, on the X-axis and the true positive rate (TPR), or Sensitivity, on the Y-axis.

When choosing between models in machine learning, the model with the largest area under the Receiver Operating Characteristic curve (AUROC) is preferred. The AUROC is defined as the probability that a randomly selected positive sample will have a higher prediction value than a randomly selected negative sample (Lan, 2017). A hedged R illustration follows.
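For illustration only, here is a hedged R sketch of computing an AUROC with the pROC package; the label vector churn and score vector pred are assumptions, not RapidMiner outputs:

# a hedged AUROC illustration; 'churn' (0/1 labels) and 'pred' (scores) assumed
# install.packages("pROC")
library(pROC)
roc_obj <- roc(churn, pred)   # builds the ROC curve
auc(roc_obj)                  # area under the ROC curve
plot(roc_obj)                 # sensitivity against specificity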

RapidMiner_ROC_Compared

AUROC

RapidMiner_99AUC

INTERPRETATION

In the above ROC plot and the average AUROC values from our current RapidMiner Auto Model experiments, we see that the NB, GLM, LR, DL and XGBoost models had comparable AUROCs. The DT and RF models performed the worst, judging by this criterion.

 

ACCURACY

RapidMiner_7accuracy.png

INTERPRETATION

The current GLM and LR models had superior accuracy rates (78%). The NB and XGBoost model performances were within two percentage points, but lower. The accuracy rates of the DT and RF models were the lowest (73%).

 

CLASSIFICATION ERROR

RapidMiner_8classification error.png

INTERPRETATION

The GLM and LR models had the lowest classification error rates (22%), while the DT and RF models performed worst on this measure (27%). These results are simply the complement of the accuracy results above (classification error = 1 − accuracy).

 

PRECISION

RapidMiner_9Precision

INTERPRETATION

The Naive Bayes model had the highest precision rate, while the DT and RF models had the lowest.

 

 

RECALL

RapidMiner_10Recall

INTERPRETATION

Note that recall is the same as sensitivity.

Recall was 100% for both the DT and RF models; it is well known that tree-based models tend to overfit. Otherwise, the DL and XGBoost models achieved higher recall rates than the rest of the models.

 

SENSITIVITY

RapidMiner_11Sensitivity

INTERPRETATION

Note that sensitivity is the same as recall and was 100% for both the DT and RF models; it is well known that tree-based models tend to overfit. Otherwise, the DL and XGBoost models achieved higher sensitivity than the rest of the models.

 

SPECIFICITY

RapidMiner_12Specificity

INTERPRETATION

Note again that specificity was 0% for both the DT and RF models (the mirror image of their sensitivity). Among the remaining models, NB had the highest specificity.

 

F MEASURE

The F1-score is the weighted harmonic mean of precision and recall, where the best F1-score is 1 and the worst is 0. A small worked example follows.
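To make the definitions concrete, here is a short R sketch computing the measures from a 2x2 confusion matrix; the counts below are hypothetical, not taken from this experiment:

# performance measures from a confusion matrix; counts below are hypothetical
TP <- 950; FP <- 280; FN <- 160; TN <- 370
accuracy    <- (TP + TN) / (TP + FP + FN + TN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)               # identical to sensitivity
specificity <- TN / (TN + FP)
f1 <- 2 * precision * recall / (precision + recall)   # harmonic mean
round(c(accuracy, precision, recall, specificity, f1), 3)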

f measure

INTERPRETATION

The F measures of the GLM and LR models (86%) were marginally better than those of the other models (85%).

 

FEATURE IMPORTANCE (WEIGHTS)

RapidMiner_FeatureImportance

INTERPRETATION

The above results show that the important features for explaining customer churn were as follows: having a short contract, a short tenure group or duration, paperless transactions, having dependents and being a senior citizen.

 

DISCUSSION 

Previous studies using the same customer churn dataset reported results similar to the current blog for model accuracy, feature importance and other key model-performance parameters for logistic regression (see Nyakuengama (2018 b) using Stata, and Li (2017) and Treselle Engineering (2018), both using the R programming language). In addition, Li (2017) employed a number of ML techniques similar to those reported in this blog.

The decision on which ML model best predicts customer churn is very much business-model dependent. We simply note that from the current RapidMiner Auto Model experiments:

  • The Naive Bayes model would be preferred over tree-based models if precision is of paramount importance to the business.
  • The Logistic Regression and GLM models would be preferred if accuracy and the F measure are the key business targets.
  • The XGBoost model would be overlooked if short run-time is a key business consideration.
  • While tree-based methods are the easiest to understand, they tended to overfit; a perfect recall rate (100%) sounds a bit too optimistic. Nonetheless, these models are attractive because, being non-parametric, they are not sensitive to statistical assumptions (e.g. independence and non-correlation among predictor variables). Indeed, they can be used on large datasets, where they may outshine other models.

Lastly, we note that any data issues due to class imbalance among predictor variables or internal frequency distributions were not addressed in our study. Different models have strengths and shortcomings which must be fully understood before committing to any single ML model.

 

CONCLUSIONS 

This blog has shown that the RapidMiner Auto Model:

  • is a data-analytical tool that is easy to learn and use, with pull-down, point-and-click menus and well-designed process-flow windows;
  • yields reliable results (e.g. on customer churn, similar to those previously reported on the same dataset using Stata and the R programming language); and
  • quickly yields serious models which can be extensively evaluated, all without any programming.

 

RapidMiner Auto Model also:

  • can model specialized market niches or business segments through data clustering; and
  • has a Simulator that can be used to model “what if” business scenarios (by drilling down and manually fine-tuning hyper-parameters to achieve desired outcomes).

 

BIBLIOGRAPHY

H. Lan (2017): Decision Trees and Random Forests for Classification and Regression pt.1 – https://towardsdatascience.com/decision-trees-and-random-forests-for-classification-and-regression-pt-1-dbb65a458df

J.G. Nyakuengama (2018 a): Survival Data Analysis And Visualization In Stata – Part 1 – https://dat-analytics.net/2018/07/22/survival-data-analysis-and-visualization-in-stata-part-1/

J.G. Nyakuengama (2018 b): Predictive Data Analysis And Visualization In Stata – Part 1: Logistic Regression – https://dat-analytics.net/2018/07/25/predictive-data-analysis-and-visualization-in-stata-part-1-logistic-regression/

RapidMiner Studio: https://rapidminer.com/products/studio/

Saito T.; Rehmsmeier M. (2015): The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets – https://doi.org/10.1371/journal.pone.0118432

L. Oldja (2018): Survival Analysis to Explore Customer Churn in Python https://towardsdatascience.com/survival-analysis-in-python-a-model-for-customer-churn-e737c5242822

Treselle Engineering (2018): Customer Churn – Logistic Regression with R http://www.treselle.com/blog/customer-churn-logistic-regression-with-r/

S. Li (2017): Predict Customer Churn with R https://towardsdatascience.com/predict-customer-churn-with-r-9e62357d47b4

PREDICTIVE DATA ANALYSIS AND VISUALIZATION IN STATA – PART 1: LOGISTIC REGRESSION

By Dr Gwinyai Nyakuengama

(25 July 2018)

 

INTRODUCTION

Welcome to our Stata blog!

The point of this blog is to have fun and to showcase Stata’s powerful capabilities for logistic regression and data visualization.

 

 

KEY WORDS

Stata;  Logistic Regression; Modelling; Receiver Operator Curve (ROC); Specificity; Sensitivity; Customer Churn; Model performance matrix; Cross-validation; Accuracy.

 

Research questions

In the last blog, we presented Survival Data Analysis models in Stata for studying time-to-event outcomes in telco customers, namely churning.

In this blog, we will continue to take advantage of Stata’s expansive data analysis and visualization capabilities to further study customer characteristics and service history as determinants of churning.

We will attempt to answer the following operational business questions:

  • What are the key determinants of service churning, from a customer’s perspective?
  • Can the relative importance of those determinants be ranked?
  • How reliably can these factors be estimated? How stable are they? What are the shortfalls of such approaches?

 

METHOD

In this blog, we used the same dataset previously described in the last blog on Survival Data Analysis in Stata as follows.

We imported a csv file into Stata version 15, as described before. We built a logistic regression model with the response variable, churning, presented as a binary yes/no variable, tested model performance and reported the results. We also fitted a cross-validated logistic regression model, using half of the dataset to train the model and the other half to test it.

 

RESULTS

Fit a high level regression model

Stata command: logistic b_churn  SEX SENIORCITIZEN PARTNERED DEPENDENT MULTIPLELINES CONTRACT PAPERLESS TENURE_GROUPS, nolog

T1.jpg

 

Interpretation

Results suggest that:

  • the fitted regression model was statistically significant, judging by Prob > chi2 = 0.000;
  • all predictor variables, except sex and partnered, were highly significant in determining the risk of churn;
  • statistically significant odds ratios greater than one suggest that:
    • customers using paperless transactions were about twice as likely to churn as those using paper-based transactions;
    • customers with multiple lines were also nearly twice as likely to churn as those with single lines;
    • senior citizens were 1.6 times more likely to churn than non-senior citizens; and
  • statistically significant odds ratios less than one suggest that:
    • customers with longer tenures, longer contracts and dependents were less likely to churn.

In the main, these results mirror those reported previously for this dataset by Li (2017) and Treselle Engineering (2018) from a logistic regression model in the R programming language. A minimal R sketch of the comparable model is given below.
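For readers following along in R, this hedged sketch fits the same specification; the data frame name churn_df is an assumption, and the variable codings must match ours:

# a minimal R logistic regression sketch; data frame 'churn_df' is assumed
fit <- glm(b_churn ~ SEX + SENIORCITIZEN + PARTNERED + DEPENDENT +
             MULTIPLELINES + CONTRACT + PAPERLESS + TENURE_GROUPS,
           family = binomial(link = "logit"), data = churn_df)
exp(cbind(OR = coef(fit), confint(fit)))   # odds ratios with profile CIs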

 

Test goodness of fit of the model

Stata command: lfit, group(10) table

T2.jpg

 

Interpretation

Results suggest that the fitted model was a good fit, judging by the non-significant Prob > chi2 statistic.

 

Fit a detailed regression model

Stata command: logistic b_churn i.SEX i.SENIORCITIZEN i.PARTNERED i.DEPENDENT i.MULTIPLELINES i.CONTRACT i.PAPERLESS i.TENURE_GROUPS , nolog

T4.jpg

Interpretation

Results:

  • confirm those described above; additionally, we now see the relative magnitudes of the odds ratios for the levels of each predictor variable; and
  • show that, compared to:
    • month-to-month contracts, the risk of churn decreased the longer the contract; and
    • the 0-12 month tenure group, the tendency to churn decreased the longer the tenure.

 

Test goodness of fit of the model

Stata command: lfit, group(10) table

T5.jpg

Interpretation

Results suggest that the fitted model was a good fit, judging by the non-significant Prob > chi2 statistic.

 

Test multicollinearity

Stata command: collin b_churn SEX SENIORCITIZEN PARTNERED DEPENDENT MULTIPLELINES CONTRACT PAPERLESS TENURE_GROUPS

Variable        VIF   SQRT VIF   Tolerance   R-Squared
b_churn        1.27       1.13      0.7888      0.2112
SEX            1.00       1.00      0.9994      0.0006
SENIORCITIZEN  1.13       1.06      0.8887      0.1113
PARTNERED      1.45       1.20      0.6912      0.3088
DEPENDENTS     1.37       1.17      0.7292      0.2708
MULTIPLELINES  1.20       1.10      0.8334      0.1666
CONTRACT       2.02       1.42      0.4942      0.5058
PAPERLESS      1.11       1.06      0.8982      0.1018
TENURE_GROUPS  2.20       1.48      0.4542      0.5458
Mean VIF       1.42

      Eigenval   Cond Index
  1     7.2918       1.0000
  2     0.9687       2.7436
  3     0.7112       3.2021
  4     0.5048       3.8005
  5     0.1789       6.3843
  6     0.1157       7.9376
  7     0.0915       8.9248
  8     0.0672      10.4135
  9     0.0493      12.1597
 10     0.0207      18.7730
Condition Number: 18.773
Eigenvalues & Cond Index computed from scaled raw sscp (with intercept)
Det(correlation matrix): 0.2128

Interpretation

Results do not suggest serious multicollinearity issues, since the mean and individual Variance Inflation Factors (VIF) are well below 4.

 

Note:

A VIF of 1 means that there is no correlation between the kth predictor and the remaining predictor variables, and hence the variance of its coefficient, bk, is not inflated at all. The same check can be done in R, as sketched below.
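The equivalent check in R, reusing the hypothetical glm fit from the earlier sketch, can be done with the car package:

# multicollinearity check in R; reuses the assumed 'fit' from the glm sketch
# install.packages("car")
library(car)
vif(fit)   # mean and individual values well below 4 suggest no serious issue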

 

Model performance assessment

 

Receiver Operator Curve (ROC)

Stata command:
predict double xb, xb /// roctab b_churn xb

T7

Interpretation

Results suggest that the fitted logistic model correctly classified churning/non-churning cases with an overall accuracy of 78%. While statistical methods are usually not directly comparable between studies, this current result closely mirrors those previously reported for this dataset by Li (2017) and Treselle Engineering (2018). These scholars used the R programming language to fit a logistic regression.

 

Notes:

True-positive rate is also known as Sensitivity, recall or probability of detection.

True-negative rate is also known as Specificity. It measures the proportion of actual negatives that are correctly identified.

 

Sensitivity / Specificity analysis vs Probability cut-off

Stata command: lsens

Slide8-1.jpg

 

Notes:

The probability cut-off point determines the sensitivity (fraction of true positives to all with churning) and specificity (fraction of true negatives to all without churning).

 

 

Receiver Operator Curve analysis

Stata command:

roctab b_churn xb /// roctab b_churn xb , graph // with graph

T8

Slide9

Interpretation

The above results suggest that our logistic regression model was good at picking out churners, judging by its area under the ROC curve of 81%.

Statistics around the ROC estimate are shown in the accompanying table, above.

 

Notes:

The Receiver Operator Curve (ROC) is a graphical plot that illustrates the diagnostic ability of a binary classifier system, in our case the logistic regression, as its discrimination threshold is varied.

 

Predictive margins

This section shows the predictive-margin statistics and plots for the predictor variables used in our logistic regression model. Most importantly, we use the margins to get the predicted probabilities of customers churning on account of the predictor variables. A base-R sketch of the idea follows.
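A hedged base-R version of the same idea, reusing the hypothetical fit and churn_df from earlier and assuming SENIORCITIZEN is coded Yes/No: set everyone to one level, average the predicted probabilities, then repeat for the other level:

# predictive margins by hand; 'fit' and 'churn_df' assumed, Yes/No coding assumed
df_yes <- churn_df; df_yes$SENIORCITIZEN <- "Yes"
df_no  <- churn_df; df_no$SENIORCITIZEN  <- "No"
mean(predict(fit, newdata = df_yes, type = "response"))  # margin if all seniors
mean(predict(fit, newdata = df_no,  type = "response"))  # margin if none are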

Stata command: margins SENIORCITIZEN /// marginsplot

T11

Slide16

Interpretation

Results suggest that if the distribution of churning remained the same in the population, but no one was a senior citizen, we would expect about 25% to churn. If everyone were a senior citizen; 33% – which effectively means senior citizens were more likely to churn.

 

 

Stata command: margins SEX ///
marginsplot, xdimension(SEX)

T444

Slide14

Interpretation

Results suggest that if the distribution of churning remained the same in the population, but everyone was female, we would expect about 27% to churn. If everyone were male; 26% – which effectively means no gender effect on probability to churn.

 

 

Stata command: margins PARTNERED ///
marginsplot, xdimension(PARTNERED)

T12

Slide18

Interpretation

Results suggest that if the distribution of churning remained the same in the population, but everyone was not partnered, we would expect about 26% to churn. If everyone were partnered; 27% – which effectively means no partner effect on probability to churn.

 

Stata command: margins PAPERLESS ///
marginsplot, xdimension(PAPERLESS)

T16

Slide26.jpg

Interpretation

Results suggest that if the distribution of churning remained the same in the population, but everyone was not on paperless plan, we would expect about 20% to churn. If everyone were on a paperless plan; 30% – which effectively means more would churn if on a paperless plan.

 

Stata command: margins DEPENDENTS ///
marginsplot, xdimension(DEPENDENTS)

T13

Slide20

Interpretation

Results suggest that if the distribution of churning remained the same in the population, but everyone had no dependents, we would expect about 28% to churn. If everyone had dependents; 23% – which effectively means fewer would churn if they had dependents.

 

Stata command: margins MULTIPLELINES ///
marginsplot, xdimension(MULTIPLELINES)

T14

Slide22

Interpretation

Results suggest that if the distribution of churning remained the same in the population, but no one had multiple lines, we would expect about 23% to churn. If everyone had multiple lines; 32% – which effectively means more would churn if they had multiple lines.

 

Stata command: margins TENURE_GROUPS ///
marginsplot, xdimension(TENURE_GROUPS)

T17

Slide28

Interpretation

Results suggest that if the distribution of churning remained the same in the population, but everyone were on short tenures (0-12 months), we would expect about 38% to churn. With longer and longer tenures, the propensity to churn would progressively decrease – down to 15% in customers with tenures longer than 60 months.

 

Cross-validated model

crossvalided-ROC.png

Interpretation

Results from a cross-validated logistic regression model were similar to those of the full model (ROC = 81%). No further analysis was required.

 

Notes:

Cross validation was performed using a user-written Stata do file called CrossVal (see https://github.com/MIT-LCP/aline-mimic-ii/blob/master/Data_Analysis/STATA/crossval.ado ).

 

CONCLUSIONS

In this blog, we conclude that:

  • Overall, the key determinants of customer service churning were tenure group, paperless billing, multiple-line plans, contract type, senior-citizen status and having dependents. Gender and partnership status had no influence on the likelihood to churn in this study.
  • To echo the words of Nyakuengama (2017), “Stata is a state-of-the-art tool-of-choice which facilitates timely, efficient, accurate and trusted evidence-based decision making”:
    • Not only is Stata syntax consistent and simple to use for performing logistic regressions;
    • Stata is methodologically rigorous and is backed up by model validation and post-estimation tests.
    • The current logistic regression results from Stata were reliable – an accuracy of 78% and an area under the ROC of 81%.
  • Results from this blog closely matched those reported by Li (2017) and Treselle Engineering (2018), who separately used R programming to study churning in the same dataset used here.

 

FUTURE BLOGS

In this short blog, we had fun and demonstrated the benefits of using Stata to undertake rigorous logistic regression and, more importantly, provided further insights into customer churning.

Nonetheless, further insights may be obtainable when the structure and order within the dataset are also considered. There seems to be a logical hierarchy and/or sub-grouping of personal customer characteristics, access types, service types and payment types. There may even be interactions between these.

In our future blogs we will try to investigate these issues using more sophisticated and advanced regression techniques now available in Stata version 15.

 

BIBLIOGRAPHY
J.G. Nyakuengama (2017): Stata A Key Strategic Statistical tool-of-choice in major impact evaluations of socioeconomic programs. 2017 Oceania Stata Users Group Meeting  https://www.stata.com/meeting/oceania17/slides/oceania17_Nyakuengama.pdf 

L. Oldja (2018): Survival Analysis to Explore Customer Churn in Python https://towardsdatascience.com/survival-analysis-in-python-a-model-for-customer-churn-e737c5242822

Treselle Engineering (2018): Customer Churn – Logistic Regression with R http://www.treselle.com/blog/customer-churn-logistic-regression-with-r/

S. Li (2017): Predict Customer Churn with R https://towardsdatascience.com/predict-customer-churn-with-r-9e62357d47b4

 

SURVIVAL DATA ANALYSIS AND VISUALIZATION IN STATA – PART 1

 

By Dr Gwinyai Nyakuengama

(21 July 2018)

 

KEYWORDS

Stata; Survival Data Analysis; Kaplan-Meier; Cox Proportional Hazard Regression; Nelson-Aalen; Life table; Churn

 

INTRODUCTION

Welcome to our Stata blog!

The point of this blog is to have fun and to showcase the powerful Stata capabilities for survival data analysis and data visualization.

What is survival data analysis?

  • Survival data analysis is widely used where the time until an event is of interest. The data are often censored or truncated, which, among other things, precludes the use of OLS for survival data analysis.

How is this related to customer churning?

  • Customer churning is when a customer’s service ceases.
  • The point at which the customer churns is the event of interest in survival data analysis; a minimal R analogue of this setup is sketched below.
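For comparison with the Stata workflow, a hedged Kaplan-Meier sketch with the R survival package; the data frame churn_df with a tenure column and a 0/1 churner flag is an assumption:

# a minimal Kaplan-Meier sketch in R; 'churn_df' with tenure/churner assumed
# install.packages("survival")
library(survival)
km <- survfit(Surv(tenure, churner) ~ 1, data = churn_df)
plot(km, xlab = "Months of tenure", ylab = "Survival probability")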

 

RESEARCH QUESTIONS

In this blog we answer the following questions;

  • What customer information is useful to determine the likelihood of a customer to churn?
  • Can Stata be reliably used for survival data analysis and visualization of customer churning data?

 

METHOD

We examined customer service histories to understand customer churning patterns. We note that the same data were used previously to study customer churning by other scholars using the R or Python programming languages, namely:

  • Li (2017);
  • Oldja (2018);
  • rdata.lu Blog | Data science with R (2018); and
  • Treselle Engineering (2018).

 

For this blog, we sourced data from the survival data study by Oldja (2018).  We are sincerely grateful to WatsonAnalytics team and Dr Lauren Oldja for her Python source code which we ran (see https://github.com/loldja/loldja.github.io/blob/master/assets/code/blog/Kaplan%20Meier%20demo.ipynb).

In our case, we exported the resulting dataset as a csv file for use in Stata. The Stata do file at the end of this blog is about the csv data importation, data cleansing, data exploration and survival data analysis.

 

About the data

The dataset includes information about:

  • customers who left within the last month – the column is called Churn;
  • services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies;
  • customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges; and
  • demographic info about customers – gender, age range, and whether they have partners and dependents.

To source the original customer dataset, follow these instructions:

  • Go to https://community.watsonanalytics.com/resources/
  • Download the Telco Customer Churn sample data file.
  • In Watson Analytics, tap Add and upload Telco Customer Churn.
    The filename is a bit longer: WA_Fn-UseC_-Telco-Customer-Churn.csv.

We now present the key survival data analysis results below.

 

RESULTS
Declare survival data

Stata command: stset tenure, failure(churner)

Table 1 - stset tenure - failure

Interpretation

This shows:

  • 7,043 subjects originally, of which 7,032 were used in the analysis.
  • 1,869 subjects churned during the analysis time.
  • 227,990 months made up the total analysis time.

 

 

Summarize survival dataset

Stata command: stsum

Table-3-stdescribe.png

 

Interpretation

This shows:

  • 7,032 subjects, with an incidence rate of churning of 0.008 (i.e. 1,869 churners out of the total).

 

Describe survival dataset

Stata command: stdescribe

Table-2-st-sum.png

 

 

Interpretation

This again shows the characteristics of the survival dataset:

  • 7032 subjects with 1,869 churners.
  • Statistics per subject are also shown.

 

List cases

Stata command: sts list in 1/10

Table 4 - st list

 

Interpretation

This shows results of the first 10 customers – by analysis time (Time):

  • Numbers at the start of each stage (Beg. Total)
  • Churners (Fail)
  • Censored customers (Net Lost)
  • Kaplan-Meier function (Survivor Function)
  • The 95% Confidence Intervals

 

 

EXPLORING THE SURVIVAL DATA: UNIVARIATE ANALYSES
Test for equality of survivor functions

Stata command: sts test b_multiplelines, logrank

Table 5 - sts test - multiplelines

 

Interpretation

We can see significant differences in the survivor functions of multiple line subclasses.

 

Stata command: sts graph, by(Multiplelines) ci risktable ytitle(Survival probabilities)

Fig 1 - sts graph by multiplelines

 

Interpretation

Among customers with only one phone line, 1 in 4 had churned by month 25. By comparison, among those with multiple phone lines, 1 in 4 churned by month 43 – a difference of 18 months (an extra 1.5 years of revenue!), as reported by Dr Lauren Oldja.

We also see the counts of customers at risk of churning by analysis times, as well as the 95 % Confidence Intervals of the survival probabilities.

 

Stata command: sts test PAYMENTTYPE, logrank

Table 6 - sts test - Paymenttype

 

Interpretation

We can see significant differences in the survivor functions of account payment types or subclasses.

 

Stata command: sts graph, by(PAYMENTTYPE) ytitle(Survival probabilities) legend(position(6) rows(4))

Fig 2 - sts graph by paymenttypes

 

Interpretation

We see that customers using the check payment types (particularly electronic) were the most at risk of churning.

 

Stata command: sts test DEPENDENTS, logrank

Table-7-sts-test-dependents-3691661728-1532502388832.png

 

 

Interpretation

We can see significant differences in the survivor functions of dependent subclasses.

 

Stata command: sts graph, by(DEPENDENTS) ci risktable ytitle(Survival probabilities) legend(position(6) rows(4))

Fig 3 - sts graph by dependents

 

Interpretation

One in 4 customers without dependents had churned by 20 months. One had to wait over 60 months to observe a similar churning percentage among those with dependents!

We also see the counts of customers at risk of churning by analysis times, as well as the 95 % Confidence Intervals of the survival probabilities.

 

SEMI-PARAMETRIC MODEL BUILDING
Fit a Cox Proportional Hazard Regression Model

Stata command: stcox i.SEX i.Multiplelines i.PAYMENTTYPE   i.PARTNER i.DEPENDENTS i.CONTRACT i.SENIORCITIZEN

Table-8-stcox-ph-regression-results.png

 

Interpretation

Results suggest that the likelihood to churn, as indicated by the hazard ratios:

  • is the same between females and males;
  • is lower in multiple-line than single-line holders;
  • is higher for electronic-check and mailed-check customers and lower for credit-transfer customers, compared to bank-transfer customers;
  • is lower in partnered customers than in singles;
  • is lower if a customer has dependents;
  • decreases with increasing contract duration; and
  • is the same, regardless of senior-citizen status. (A comparable Cox model in R is sketched below.)
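A hedged R analogue of the Stata Cox model, under the same assumptions as the earlier Kaplan-Meier sketch (churn_df and the factor codings are assumptions):

# a hedged R analogue of the Stata Cox regression; 'churn_df' assumed
library(survival)
cox <- coxph(Surv(tenure, churner) ~ SEX + Multiplelines + PAYMENTTYPE +
               PARTNER + DEPENDENTS + CONTRACT + SENIORCITIZEN,
             data = churn_df)
summary(cox)   # hazard ratios reported as exp(coef)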

 

CONCLUSION
In this short blog we:

(a)    presented exploratory Stata results from a survival data analysis of customer churning;

(b)   presented statistical results and visualizations in Stata from a semi-parametric survival data model, Cox Proportional Hazard Regression;

(c)    were able not only to replicate in Stata most of the results previously reported by other scholars using the Python and R open-source programming languages, but also to provide additional insights on churning customers from the dataset – mainly from a survival data analysis point of view; and

(d)   showed how easy it is to produce insightful data visualizations in Stata, with minimum code changes.

 

In our next blog we will build on this story and present results from a parametric survival analysis in Stata, complementary to the above-mentioned studies in R and Python programming languages.

 

Feel free to like us below and follow us on Twitter: @AnalyticsDat  https://twitter.com/AnalyticsDat

If you need Stata assistance, please do not hesitate to contact us by email: DatAnalytics@iinet.com.au

 

For Stata software purchases please contact:

 

 

BIBLIOGRAPHY
L. Oldja (2018): Survival Analysis to Explore Customer Churn in Python https://towardsdatascience.com/survival-analysis-in-python-a-model-for-customer-churn-e737c5242822

Treselle Engineering (2018): Customer Churn – Logistic Regression with R http://www.treselle.com/blog/customer-churn-logistic-regression-with-r/

S. Li (2017): Predict Customer Churn with R https://towardsdatascience.com/predict-customer-churn-with-r-9e62357d47b4

rdata.lu Blog | Data science with R (2018): Churn Analysis – Part 1: Model Selection https://www.r-bloggers.com/churn-analysis-part-1-model-selection/

STATA do file
* Purpose:       Visualized Stata Survival Analysis                        

*Author:          Dr Gwinyai Nyakuengama                                                                 

*Date:              21 July 2018                                                                                              

*Web page :    DatAnalytics; https://dat-analytics.net/      

*Version:         1                                                                                                                                            

*Import data

clear all

set mat 11000

set more off

import delimited "C:\Users\MyStudio\Desktop\python Kaplan Meier demo\Churners_data_for_Stata.csv", clear

save "C:\Users\MyStudio\Desktop\python Kaplan Meier demo\Churners_data_for_Stata.dta", replace

 

* Data cleaning

             * name change

                          rename b_churn churner 

             * encoding variables

                          encode gender, gen(SEX)

                          encode partner, gen(PARTNER)

                          encode dependents, gen(DEPENDENTS)

                          encode paymentm, gen(PAYMENTTYPE)

                          encode contract, gen(CONTRACT)

             * recode Multiplelines

             * First we define those labels

label define plines 1 "Yes" 0 "No" // defines this "label"

             * Then we attach the value label 

                          label values b_multiplelines plines // apply the label

                           rename b_multiplelines Multiplelines

             * Tabulate results

                          tab1 Multiplelines  // table with label

                          tab1 Multiplelines , nolabel // table without the label

             * recode SENIORCITIZEN

             * first we define those labels

label define seniorz 1 "Yes" 0 "No" // defines this "label"

             * Then we attach the value label 

                          label values seniorcitizen seniorz // apply the label

                           rename seniorcitizen SENIORCITIZEN

             * Tabulate results

tab1 SENIORCITIZEN // table with label

tab1 SENIORCITIZEN , nolabel // table without the label

 * declare survival dataset

stset tenure, failure(churner)

* summarize & describe survival dataset

stdescribe

stsum

         

ds // variable name

sts list in 1/10

*fit Cox PH regression

stcox i.SEX i.Multiplelines i.PAYMENTTYPE   i.PARTNER i.DEPENDENTS i.CONTRACT i.SENIORCITIZEN

* do log-rank tests and sts graphs

sts test Multiplelines, logrank

* graph KM

sts graph, by(Multiplelines) ci risktable ytitle(Survival probabilities)

sts test PAYMENTTYPE, logrank

* graph KM

sts graph, by(PAYMENTTYPE) ytitle(Survival probabilities) legend(position(6) rows(6))

sts test PARTNER, logrank

* graph KM

sts graph, by(DEPENDENTS) ci risktable ytitle(Survival probabilities) legend(position(6) rows(4))