By Dr Gwinyai Nyakuengama
(21 July 2018)
KEYWORDS
Stata; Survival Data Analysis; Kaplan-Meier; Cox Proportional Hazard Regression; Nelson-Aalen; Life table; Churn
INTRODUCTION
Welcome to our Stata blog!
The point of this blog job is to have fun and to showcase the powerful Stata capabilities for survival data analysis and data visualization.
What is survival data analysis?
- Survival data analysis is widely used in which the time until the event is of interest. Data is often censored or truncated. This, among other things, precludes the use of OLS from survival data analysis.
How is this related to customer churning?
- Customer churning is when the customer service ceases.
- When the customer service churns is the event of interest in survival data analysis.
RESEARCH QUESTIONS
In this blog we answer the following questions;
- What customer information is useful to determine the likelihood of a customer to churn?
- Can be Stata reliably used for survival data analysis and visualization of customer churning data?
METHOD
We examined customer service history to understand customer churning partners. We note that the same data was used previously to study customer churning by other scholars using R or Python programming languages, namely:
- Li (2017);
- Oldja (2018);
- rdata.lu Blog | Data science with R (2018); and
- Treselle Engineering (2018).
For this blog, we sourced data from the survival data study by Oldja (2018). We are sincerely grateful to WatsonAnalytics team and Dr Lauren Oldja for her Python source code which we ran (see https://github.com/loldja/loldja.github.io/blob/master/assets/code/blog/Kaplan%20Meier%20demo.ipynb).
In our case, we exported the resulting dataset as a csv file for use in Stata. The Stata do file at the end of this blog is about the csv data importation, data cleansing, data exploration and survival data analysis.
About the data
The data set includes information about:
Customers who left within the last month – the column is called Churn
Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
Demographic info about customers – gender, age range, and if they have partners and dependents
To source the original customer dataset follow these instructions:
- Go to https://community.watsonanalytics.com/resources/
- Download the Telco Customer Churn sample data file.
- In Watson Analytics, tap Add and upload Telco Customer Churn.
The filename is a bit longer: WA_Fn-UseC_-Telco-Customer-Churn.csv.
We now present the key survival data analysis results below.
RESULTS
Declare survival data
Stata command: stset tenure, failure(churner)
Interpretation
This shows:
- 7,043 subjects originally but only 7,032 subjects were used in the analysis.
- 1,869 subjects churned during the analysis time
- 227,990 months in made up the total analysis time.
Summarize survival dataset
Stata command: stsum
Interpretation
This shows:
- 7,032 subjects with an incidence rate of churning of 0.008 (i.e. with 1,869 churners out of the total)
Describe survival dataset
Stata command: stdescribe
Interpretation
This again shows the characteristics of the survival dataset:
- 7032 subjects with 1,869 churners.
- Statistics per subject are also shown.
List cases
Stata command: sts list in 1/10
Interpretation
This shows results of the first 10 customers – by analysis time (Time):
- Numbers at the start of each stage (Beg. Total)
- Churners (Fail)
- Censored customers (Net Fail)
- Kaplan-Meier function (Survivor Function)
- The 95% Confidence Intervals
EXPLORING THE SURVIVAL DATA: UNIVARIATE ANALYSES
Test for equality of survivor functions
Stata command: sts test b_multiplelines, logrank
Interpretation
We can see significant differences in the survivor functions of multiple line subclasses.
Stata command: sts graph, by(Multiplelines) ci risktable ytitle(Survival probabilities)
Interpretation
We can see that 1 in 4 users have churned by month 25 of those who have only one phone line. By comparison, 1 in 4 users churn by month 43 among those with multiple phone lines, for a difference of 18 months (an extra 1.5 years of revenue!), as reported by Dr Lauren Oldja.
We also see the counts of customers at risk of churning by analysis times, as well as the 95 % Confidence Intervals of the survival probabilities.
Stata command: sts test PAYMENTTYPE, logrank
Interpretation
We can see significant differences in the survivor functions of account payment types or subclasses.
Stata command: sts graph, by(PAYMENTTYPE) ytitle(Survival probabilities) legend(position(6) rows(4))
Interpretation
We see that customers using the check payment types (particularly electronic) were the most at risk of churning.
Stata command: sts test DEPENDENTS, logrank
Interpretation
We can see significant differences in the survivor functions of dependent subclasses.
Stata command: sts graph, by(DEPENDENTS) ci risktable ytitle(Survival probabilities) legend(position(6) rows(4))
Interpretation
One in 4 of customers without dependents were most likely to have churned by 20 months. One had to wait over 60 months to observe similar churning percent among those with dependents!
We also see the counts of customers at risk of churning by analysis times, as well as the 95 % Confidence Intervals of the survival probabilities.
SEMI-PARAMETRIC MODEL BUILDING
Fit a Cox Proportional Hazard Regression Model
Stata command: stcox i.SEX i.Multiplelines i.PAYMENTTYPE i.PARTNER i.DEPENDENTS i.CONTRACT i.SENIORCITIZEN
Interpretation
Results suggest that the likelihood to churn, as indicated by the hazard ratios:
- is the same between females and males;
- is lower in multiple-line than single-line holders;
- is higher in electronic- and mailed check-customers and lower in credit transfer-customers, compared to the bank transfer- customers;
- is lower in partnered customers than in singles;
- is lower if a customer has dependents;
- decreases with increasing contract duration; and
- is the same, regardless of senior citizen status.
CONCLUSION
In this short blog we:
(a) presented exploratory Stata results from a survival data analysis of customer churning;
(b) presented statistical results and visualizations in Stata from a semi-parametric survival data model, Cox Proportional Hazard Regression;
(c) were also able, not only to replicate in Stata most of the previously reported by other scholars using Python and R programming open-source languages, but provide additional insights on churning customers from the dataset – mainly from survival data analysis point of view; and
(d) showed how easy it is to produce insightful data visualizations in Stata, with minimum code changes.
In our next blog we will build on this story and present results from a parametric survival analysis in Stata, complementary to the above-mentioned studies in R and Python programming languages.
Feel free to like us below and follow us on Twitter: @AnalyticsDat https://twitter.com/AnalyticsDat
If you need Stata assistance, please do not contact by email: DatAnalytics@iinet.com.au
For Stata software purchases please contact:
- Our strategic business partner, Survey Design and Analysis Services: https://surveydesign.com.au/ or
- Stata: https://www.stata.com/
BIBLOGRAPHY
L. Oldja (2018): Survival Analysis to Explore Customer Churn in Python https://towardsdatascience.com/survival-analysis-in-python-a-model-for-customer-churn-e737c5242822
Treselle Engineering (2018): Customer Churn – Logistic Regression with R http://www.treselle.com/blog/customer-churn-logistic-regression-with-r/
S. Li (2017): Predict Customer Churn with R https://towardsdatascience.com/predict-customer-churn-with-r-9e62357d47b4
rdata.lu Blog | Data science with R (2018): Churn Analysis – Part 1: Model Selection https://www.r-bloggers.com/churn-analysis-part-1-model-selection/
STATA do file
* Purpose: Visualized Stata Survival Analysis
*Author: Dr Gwinyai Nyakuengama
*Date: 21 July 2018
*Web page : DatAnalytics; https://dat-analytics.net/
*Version: 1
*Import data
clear all
set mat 11000
set more off
import delimited “C:\Users\MyStudio\Desktop\python Kaplan Meier demo\Churners_data_for_Stata.csv”, clear
save “C:\Users\MyStudio\Desktop\python Kaplan Meier demo\Churners_data_for_Stata.dta”, replace
* Data cleaning
* name change
rename b_churn churner
* encoding variables
encode gender, gen(SEX)
encode partner, gen(PARTNER)
encode dependents, gen(DEPENDENTS)
encode paymentm, gen(PAYMENTTYPE)
encode contract, gen(CONTRACT)
* recode Multiplelines
* First we define those labels
label define plines 1 “Yes” 0 “No” // defines this “label”
* Then we attach the value label
label values b_multiplelines plines // apply the label
rename b_multiplelines Multiplelines
* Tabulate results
tab1 Multiplelines // table with label
tab1 Multiplelines , nolabel // table without the label
* recode SENIORCITIZEN
* first we define those labels
label define seniorz 1 “Yes” 0 “No” // defines this “label”
* Then we attach the value label
label values seniorcitizen seniorz // apply the label
rename seniorcitizen SENIORCITIZEN
* Tabulate results
tab1 Multiplelines // table with label
tab1 Multiplelines , nolabel // table without the label
* declare survival dataset
stset tenure, failure(churner)
* summarize & describe survival dataset
stdescribe
stsum
ds // variable name
sts list in 1/10
*fit Cox PH regression
stcox i.SEX i.Multiplelines i.PAYMENTTYPE i.PARTNER i.DEPENDENTS i.CONTRACT i.SENIORCITIZEN
* do log-rank tests and sts graphs
sts test Multiplelines, logrank
* graph KM
sts graph, by(Multiplelines) ci risktable ytitle(Survival probabilities)
sts test PAYMENTTYPE, logrank
* graph KM
sts graph, by(PAYMENTTYPE) ytitle(Survival probabilities) legend(position(6) cols(1)rows(6) )
sts test PARTNER, logrank
* graph KM
sts graph, by(DEPENDENTS) ci risktable ytitle(Survival probabilities) legend(position(6) rows(4))