SURVIVAL DATA ANALYSIS AND VISUALIZATION IN STATA – PART 1

 

By Dr Gwinyai Nyakuengama

(21 July 2018)

 

KEYWORDS

Stata; Survival Data Analysis; Kaplan-Meier; Cox Proportional Hazard Regression; Nelson-Aalen; Life table; Churn

 

INTRODUCTION

Welcome to our Stata blog!

The point of this blog job is to have fun and to showcase the powerful Stata capabilities for survival data analysis and data visualization.

What is survival data analysis?

  • Survival data analysis is widely used in which the time until the event is of interest. Data is often censored or truncated. This, among other things, precludes the use of OLS from survival data analysis.

How is this related to customer churning?

  • Customer churning is when the customer service ceases.
  • When the customer service churns is the event of interest in survival data analysis.

 

RESEARCH QUESTIONS

In this blog we answer the following questions;

  • What customer information is useful to determine the likelihood of a customer to churn?
  • Can be Stata reliably used for survival data analysis and visualization of customer churning data?

 

METHOD

We examined customer service history to understand customer churning partners. We note that the same data was used previously to study customer churning by other scholars using R or Python programming languages, namely:

  • Li (2017);
  • Oldja (2018);
  • rdata.lu Blog | Data science with R (2018); and
  • Treselle Engineering (2018).

 

For this blog, we sourced data from the survival data study by Oldja (2018).  We are sincerely grateful to WatsonAnalytics team and Dr Lauren Oldja for her Python source code which we ran (see https://github.com/loldja/loldja.github.io/blob/master/assets/code/blog/Kaplan%20Meier%20demo.ipynb).

In our case, we exported the resulting dataset as a csv file for use in Stata. The Stata do file at the end of this blog is about the csv data importation, data cleansing, data exploration and survival data analysis.

 

About the data

The data set includes information about:

Customers who left within the last month – the column is called Churn
Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
Demographic info about customers – gender, age range, and if they have partners and dependents
To source the original customer dataset follow these instructions:

  • Go to https://community.watsonanalytics.com/resources/
  • Download the Telco Customer Churn sample data file.
  • In Watson Analytics, tap Add and upload Telco Customer Churn.
    The filename is a bit longer: WA_Fn-UseC_-Telco-Customer-Churn.csv.

We now present the key survival data analysis results below.

 

RESULTS
Declare survival data

Stata command: stset tenure, failure(churner)

Table 1 - stset tenure - failure

Interpretation

This shows:

  • 7,043 subjects originally but only 7,032 subjects were used in the analysis.
  • 1,869 subjects churned during the analysis time
  • 227,990 months in made up the total analysis time.

 

 

Summarize survival dataset

Stata command: stsum

Table-3-stdescribe.png

 

Interpretation

This shows:

  • 7,032 subjects with an incidence rate of churning of 0.008 (i.e. with 1,869 churners out of the total)

 

Describe survival dataset

Stata command: stdescribe

Table-2-st-sum.png

 

 

Interpretation

This again shows the characteristics of the survival dataset:

  • 7032 subjects with 1,869 churners.
  • Statistics per subject are also shown.

 

List cases

Stata command: sts list in 1/10

Table 4 - st list

 

Interpretation

This shows results of the first 10 customers – by analysis time (Time):

  • Numbers at the start of each stage (Beg. Total)
  • Churners (Fail)
  • Censored customers (Net Fail)
  • Kaplan-Meier function (Survivor Function)
  • The 95% Confidence Intervals

 

 

EXPLORING THE SURVIVAL DATA: UNIVARIATE ANALYSES
Test for equality of survivor functions

Stata command: sts test b_multiplelines, logrank

Table 5 - sts test - multiplelines

 

Interpretation

We can see significant differences in the survivor functions of multiple line subclasses.

 

Stata command: sts graph, by(Multiplelines) ci risktable ytitle(Survival probabilities)

Fig 1 - sts graph by multiplelines

 

Interpretation

We can see that 1 in 4 users have churned by month 25 of those who have only one phone line. By comparison, 1 in 4 users churn by month 43 among those with multiple phone lines, for a difference of 18 months (an extra 1.5 years of revenue!), as reported by Dr Lauren Oldja.

We also see the counts of customers at risk of churning by analysis times, as well as the 95 % Confidence Intervals of the survival probabilities.

 

Stata command: sts test PAYMENTTYPE, logrank

Table 6 - sts test - Paymenttype

 

Interpretation

We can see significant differences in the survivor functions of account payment types or subclasses.

 

Stata command: sts graph, by(PAYMENTTYPE) ytitle(Survival probabilities) legend(position(6) rows(4))

Fig 2 - sts graph by paymenttypes

 

Interpretation

We see that customers using the check payment types (particularly electronic) were the most at risk of churning.

 

Stata command: sts test DEPENDENTS, logrank

Table-7-sts-test-dependents-3691661728-1532502388832.png

 

 

Interpretation

We can see significant differences in the survivor functions of dependent subclasses.

 

Stata command: sts graph, by(DEPENDENTS) ci risktable ytitle(Survival probabilities) legend(position(6) rows(4))

Fig 3 - sts graph by dependents

 

Interpretation

One in 4 of customers without dependents were most likely to have churned by 20 months. One had to wait over 60 months to observe similar churning percent among those with dependents!

We also see the counts of customers at risk of churning by analysis times, as well as the 95 % Confidence Intervals of the survival probabilities.

 

SEMI-PARAMETRIC MODEL BUILDING
Fit a Cox Proportional Hazard Regression Model

Stata command: stcox i.SEX i.Multiplelines i.PAYMENTTYPE   i.PARTNER i.DEPENDENTS i.CONTRACT i.SENIORCITIZEN

Table-8-stcox-ph-regression-results.png

 

Interpretation

Results suggest that the likelihood to churn, as indicated by the hazard ratios:

  • is the same between females and males;
  • is lower in multiple-line than single-line holders;
  • is higher in electronic- and mailed check-customers and lower in credit transfer-customers, compared to the bank transfer- customers;
  •  is lower in partnered customers than in singles;
  • is lower if a customer has dependents;
  • decreases with increasing contract duration; and
  • is the same, regardless of senior citizen status.

 

CONCLUSION
In this short blog we:

(a)    presented exploratory Stata results from a survival data analysis of customer churning;

(b)   presented statistical results and visualizations in Stata from a semi-parametric survival data model, Cox Proportional Hazard Regression;

(c)    were also able, not only to replicate in Stata most of the previously reported by other scholars using Python and R programming open-source languages, but provide additional insights on churning customers from the dataset – mainly from survival data analysis point of view; and

(d)   showed how easy it is to produce insightful data visualizations in Stata, with minimum code changes.

 

In our next blog we will build on this story and present results from a parametric survival analysis in Stata, complementary to the above-mentioned studies in R and Python programming languages.

 

Feel free to like us below and follow us on Twitter: @AnalyticsDat  https://twitter.com/AnalyticsDat

If you need Stata assistance, please do not contact by email: DatAnalytics@iinet.com.au

 

For Stata software purchases please contact:

 

 

BIBLOGRAPHY
L. Oldja (2018): Survival Analysis to Explore Customer Churn in Python https://towardsdatascience.com/survival-analysis-in-python-a-model-for-customer-churn-e737c5242822

Treselle Engineering (2018): Customer Churn – Logistic Regression with R http://www.treselle.com/blog/customer-churn-logistic-regression-with-r/

S. Li (2017): Predict Customer Churn with R https://towardsdatascience.com/predict-customer-churn-with-r-9e62357d47b4

rdata.lu Blog | Data science with R (2018): Churn Analysis – Part 1: Model Selection https://www.r-bloggers.com/churn-analysis-part-1-model-selection/

STATA do file
* Purpose:       Visualized Stata Survival Analysis                        

*Author:          Dr Gwinyai Nyakuengama                                                                 

*Date:              21 July 2018                                                                                              

*Web page :    DatAnalytics; https://dat-analytics.net/      

*Version:         1                                                                                                                                            

*Import data

clear all

set mat 11000

set more off

import delimited “C:\Users\MyStudio\Desktop\python Kaplan Meier demo\Churners_data_for_Stata.csv”, clear

save “C:\Users\MyStudio\Desktop\python Kaplan Meier demo\Churners_data_for_Stata.dta”, replace

 

* Data cleaning

             * name change

                          rename b_churn churner 

             * encoding variables

                          encode gender, gen(SEX)

                          encode partner, gen(PARTNER)

                          encode dependents, gen(DEPENDENTS)

                          encode paymentm, gen(PAYMENTTYPE)

                          encode contract, gen(CONTRACT)

             * recode Multiplelines

             * First we define those labels

                          label define plines 1 “Yes” 0 “No” // defines this “label”

             * Then we attach the value label 

                          label values b_multiplelines plines // apply the label

                           rename b_multiplelines Multiplelines

             * Tabulate results

                          tab1 Multiplelines  // table with label

                          tab1 Multiplelines , nolabel // table without the label

             * recode SENIORCITIZEN

             * first we define those labels

                          label define seniorz 1 “Yes” 0 “No” // defines this “label”

             * Then we attach the value label 

                          label values seniorcitizen seniorz // apply the label

                           rename seniorcitizen SENIORCITIZEN

             * Tabulate results

                          tab1 Multiplelines  // table with label

                          tab1 Multiplelines , nolabel // table without the label

 * declare survival dataset

stset tenure, failure(churner)

* summarize & describe survival dataset

stdescribe

stsum

         

ds // variable name

sts list in 1/10

*fit Cox PH regression

stcox i.SEX i.Multiplelines i.PAYMENTTYPE   i.PARTNER i.DEPENDENTS i.CONTRACT i.SENIORCITIZEN

* do log-rank tests and sts graphs

sts test Multiplelines, logrank

* graph KM

sts graph, by(Multiplelines) ci risktable ytitle(Survival probabilities)

sts test PAYMENTTYPE, logrank

* graph KM

sts graph, by(PAYMENTTYPE)   ytitle(Survival probabilities) legend(position(6) cols(1)rows(6) )

sts test PARTNER, logrank

* graph KM

sts graph, by(DEPENDENTS) ci risktable ytitle(Survival probabilities) legend(position(6) rows(4))

 

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.