Site icon DatAnalytics

SURVIVAL DATA ANALYSIS AND VISUALIZATION IN STATA – PART 1

 

By Dr Gwinyai Nyakuengama

(21 July 2018)

 

KEYWORDS

Stata; Survival Data Analysis; Kaplan-Meier; Cox Proportional Hazard Regression; Nelson-Aalen; Life table; Churn

 

INTRODUCTION

Welcome to our Stata blog!

The point of this blog job is to have fun and to showcase the powerful Stata capabilities for survival data analysis and data visualization.

What is survival data analysis?

How is this related to customer churning?

 

RESEARCH QUESTIONS

In this blog we answer the following questions;

 

METHOD

We examined customer service history to understand customer churning partners. We note that the same data was used previously to study customer churning by other scholars using R or Python programming languages, namely:

 

For this blog, we sourced data from the survival data study by Oldja (2018).  We are sincerely grateful to WatsonAnalytics team and Dr Lauren Oldja for her Python source code which we ran (see https://github.com/loldja/loldja.github.io/blob/master/assets/code/blog/Kaplan%20Meier%20demo.ipynb).

In our case, we exported the resulting dataset as a csv file for use in Stata. The Stata do file at the end of this blog is about the csv data importation, data cleansing, data exploration and survival data analysis.

 

About the data

The data set includes information about:

Customers who left within the last month – the column is called Churn
Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
Demographic info about customers – gender, age range, and if they have partners and dependents
To source the original customer dataset follow these instructions:

We now present the key survival data analysis results below.

 

RESULTS
Declare survival data

Stata command: stset tenure, failure(churner)

Interpretation

This shows:

 

 

Summarize survival dataset

Stata command: stsum

 

Interpretation

This shows:

 

Describe survival dataset

Stata command: stdescribe

 

 

Interpretation

This again shows the characteristics of the survival dataset:

 

List cases

Stata command: sts list in 1/10

 

Interpretation

This shows results of the first 10 customers – by analysis time (Time):

 

 

EXPLORING THE SURVIVAL DATA: UNIVARIATE ANALYSES
Test for equality of survivor functions

Stata command: sts test b_multiplelines, logrank

 

Interpretation

We can see significant differences in the survivor functions of multiple line subclasses.

 

Stata command: sts graph, by(Multiplelines) ci risktable ytitle(Survival probabilities)

 

Interpretation

We can see that 1 in 4 users have churned by month 25 of those who have only one phone line. By comparison, 1 in 4 users churn by month 43 among those with multiple phone lines, for a difference of 18 months (an extra 1.5 years of revenue!), as reported by Dr Lauren Oldja.

We also see the counts of customers at risk of churning by analysis times, as well as the 95 % Confidence Intervals of the survival probabilities.

 

Stata command: sts test PAYMENTTYPE, logrank

 

Interpretation

We can see significant differences in the survivor functions of account payment types or subclasses.

 

Stata command: sts graph, by(PAYMENTTYPE) ytitle(Survival probabilities) legend(position(6) rows(4))

 

Interpretation

We see that customers using the check payment types (particularly electronic) were the most at risk of churning.

 

Stata command: sts test DEPENDENTS, logrank

 

 

Interpretation

We can see significant differences in the survivor functions of dependent subclasses.

 

Stata command: sts graph, by(DEPENDENTS) ci risktable ytitle(Survival probabilities) legend(position(6) rows(4))

 

Interpretation

One in 4 of customers without dependents were most likely to have churned by 20 months. One had to wait over 60 months to observe similar churning percent among those with dependents!

We also see the counts of customers at risk of churning by analysis times, as well as the 95 % Confidence Intervals of the survival probabilities.

 

SEMI-PARAMETRIC MODEL BUILDING
Fit a Cox Proportional Hazard Regression Model

Stata command: stcox i.SEX i.Multiplelines i.PAYMENTTYPE   i.PARTNER i.DEPENDENTS i.CONTRACT i.SENIORCITIZEN

 

Interpretation

Results suggest that the likelihood to churn, as indicated by the hazard ratios:

 

CONCLUSION
In this short blog we:

(a)    presented exploratory Stata results from a survival data analysis of customer churning;

(b)   presented statistical results and visualizations in Stata from a semi-parametric survival data model, Cox Proportional Hazard Regression;

(c)    were also able, not only to replicate in Stata most of the previously reported by other scholars using Python and R programming open-source languages, but provide additional insights on churning customers from the dataset – mainly from survival data analysis point of view; and

(d)   showed how easy it is to produce insightful data visualizations in Stata, with minimum code changes.

 

In our next blog we will build on this story and present results from a parametric survival analysis in Stata, complementary to the above-mentioned studies in R and Python programming languages.

 

Feel free to like us below and follow us on Twitter: @AnalyticsDat  https://twitter.com/AnalyticsDat

If you need Stata assistance, please do not contact by email: DatAnalytics@iinet.com.au

 

For Stata software purchases please contact:

 

 

BIBLOGRAPHY
L. Oldja (2018): Survival Analysis to Explore Customer Churn in Python https://towardsdatascience.com/survival-analysis-in-python-a-model-for-customer-churn-e737c5242822

Treselle Engineering (2018): Customer Churn – Logistic Regression with R http://www.treselle.com/blog/customer-churn-logistic-regression-with-r/

S. Li (2017): Predict Customer Churn with R https://towardsdatascience.com/predict-customer-churn-with-r-9e62357d47b4

rdata.lu Blog | Data science with R (2018): Churn Analysis – Part 1: Model Selection https://www.r-bloggers.com/churn-analysis-part-1-model-selection/

STATA do file
* Purpose:       Visualized Stata Survival Analysis                        

*Author:          Dr Gwinyai Nyakuengama                                                                 

*Date:              21 July 2018                                                                                              

*Web page :    DatAnalytics; https://dat-analytics.net/      

*Version:         1                                                                                                                                            

*Import data

clear all

set mat 11000

set more off

import delimited “C:\Users\MyStudio\Desktop\python Kaplan Meier demo\Churners_data_for_Stata.csv”, clear

save “C:\Users\MyStudio\Desktop\python Kaplan Meier demo\Churners_data_for_Stata.dta”, replace

 

* Data cleaning

             * name change

                          rename b_churn churner 

             * encoding variables

                          encode gender, gen(SEX)

                          encode partner, gen(PARTNER)

                          encode dependents, gen(DEPENDENTS)

                          encode paymentm, gen(PAYMENTTYPE)

                          encode contract, gen(CONTRACT)

             * recode Multiplelines

             * First we define those labels

                          label define plines 1 “Yes” 0 “No” // defines this “label”

             * Then we attach the value label 

                          label values b_multiplelines plines // apply the label

                           rename b_multiplelines Multiplelines

             * Tabulate results

                          tab1 Multiplelines  // table with label

                          tab1 Multiplelines , nolabel // table without the label

             * recode SENIORCITIZEN

             * first we define those labels

                          label define seniorz 1 “Yes” 0 “No” // defines this “label”

             * Then we attach the value label 

                          label values seniorcitizen seniorz // apply the label

                           rename seniorcitizen SENIORCITIZEN

             * Tabulate results

                          tab1 Multiplelines  // table with label

                          tab1 Multiplelines , nolabel // table without the label

 * declare survival dataset

stset tenure, failure(churner)

* summarize & describe survival dataset

stdescribe

stsum

         

ds // variable name

sts list in 1/10

*fit Cox PH regression

stcox i.SEX i.Multiplelines i.PAYMENTTYPE   i.PARTNER i.DEPENDENTS i.CONTRACT i.SENIORCITIZEN

* do log-rank tests and sts graphs

sts test Multiplelines, logrank

* graph KM

sts graph, by(Multiplelines) ci risktable ytitle(Survival probabilities)

sts test PAYMENTTYPE, logrank

* graph KM

sts graph, by(PAYMENTTYPE)   ytitle(Survival probabilities) legend(position(6) cols(1)rows(6) )

sts test PARTNER, logrank

* graph KM

sts graph, by(DEPENDENTS) ci risktable ytitle(Survival probabilities) legend(position(6) rows(4))

 

Exit mobile version