Predicting flight classes using unsupervised Machine Learning in Stata: Principal Component Analysis and Discriminant Analysis



By Dr Gwinyai Nyakuengama

(10 January 2019)



Stata commands: pca, discrim logistic, discrim lda, discrim knn and candisc; Unsupervised Machine Learning; Accuracy and Error Rates; Principal Component Analysis; Linear Discriminant Analysis; Dimension Reduction; Eigenvalues; Customer Satisfaction Survey Results.



Using R and RapidMiner Auto Model to rapidly and reliably choose a great red from 40,000 Kaggle wine review texts.

By Dr Gwinyai Nyakuengama

(2 January 2019)



Kaggle Amazon wine reviews; R; word2vec, h2o routine; RapidMiner Auto Model; Automatic Feature Engineering; Supervised Machine Learning Models; Naive Bayes; Generalized Linear Model; Logistic Regression; Deep Learning; Random Forest; Gradient Boosted Trees; Support Vector Machine; Model performance; Receiver Operating Characteristic (ROC) Curve; Confusion Matrix


Stylometry – Authorship Attribution (Early British Fictionists)


by Dr John Gwinyai Nyakuengama

(2 December 2018)



Early British Fictionists; Jane Eyre; Stylometry; Unsupervised Machine Learning;

QDA Miner and WordStat; Python; k-Nearest Neighbour learning method with leave-one-out


Stylometry is the study of linguistic style. It is often used to define an author’s writeprint (Rygl, 2016). Machine Learning stylometry comprises four steps: data acquisition, feature extraction, training and testing of classifiers, and interpretation of results.

In this experiment, we anonymised the book, Jane Eyre by Charlotte Brontë.

We used Python and two Provalis packages, QDA Miner and WordStat, to undertake ML stylometry, that is to identify the correct author of this “mystery/disputed” book, based on corpora (books) written by early British Fictionists, including Charlotte Brontë.

Charlotte Brontë lived from 1816 to 1855. Jane Eyre appeared in 1847 and was followed by Shirley (1849) and Villette (1853). The Professor was published posthumously in 1857.


Authorship Attribution using Python

We followed the method of Dr François Dominic Laramée (2018):

  • Created individual files for 23 British Fictionists / authors.
  • Adapted his Python codebook and attributed authorship using three methods:
    • Mendenhall’s Characteristic Curves of Composition
    • John Burrows’ Delta Method
    • Kilgarriff’s Chi-Squared Method


Authorship Attribution using QDA Miner/WordStat

We followed the method of Dr Normand Peladeau and the QDA Miner and WordStat User Guides:

  • Created a QDA project with the individual files of the Early British Fictionists.
  • Created and saved a WordStat classification model (*.wclas) for the potential authors.
  • Created a QDA project for the mystery / disputed work.
  • Classified the mystery / disputed work using the WordStat model.



Stylometry – Authorship Attribution using Python


These graphs show word counts vs samples for the British fictionists. Clearly, the profile of the “disputed” / mystery author most closely resembled that of CBronte (Charlotte Brontë).
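The idea behind Mendenhall's characteristic curves can be sketched in a few lines of Python. This is a minimal, standard-library illustration, not the notebook used for the graphs: the function names, the `max_len` cap and the toy strings standing in for whole books are all assumptions made for the example.

```python
import re
from collections import Counter

def word_length_profile(text, max_len=15):
    """Share of words of each length 1..max_len (a Mendenhall-style curve)."""
    words = re.findall(r"[A-Za-z]+", text)
    # Words longer than max_len are pooled into the last bin.
    lengths = Counter(min(len(w), max_len) for w in words)
    total = sum(lengths.values())
    return [lengths.get(n, 0) / total if total else 0.0
            for n in range(1, max_len + 1)]

def profile_distance(p, q):
    """Sum of absolute differences between two profiles (smaller = more alike)."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Toy strings standing in for whole books (hypothetical, for illustration only).
candidates = {
    "author_a": "It was a dark and stormy night and the rain fell",
    "author_b": "Extraordinarily complicated circumstances notwithstanding everything",
}
mystery = "It was a cold and windy day and the snow fell"

mystery_profile = word_length_profile(mystery)
distances = {name: profile_distance(word_length_profile(text), mystery_profile)
             for name, text in candidates.items()}
closest = min(distances, key=distances.get)
print(closest, distances)
```

Comparing curves by eye, as in the graphs, and comparing them by a summed difference are two views of the same profile; the numeric distance is only a convenience added here.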


These results show that the Delta Score of the “disputed” / mystery book was most similar to that of CBronte.
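For readers unfamiliar with the Delta method, the sketch below shows the usual recipe: take the most frequent words of the candidate corpus as features, convert each author's relative frequencies to z-scores, and average the absolute z-score differences against the disputed text. It is a simplified, standard-library version under stated assumptions (the function names, the tokenizer and the toy texts are hypothetical), not the code that produced the scores above.

```python
import re
from collections import Counter
from statistics import mean, stdev

def tokenize(text):
    """Lower-case word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def burrows_delta(author_texts, test_text, n_features=30):
    """Return {author: delta}; the lowest delta marks the closest style."""
    tokens = {a: tokenize(t) for a, t in author_texts.items()}
    # Features: the most frequent words across all candidate authors.
    pooled = Counter(w for toks in tokens.values() for w in toks)
    features = [w for w, _ in pooled.most_common(n_features)]

    def rel_freqs(toks):
        counts = Counter(toks)
        return {w: counts[w] / len(toks) for w in features}

    freqs = {a: rel_freqs(toks) for a, toks in tokens.items()}
    test_freqs = rel_freqs(tokenize(test_text))

    # Mean and sample standard deviation of each feature across authors
    # (needs at least two candidate authors).
    mu = {w: mean(freqs[a][w] for a in freqs) for w in features}
    sigma = {w: stdev([freqs[a][w] for a in freqs]) for w in features}
    usable = [w for w in features if sigma[w] > 0]  # skip constant features

    def z(f, w):
        return (f[w] - mu[w]) / sigma[w]

    return {a: mean(abs(z(test_freqs, w) - z(freqs[a], w)) for w in usable)
            for a in freqs}

# Toy texts (hypothetical) standing in for each author's collected works.
authors = {
    "author_a": "the cat sat on the mat and the cat slept on the mat",
    "author_b": "a dog ran to a park and a dog barked at a bird",
}
scores = burrows_delta(authors, "the cat sat on the mat")
print(min(scores, key=scores.get), scores)
```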



The Chi-squared statistic for CBronte was much smaller than that for CDickens, another possible writer of the “disputed” / mystery book.
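A smaller Chi-squared statistic means the candidate's word usage is closer to that of the disputed text. The sketch below illustrates the calculation in the spirit of Kilgarriff's method: pool the candidate and disputed tokens, and for each common word compare the observed counts with the counts expected if both texts drew on the same vocabulary. The function name, the `n_features` default and the toy token lists are assumptions for illustration only.

```python
from collections import Counter

def chi_squared_distance(author_tokens, disputed_tokens, n_features=500):
    """Chi-squared statistic over the most common words of the joint corpus."""
    author_counts = Counter(author_tokens)
    disputed_counts = Counter(disputed_tokens)
    joint = author_counts + disputed_counts
    # Share of the joint corpus contributed by the candidate author.
    author_share = len(author_tokens) / (len(author_tokens) + len(disputed_tokens))

    chi_squared = 0.0
    for word, joint_count in joint.most_common(n_features):
        expected_author = joint_count * author_share
        expected_disputed = joint_count * (1 - author_share)
        chi_squared += (author_counts[word] - expected_author) ** 2 / expected_author
        chi_squared += (disputed_counts[word] - expected_disputed) ** 2 / expected_disputed
    return chi_squared

# Toy token lists (hypothetical) standing in for full corpora.
author_a = "the cat sat on the mat and the cat slept on the mat".split()
author_b = "a dog ran to a park and a dog barked at a bird".split()
disputed = "the cat sat on the mat".split()

chi_scores = {
    "author_a": chi_squared_distance(author_a, disputed),
    "author_b": chi_squared_distance(author_b, disputed),
}
print(min(chi_scores, key=chi_scores.get), chi_scores)
```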



Stylometry – Authorship Attribution using QDA Miner/WordStat


This image shows the QDA Miner project of three British fictionists, CharlesDickens, CharlotteBronte and JaneAusten, that were subsequently used to create a Machine Learning classification model in WordStat.


This image shows the creation of a Machine Learning classification model in WordStat. The k-Nearest Neighbour learning method was used, with leave-one-out validation.

Results also show good model performance in terms of precision, accuracy and recall. Definitions of these Machine Learning model performance terms were given previously (see Nyakuengama, 2018).
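To make the validation scheme concrete, here is a minimal Python sketch of k-Nearest Neighbour classification with leave-one-out validation, followed by accuracy, precision and recall. It is not WordStat's implementation: the two-dimensional feature vectors (imagine relative frequencies of two function words), the labels and all function names are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, point, k=3):
    """Majority label among the k training samples nearest to `point`."""
    neighbours = sorted(train, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def leave_one_out(samples, k=3):
    """Predict each sample from a model trained on all the other samples."""
    return [knn_predict(samples[:i] + samples[i + 1:], features, k)
            for i, (features, _) in enumerate(samples)]

def precision_recall(actual, predicted, label):
    """Precision and recall for one class label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical feature vectors for text samples by two authors.
samples = [
    ((0.10, 0.20), "A"), ((0.15, 0.25), "A"), ((0.12, 0.18), "A"),
    ((0.80, 0.90), "B"), ((0.85, 0.95), "B"), ((0.82, 0.88), "B"),
]
actual = [label for _, label in samples]
predicted = leave_one_out(samples, k=3)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy, precision_recall(actual, predicted, "A"))
```

Leave-one-out suits small collections such as a handful of novels per author, because every sample is used for testing exactly once while all the others remain available for training.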



This image shows the step in QDA Miner before the Machine Learning classification model was applied to the “disputed” / mystery work.


This image shows the result after the WordStat Machine Learning classification model was applied in QDA Miner. Most importantly, it shows that the model correctly picked Charlotte Brontë as the mystery author of Jane Eyre.


This WordStat Group Dendrogram also suggested that the author of the mystery book (anonymised_test_case) was most likely CharlotteBronte.


This WordStat Correspondence Analysis chart also suggested that the author of the mystery case (anonymised_test_case) was most likely CharlotteBronte.



This short blog show-cased two Machine Learning stylometric methods, implemented using Python and the Provalis packages, QDA Miner and WordStat.

  • Both methods correctly identified Charlotte Brontë as the author of Jane Eyre. Each method yielded useful and complementary information.
  • We note that:
    • Python can handle large corpora but is programmatically more challenging than the Provalis packages.
    • The Provalis packages can just as easily handle several million words and a billion tokens. The trick is to use fast computer processors when processing large documents / corpora.


In a future blog:

  • Python and the Provalis packages, QDA Miner and WordStat, will be used to undertake more complex Machine Learning of unstructured texts.
  • We may also show-case the mapping and data visualization packages (namely, Tibco Spotfire, Tableau and Power Bi) which we currently use synergistically with other advanced data analytics tools (such as Stata and RapidMiner).




We thank the owner of the Early British Fictionists GitHub resource: A_Small_Collection_of_British_Fiction.

We thank Dr Normand Peladeau for QDA Miner and WordStat and his associated webinars:

  • Supervised and Unsupervised Machine Learning Features.
  • Webinar on the New Features of WordStat 8 – Content.


We thank Dr François Dominic Laramée for sharing the Python Jupyter Notebook used in his blog: François Dominic Laramée, “Introduction to stylometry with Python,” The Programming Historian 7 (2018).


We thank Survey Design and Analysis Services, vendor of QDA Miner, WordStat and Stata.

We thank Anaconda, distributors of Python and Jupyter Notebook.




Dr Jan Rygl (2016), “PA153: Stylometric analysis of texts using machine learning techniques”, NLP Centre, Faculty of Informatics, Masaryk University.

Dr Normand Peladeau’s webinars on QDA Miner and WordStat:

  • Supervised and Unsupervised Machine Learning Features
  • Webinar on the New Features of WordStat 8 – Content

Dr François Dominic Laramée, “Introduction to stylometry with Python,” The Programming Historian 7 (2018).

WordStat / QDA Miner Users Manuals







By Dr John Gwinyai Nyakuengama

(20 October 2018)



Chicago City, Divvy Bicycles (Divvy Rides) big dataset from 2014-2017; Big data analytics, visualization and mapping; Stata; R; RapidMiner Turbo Prep; Tibco Spotfire; Power Bi; Google Maps



This study analysed the big dataset of Chicago Divvy rides user transactions collected between 2014 and 2017.

It found that over 13.5 Million trips were taken during that period. In 2017 alone, over 590 Divvy ride stations operated over 6,240 individual bikes.

The Chicago Divvy rides users (customers and subscribers) showed two different usage patterns in terms of the:

  • Number of Divvy rides and median trip duration, as well as year-on-year growth patterns;
  • Time of access to Divvy rides (by day of week and by time of day); and
  • Divvy stations from which users had travelled from and to.

The current study identified some big data merits and challenges in the Chicago Divvy rides dataset and show-cased a number of big data analysis, visualization and mapping tools.


Between 2014 and 2017, customers took about 3.5 Million Divvy rides and subscribers about 9.7 Million. This means that customers and subscribers accounted for about a quarter and three quarters of all Divvy rides, respectively.
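The quarter / three-quarter split follows directly from the two rounded counts. A minimal Python check, using only the figures quoted in the text (the dictionary keys are labels chosen for this example):

```python
# Rounded counts quoted in the text, in millions.
counts_millions = {"customer": 3.5, "subscriber": 9.7}

total = sum(counts_millions.values())
shares = {user_type: count / total
          for user_type, count in counts_millions.items()}

for user_type, share in shares.items():
    print(f"{user_type}: {share:.1%}")
```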


From 2014 to 2017, the number of Divvy ride customers steadily decreased. In contrast, the number of subscribers grew, albeit at a decreasing rate. In terms of Year-on-Year (YoY) changes in Divvy rides, the 2017 growth rate in subscribers was half that observed in 2016, and about a third of that in 2015.
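Year-on-Year growth is the change from one year to the next, divided by the earlier year's value. The sketch below shows the calculation in Python. The yearly counts are hypothetical and are NOT the Divvy figures; they were chosen only so that the 2017 rate is half the 2016 rate and a third of the 2015 rate, mirroring the pattern described above.

```python
def yoy_growth(counts_by_year):
    """Return {year: growth rate relative to the previous year}."""
    years = sorted(counts_by_year)
    return {
        year: (counts_by_year[year] - counts_by_year[prev]) / counts_by_year[prev]
        for prev, year in zip(years, years[1:])
    }

# Hypothetical yearly ride counts (illustrative only, not the Divvy data).
rides = {2014: 1000, 2015: 1300, 2016: 1560, 2017: 1716}

growth = yoy_growth(rides)
for year, rate in growth.items():
    print(f"{year}: {rate:.0%}")
```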

Between 2014 and 2017, usage of Divvy rides by both customers and subscribers was seasonal, typically increasing markedly in the warmer, summer months and steadily decreasing with the approaching winter. Nonetheless, the number of subscribers vastly outstripped that of customers in every month of the year. It is also noticeable that only the number of subscribers grew over the four-year period.


Day-of-the-week usage of Divvy rides by customers and subscribers followed roughly opposite patterns during the four years. That is, among customers usage was highest during the weekend and dropped to its lowest by mid-week. The converse was true among subscribers.


The hour-of-the-day, Divvy ride usage profiles of customers and subscribers were very different during 2014-2017:

  • Customer usage distribution was uni-modal, peaking in the afternoon (around 14H00 to 15H00, or 2 to 3 PM).
  • In contrast, subscriber usage distribution was bi-modal, with two peaks: one during the morning rush hour (6H00 to 8H00, or 6 to 8 AM) and one during the evening rush hour (16H00 to 18H00, or 4 to 6 PM).
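An hour-of-the-day profile of this kind can be built by counting trip start times per hour for each user type, then looking for hours busier than both neighbouring hours. The Python sketch below illustrates this on a few made-up trips; the record layout (start time, user type), the function names and the data are assumptions, not the Divvy files or the tools used in the study.

```python
from collections import Counter
from datetime import datetime

def hourly_profile(trips):
    """Count trips per start hour, separately for each user type."""
    profile = {}
    for start_time, user_type in trips:
        hour = datetime.fromisoformat(start_time).hour
        profile.setdefault(user_type, Counter())[hour] += 1
    return profile

def peak_hours(counts):
    """Hours whose count exceeds that of both neighbouring hours."""
    return [h for h in range(24)
            if counts[h] > counts[(h - 1) % 24] and counts[h] > counts[(h + 1) % 24]]

def make_trips(user_type, hour, n):
    """Build n made-up trips starting in the given hour (illustration only)."""
    return [(f"2017-06-05 {hour:02d}:{5 * i:02d}:00", user_type) for i in range(n)]

trips = (
    make_trips("Subscriber", 6, 1) + make_trips("Subscriber", 7, 3)
    + make_trips("Subscriber", 8, 1) + make_trips("Subscriber", 16, 1)
    + make_trips("Subscriber", 17, 3) + make_trips("Subscriber", 18, 1)
    + make_trips("Customer", 13, 1) + make_trips("Customer", 14, 3)
    + make_trips("Customer", 15, 1)
)

profile = hourly_profile(trips)
peaks = {user_type: peak_hours(counts) for user_type, counts in profile.items()}
print(peaks)
```

One peak per user type corresponds to a uni-modal profile and two peaks to a bi-modal one; on real data the hourly counts would usually need smoothing before peaks are picked out this simply.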

Also noticeable in the hourly Divvy ride usage profiles are:

  • The steady, upward growth in the subscriber numbers between 2014 and 2017; and
  • The jump in customer usage between 10H00 and 17H00 (or 10 AM and 5 PM) from 2014 to 2015. However, customer usage dropped off from 2016, particularly after 14H00 (or 2 PM).




The median trip duration of customers was more than twice that of subscribers throughout the study period.
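The comparison rests on a grouped median: collect trip durations per user type and take the middle value of each group. A minimal Python sketch follows, with made-up durations in seconds; the record layout and the numbers are assumptions for illustration, not the Divvy data.

```python
from collections import defaultdict
from statistics import median

def median_duration_by_user(trips):
    """Median trip duration (same unit as the input) for each user type."""
    grouped = defaultdict(list)
    for user_type, duration in trips:
        grouped[user_type].append(duration)
    return {user_type: median(values) for user_type, values in grouped.items()}

# Made-up trip durations in seconds (illustrative only).
trips = [
    ("Customer", 1500), ("Customer", 1200), ("Customer", 1800),
    ("Subscriber", 600), ("Subscriber", 540), ("Subscriber", 660),
]

medians = median_duration_by_user(trips)
ratio = medians["Customer"] / medians["Subscriber"]
print(medians, ratio)
```

The median is a sensible choice for trip durations because a few very long rentals would pull a mean upwards without reflecting the typical ride.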


Generally, Divvy ride subscribers’ median ride duration increased during the warmer spring and summer months, then fell off sharply from autumn as winter approached. By contrast, customers’ median ride duration was not as sharply seasonal, particularly in 2017.


The day-of-the-week profiles of median ride duration among customers mirror those described previously for the number of rides by day of the week. Of note, their median ride duration tended to increase between 2014 and 2015, but not beyond.

Median ride duration also increased significantly during weekends among subscribers.


Among Divvy ride customers, the median trip duration was highest between 8H00 and 15H00. Among subscribers, it was highest during the morning rush hour (from 7H00 to 9H00) and the afternoon rush hour (from 15H00 to 17H00).

Over the years, median trip duration by hour of the day varied far less among subscribers than among customers. Among customers, the duration of Divvy rides taken before 8H00 increased substantially each year, whereas the increase in median trip duration after 8H00 seen since 2014 had petered out by 2016.


The five busiest dates in 2017 among Divvy ride customers coincided with American public holidays, as shown above.

By day of the week, the five busiest days in 2017 among Divvy ride customers were Mondays, as shown above.

The five busiest Divvy ride trip start times in 2017 among customers were in the afternoons around the Independence Day holiday, as shown above.


The five busiest morning rush hours among Divvy ride subscribers in 2017 were on the work dates shown above.

Tuesday was the busiest day of the week in 2017 among Divvy ride subscribers, as shown above.


The five dates in 2017 with the busiest workday afternoons among Divvy ride subscribers are shown above.

The five busiest afternoon rush hours among Divvy ride subscribers in 2017 were on the work dates shown above.



This map shows that 592 Divvy ride stations in Chicago were active in mid-2018.

In 2017, most customers in Chicago took rides from and to the Divvy stations shown above.




In 2017, most subscribers took rides from and to the Chicago Divvy stations shown above during the morning rush hour.



In 2017, most subscribers took rides during the afternoon rush hour from and to the Chicago Divvy stations shown above.



This study used a number of high-end, state-of-the-art big data tools at various stages to undertake data extraction, preparation, loading, analysis, exploration, visualization and mapping.

Below are screenshots from these tools:









Take home messages – a user-centric view

Divvy Rides rules, such as the requirement for regular bike check-ins depending on the purchased plan (e.g. annual membership, single ride, explorer pass, etc.), shape the bike usage trends observed in the Divvy Rides transactional data.


The Divvy rides dataset:

  • Is a great source of information and insights:
    • There are two distinct user types, and therefore two unique niches / market segments:
      • Customer: leisure / families / visitors
      • Subscriber: workers / business personnel
    • The two user types have distinct characteristics:
      • When they ride – temporal separation (different peak times and shapes)
      • How much they ride – rhythmic separation (number of rides and median ride duration)
      • From and to which Divvy stations they ride – geo-spatial separation (recreational vs business)
  • Is an invaluable information source for eco-friendly transportation in Chicago.


Take home messages – a data-centric view

Demerits – Big data problems:

  • Volume: lots of unit-level data
  • Velocity: rapid growth, particularly in the subscriber segment
  • Veracity:
    • Dummy codes used in demographic characteristics (e.g. 1900 as year of birth), for user privacy
    • Some inconsistent data variable names and geo-coding between the years


Merits – Big data attractions and opportunities for expansion:

  • Big data – large volumes of unit-level data; a rich data source for data analytics pedagogy
  • Variety – Good infrastructure to capture real-time transactional data with both geographic and temporal attributes
  • Quantitative insights from rides usage, by type…and to a limited extent user type demographics
  • Invaluable information source for planning – eco-friendly transportation