Predicting flight classes using unsupervised Machine Learning in Stata: Principal Component Analysis and Discriminant Analysis

 


By Dr Gwinyai Nyakuengama

(10 January 2019)

 

KEY WORDS

Stata commands: pca, discrim logistic, discrim lda, discrim knn and candisc; Unsupervised Machine Learning; Accuracy and Error Rates; Principal Component Analysis; Linear Discriminant Analysis; Dimension Reduction; Eigenvalues; Customer Satisfaction Survey Results.

 


Using R and RapidMiner Auto Model to rapidly and reliably choose a great red from 40,000 Kaggle wine review texts.

 


By Dr Gwinyai Nyakuengama

(2 January 2019)

 

KEY WORDS

Kaggle Amazon wine reviews; R; word2vec, h2o routine; RapidMiner Auto Model; Automatic Feature Engineering; Supervised Machine Learning Models; Naive Bayes; Generalized Linear Model; Logistic Regression; Deep Learning; Random Forest; Gradient Boosted Trees; Support Vector Machine; Model performance; Receiver Operating Characteristic (ROC) Curve; Confusion Matrix


Stylometry – Authorship Attribution (Early British Fictionists)


by Dr John Gwinyai Nyakuengama

(2 December 2018)

 

KEY WORDS

Early British Fictionists; Jane Eyre; Stylometry; Unsupervised Machine Learning; QDA Miner and WordStat; Python; k-Nearest Neighbour learning method with leave-one-out

CONTEXT

Stylometry is the application of the study of linguistic style. It is often used to define an author’s “writeprint” (Rygl, 2016). Machine Learning stylometry comprises four steps: data acquisition, feature extraction, training and testing of classifiers, and interpretation of results.

In this experiment, we anonymised the book, Jane Eyre by Charlotte Brontë.

We used Python and two Provalis packages, QDA Miner and WordStat, to undertake ML stylometry, that is, to identify the correct author of this “mystery / disputed” book from corpora (books) written by early British Fictionists, including Charlotte Brontë.

Charlotte Brontë lived from 1816 to 1855. Jane Eyre appeared in 1847 and was followed by Shirley (1849) and Villette (1853). The Professor was published posthumously in 1857.

METHODS

Authorship Attribution using Python

We followed the method of Dr François Dominic Laramée (2018):

  • Created individual files for 23 British Fictionists / authors.
  • Adapted his Python codebook and attributed authorship using three methods:
    • Mendenhall’s Characteristic Curves of Composition
    • John Burrows’ Delta Method
    • Kilgarriff’s Chi-Squared Method

 

Authorship Attribution using QDA Miner/WordStat

We followed the method of Dr Normand Peladeau and the QDA Miner and WordStat User’s Guides:

  • Created a QDA project with the individual files of the Early British Fictionists.
  • Created and saved a WordStat classification model (*.wclas) for the potential authors.
  • Created a QDA project for the mystery / disputed work.
  • Classified the mystery / disputed corpus using the WordStat model.

 

RESULTS

Stylometry – Authorship Attribution using Python

Slide7.gif

These graphs show Mendenhall’s characteristic curves for the British fictionists: word counts (y-axis) plotted against word length (x-axis, labelled “samples”). The profile of the “disputed” / mystery author most closely resembled that of CBronte (Charlotte Brontë).
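As a minimal sketch of the idea behind a characteristic curve, the Python below computes the relative frequency of each word length in a text. The function names, the simple regular-expression tokeniser and the `curve_distance` helper are our illustrative assumptions and are not taken from Laramée's notebook; Mendenhall's method itself compares the curves by eye rather than with a single number.

```python
import re
from collections import Counter


def word_length_curve(text, max_len=15):
    """Relative frequency of word lengths 1..max_len (longer words are capped)."""
    # Assumption: a word is a run of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(min(len(w), max_len) for w in words)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * max_len
    return [counts.get(n, 0) / total for n in range(1, max_len + 1)]


def curve_distance(curve_a, curve_b):
    """Hypothetical helper: sum of absolute differences (smaller = more alike)."""
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b))
```

Plotting `word_length_curve` for each candidate author next to the curve of the disputed text would give graphs of the kind shown above.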

Slide8.gif

These results show that, by Delta score, the “disputed” / mystery book was most similar to CBronte.
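For readers unfamiliar with the Delta score, here is a hedged sketch of John Burrows' Delta in plain Python. It assumes each text is already a list of lower-cased tokens and that there are at least two candidate authors; the function names and the default of 30 features are our illustrative choices, not values from the original analysis. A smaller Delta means the disputed text is stylistically closer to that author.

```python
from collections import Counter
from statistics import mean, stdev


def relative_freqs(tokens, features):
    """Share of the text taken up by each feature word."""
    counts = Counter(tokens)
    return {f: counts[f] / len(tokens) for f in features}


def burrows_delta(author_tokens, test_tokens, n_features=30):
    """Return {author: Delta score}; author_tokens maps author -> token list."""
    if len(author_tokens) < 2:
        raise ValueError("need at least two candidate authors")
    # Features: the most frequent words across all candidate authors.
    combined = Counter(t for toks in author_tokens.values() for t in toks)
    features = [w for w, _ in combined.most_common(n_features)]
    freqs = {a: relative_freqs(t, features) for a, t in author_tokens.items()}
    test_freqs = relative_freqs(test_tokens, features)
    # Mean and standard deviation of each feature across the authors;
    # features that do not vary between authors carry no information.
    stats = {}
    for f in features:
        values = [freqs[a][f] for a in freqs]
        sd = stdev(values)
        if sd > 0:
            stats[f] = (mean(values), sd)
    if not stats:
        raise ValueError("no feature varies between the authors")
    # Delta: mean absolute difference between z-scores of test text and author.
    deltas = {}
    for a in freqs:
        diffs = [abs((test_freqs[f] - mu) / sd - (freqs[a][f] - mu) / sd)
                 for f, (mu, sd) in stats.items()]
        deltas[a] = sum(diffs) / len(diffs)
    return deltas
```

The candidate with the smallest value, for example `min(deltas, key=deltas.get)`, is the most likely author under this measure.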

 

Slide9

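The Chi-squared statistic for CBronte was much smaller than that for CDickens, the other possible writer of the “disputed” / mystery corpus. A smaller statistic indicates that the two texts use their most common words in more similar proportions.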
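The Chi-squared comparison can be sketched as follows. This is a simplified illustration of Kilgarriff's approach, assuming both texts are non-empty lists of tokens; the function name and the default of 500 common words are our assumptions for the example, not settings confirmed by the original analysis.

```python
from collections import Counter


def chi_squared_distance(author_tokens, disputed_tokens, n_features=500):
    """Chi-squared statistic over the most common words of the joint corpus."""
    if not author_tokens or not disputed_tokens:
        raise ValueError("both token lists must be non-empty")
    author_counts = Counter(author_tokens)
    disputed_counts = Counter(disputed_tokens)
    joint_counts = author_counts + disputed_counts
    # Share of the joint corpus contributed by the candidate author.
    author_share = len(author_tokens) / (len(author_tokens) + len(disputed_tokens))
    chi_squared = 0.0
    for word, joint_count in joint_counts.most_common(n_features):
        # Counts expected if both texts drew words from the same distribution.
        expected_author = joint_count * author_share
        expected_disputed = joint_count * (1 - author_share)
        chi_squared += (author_counts[word] - expected_author) ** 2 / expected_author
        chi_squared += (disputed_counts[word] - expected_disputed) ** 2 / expected_disputed
    return chi_squared
```

Running this once per candidate author and picking the smallest statistic mirrors the comparison reported above.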

 

 

Stylometry – Authorship Attribution using QDA Miner/WordStat

Slide11.gif

This image shows the QDA Miner project of three British fictionists, CharlesDickens, CharlotteBronte and JaneAusten, which were subsequently used to create a Machine Learning classification model in WordStat.

Slide13-3885037938-1543727747827.png

This image shows the creation of a Machine Learning classification model in WordStat. The k-Nearest Neighbour learning method was used, with leave-one-out validation.
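To illustrate what k-Nearest Neighbour with leave-one-out validation means, here is a small generic Python sketch. It is not WordStat's implementation, whose internals we do not know; the feature vectors, the Euclidean distance and the function names are assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training samples.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples, k=3):
    """Hold out each sample in turn, train on the rest, and score the prediction."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        if knn_predict(train, vector, k) == label:
            correct += 1
    return correct / len(samples)
```

In a stylometric setting each feature vector could hold, for instance, the relative frequencies of common words in one text segment, with the author's name as the label.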

Results also show good model performance in terms of precision, accuracy and recall. Definitions of these Machine Learning model performance terms were given previously (see Nyakuengama 2018: https://dat-analytics.net/2018/07/28/use-of-rapidminer-auto-model-to-predict-customer-churn/).
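As a quick reminder of how these three measures are calculated, the sketch below computes them from lists of true and predicted labels, treating one author as the "positive" class. The function name and the example labels are hypothetical and do not reproduce the WordStat output.

```python
def classification_metrics(y_true, y_pred, positive):
    """Return (accuracy, precision, recall) for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_pos = sum(t != positive and p == positive for t, p in pairs)
    false_neg = sum(t == positive and p != positive for t, p in pairs)
    # Accuracy: share of all predictions that are correct.
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Precision: share of "positive" predictions that are right.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: share of actual positives that were found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return accuracy, precision, recall
```

For example, `classification_metrics(["CB", "CB", "CD", "JA"], ["CB", "CD", "CD", "JA"], positive="CB")` scores a classifier that missed one of two CB texts but never wrongly predicted CB.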

  

Slide14

This image shows a step in QDA Miner before the Machine Learning classification model was applied to the “disputed” / mystery corpus.

Slide15

This image shows the result after the WordStat Machine Learning classification model was applied in QDA Miner. Most importantly, it shows that the model correctly picked Charlotte Brontë as the mystery author of Jane Eyre.

Slide16.gif

This WordStat Group Dendrogram also suggested that the author of the mystery book (anonymised_test_case) was most likely CharlotteBronte.

Slide17.gif

This WordStat Correspondence Analysis chart also suggested that the author of the mystery case (anonymised_test_case) was most likely CharlotteBronte.

 

CONCLUSIONS

This short blog showcased two Machine Learning stylometric methods, implemented using Python and the Provalis packages QDA Miner and WordStat.

  • Both methods correctly identified Charlotte Brontë as the author of Jane Eyre. Each method yielded useful and complementary information.
  • We note that:
    • Python can handle large corpora but is programmatically more challenging than the Provalis packages.
    • The Provalis packages can just as easily handle several million words and a billion tokens. The trick is to have fast computer processors when using them to process large documents / corpora.

 

In our future blogs:

  • Python and the Provalis packages, QDA Miner and WordStat, will be used to undertake more complex Machine Learning of unstructured texts.
  • We may also showcase the mapping and data visualization packages (namely, TIBCO Spotfire, Tableau and Power BI) which we currently use synergistically with other advanced data analytics tools (such as Stata and RapidMiner).

 

 

ACKNOWLEDGEMENTS

We thank the owner of the Early British Fictionists GitHub resource: A_Small_Collection_of_British_Fiction.

We thank Dr Normand Peladeau for QDA Miner and WordStat and his associated webinars:

  • Supervised and Unsupervised Machine Learning Features.
  • Webinar on the New Features of WordStat 8 – Content.

 

We thank Dr François Dominic Laramée for sharing the Python Jupyter Notebook used in his blog: François Dominic Laramée, “Introduction to stylometry with Python,” The Programming Historian 7 (2018),

https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python.

 

We thank Survey Design and Analysis Services (https://surveydesign.com.au/), vendor of QDA Miner, WordStat and Stata.

We thank Anaconda, distributors of Python and Jupyter Notebook.

 

 

BIBLIOGRAPHY

Dr Jan Rygl (2016), “PA153: Stylometric analysis of texts using machine learning techniques,” NLP Centre, Faculty of Informatics, Masaryk University.

Dr Normand Peladeau’s webinars on QDA Miner and WordStat:

  • Supervised and Unsupervised Machine Learning Features
  • Webinar on the New Features of WordStat 8 – Content

Dr François Dominic Laramée, “Introduction to stylometry with Python,” The Programming Historian 7 (2018), https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python

WordStat / QDA Miner User’s Manuals, https://provalisresearch.com/