by Dr John Gwinyai Nyakuengama
(2 December 2018)
KEY WORDS
Early British Fictionists; Jane Eyre; Stylometry; Unsupervised Machine Learning;
QDA Miner and WordStat; Python; k-Nearest Neighbour learning method with leave-one-out
CONTEXT
Stylometry is the quantitative study of linguistic style. It is often used to define an author’s writeprint (Rygl, 2016). The steps in Machine Learning (ML) stylometry are data acquisition, feature extraction, training and testing of classifiers, and interpretation of results.
In this experiment, we anonymised the book, Jane Eyre by Charlotte Brontë.
We used Python and two Provalis packages, QDA Miner and WordStat, to undertake ML stylometry, that is, to identify the correct author of this “mystery/disputed” book from a corpus of books written by early British fictionists, including Charlotte Brontë.
Charlotte Brontë lived from 1816 to 1855. Jane Eyre appeared in 1847 and was followed by Shirley (1849) and Villette (1853). The Professor was published posthumously in 1857.
METHODS
Authorship Attribution using Python
We followed the method of Dr François Dominic Laramée (2018):
- Created individual files for 23 British fictionists / authors.
- Adapted his Python codebook and attributed authorship using three methods:
  - Mendenhall’s Characteristic Curves of Composition
  - John Burrows’ Delta Method
  - Kilgarriff’s Chi-Squared Method
Authorship Attribution using QDA Miner/WordStat
We followed the method of Dr Normand Peladeau and the QDA Miner and WordStat User Guides:
- Created a QDA project with the individual files of the Early British Fictionists.
- Created and saved a WordStat classification model (*.wclas) for the potential authors.
- Created a QDA project for the mystery / disputed work.
- Classified the mystery / disputed work using the WordStat model.
RESULTS
Stylometry – Authorship Attribution using Python
These graphs show Mendenhall’s characteristic curves, that is, word counts plotted against word length, for the British fictionists. The profile of the “disputed” / mystery author most closely resembled that of CBronte (Charlotte Brontë).
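For readers who want to see what lies behind such a curve, here is a minimal sketch using only the Python standard library. It is not the codebook used in this experiment: the function name `word_length_distribution`, the crude letters-only tokeniser and the two toy texts are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_length_distribution(text, max_len=15):
    """Return the relative frequency of word lengths 1..max_len (hypothetical helper)."""
    # Crude tokeniser (an assumption): every run of ASCII letters counts as a word.
    words = re.findall(r"[A-Za-z]+", text)
    # Words longer than max_len are pooled into the last bin.
    counts = Counter(min(len(w), max_len) for w in words)
    total = sum(counts.values())
    return [counts[n] / total if total else 0.0 for n in range(1, max_len + 1)]


# Made-up toy texts standing in for a candidate author's corpus and the disputed book.
known_text = "The garden was quiet and the house was still that evening."
disputed_text = "The evening was still and the garden was very quiet."

for label, text in [("known", known_text), ("disputed", disputed_text)]:
    curve = word_length_distribution(text, max_len=8)
    print(label, [round(p, 2) for p in curve])
```

Plotting each author's list against word length gives one curve per author; the candidate whose curve lies closest to that of the disputed text is the most plausible match under this method.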
These results show that, by Delta score, the “disputed” / mystery book was most similar to the work of CBronte.
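As a rough illustration of how a Delta score can be computed, the sketch below follows the usual recipe: take the most frequent words in the candidate corpus, convert each author's word frequencies to z-scores, and average the absolute z-score differences from the disputed text. A lower score means greater similarity. The function names, the `n_features` default and the toy texts are assumptions, not the code or data used in this experiment, and at least two candidate authors are assumed.

```python
import re
from collections import Counter
from statistics import mean, stdev


def tokenize(text):
    # Crude lower-cased tokeniser (an assumption): runs of letters only.
    return re.findall(r"[a-z]+", text.lower())


def burrows_delta(author_texts, disputed_text, n_features=30):
    """Return {author: delta} for non-empty texts from two or more authors."""
    tokens = {a: tokenize(t) for a, t in author_texts.items()}
    # Features: the n most frequent words across the whole candidate corpus.
    pooled = Counter(w for toks in tokens.values() for w in toks)
    features = [w for w, _ in pooled.most_common(n_features)]

    def rel_freqs(toks):
        counts = Counter(toks)
        return {w: counts[w] / len(toks) for w in features}

    freqs = {a: rel_freqs(toks) for a, toks in tokens.items()}
    # Mean and standard deviation of each feature across the candidate authors.
    mu = {w: mean([f[w] for f in freqs.values()]) for w in features}
    sigma = {w: stdev([f[w] for f in freqs.values()]) for w in features}
    # Features with zero spread cannot be standardised, so they are skipped.
    usable = [w for w in features if sigma[w] > 0]

    test = rel_freqs(tokenize(disputed_text))
    z_test = {w: (test[w] - mu[w]) / sigma[w] for w in usable}
    deltas = {}
    for author, f in freqs.items():
        diffs = [abs((f[w] - mu[w]) / sigma[w] - z_test[w]) for w in usable]
        deltas[author] = mean(diffs)
    return deltas


# Made-up toy corpus; real use would pass whole books per author.
authors = {
    "AuthorA": "the cat sat on the mat and the cat slept on the mat",
    "AuthorB": "a dog ran in a park and a dog barked in a park",
    "AuthorC": "it was a day of rain and it was a night of wind",
}
disputed = "the cat slept and the cat sat on the mat"

for author, delta in sorted(burrows_delta(authors, disputed).items(), key=lambda x: x[1]):
    print(author, round(delta, 3))
```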
The Chi-squared statistic for CBronte was much smaller than that for CDickens, another possible writer of the “disputed” / mystery corpus. A smaller statistic indicates a closer match in word usage.
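The idea behind this comparison can be sketched as follows: pool a candidate author's text with the disputed text, take the most common words in the pooled corpus, and measure how far each text's observed word counts depart from the counts expected if both texts drew on the same vocabulary in the same proportions. This is a simplified standard-library sketch; the function name, the `n_words` default and the tokeniser are assumptions rather than the code used in this experiment, and both texts are assumed to be non-empty.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude lower-cased tokeniser (an assumption): runs of letters only.
    return re.findall(r"[a-z]+", text.lower())


def chi_squared_distance(candidate_text, disputed_text, n_words=500):
    """Chi-squared statistic between two non-empty texts; smaller means more similar."""
    cand = tokenize(candidate_text)
    disp = tokenize(disputed_text)
    cand_counts, disp_counts = Counter(cand), Counter(disp)
    joint_counts = cand_counts + disp_counts
    # Share of the pooled corpus that comes from the candidate author.
    cand_share = len(cand) / (len(cand) + len(disp))

    chi2 = 0.0
    for word, joint in joint_counts.most_common(n_words):
        # Expected counts if both texts used this word at the same rate.
        expected_cand = joint * cand_share
        expected_disp = joint * (1 - cand_share)
        chi2 += (cand_counts[word] - expected_cand) ** 2 / expected_cand
        chi2 += (disp_counts[word] - expected_disp) ** 2 / expected_disp
    return chi2


# Made-up toy texts; real use would compare whole books.
print(chi_squared_distance("the cat sat on the mat", "the cat slept on the mat"))
print(chi_squared_distance("a dog ran in a park", "the cat slept on the mat"))
```

Running the function once per candidate author and ranking the results from smallest to largest gives the kind of comparison reported above.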
Stylometry – Authorship Attribution using QDA Miner/WordStat
This image shows the QDA Miner project of three British fictionists, CharlesDickens, CharlotteBronte and JaneAusten, whose works were subsequently used to create a Machine Learning classification model in WordStat.
This image shows the creation of a Machine Learning classification model in WordStat. The k-Nearest Neighbour learning method was used, with leave-one-out validation.
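For readers unfamiliar with these terms, the generic sketch below shows k-Nearest Neighbour classification with leave-one-out validation in plain Python. It does not reproduce WordStat's implementation, whose internals are not described here. The word-frequency features, the Euclidean distance, the function names and the toy samples are all assumptions.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Map a non-empty text to relative word frequencies (assumed feature set)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {w: c / len(tokens) for w, c in Counter(tokens).items()}


def distance(u, v):
    # Euclidean distance over the union of words seen in either vector.
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


def knn_predict(train, query, k=1):
    """Majority label among the k training vectors closest to the query."""
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples, k=1):
    """Hold out each sample in turn, train on the rest, and score the prediction."""
    hits = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        hits += knn_predict(rest, vector, k) == label
    return hits / len(samples)


# Made-up toy samples: two short "documents" per hypothetical author.
raw = [
    ("AuthorA", "the cat sat on the mat"),
    ("AuthorA", "the cat slept on the mat"),
    ("AuthorB", "a dog ran in a park"),
    ("AuthorB", "a dog barked in a park"),
]
samples = [(vectorize(text), author) for author, text in raw]

print("leave-one-out accuracy:", leave_one_out_accuracy(samples, k=1))
print("predicted author:", knn_predict(samples, vectorize("the cat sat"), k=1))
```

Leave-one-out validation suits small collections like this one, because every document is used for testing exactly once while all the others are used for training.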
Results also show good model performance in terms of precision, accuracy and recall. These Machine Learning performance terms were defined previously (see Nyakuengama 2018: https://dat-analytics.net/2018/07/28/use-of-rapidminer-auto-model-to-predict-customer-churn/).
This image shows a step in QDA Miner before the Machine Learning classification model was applied to the “disputed” / mystery corpus.
This image shows the result after the WordStat Machine Learning classification model was applied in QDA Miner. Most importantly, it shows that the model correctly picked Charlotte Brontë as the mystery author of Jane Eyre.
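As a quick reminder of how these three measures relate, the sketch below computes them for one author from lists of actual and predicted labels. The function name and the labels are made up for illustration; they are not results from this experiment.

```python
def classification_metrics(actual, predicted, positive):
    """Accuracy over all classes, plus precision and recall for one class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    false_neg = sum(a == positive and p != positive for a, p in pairs)

    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    # Precision: of the documents assigned to this author, how many were right?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of this author's documents, how many were found?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical labels only (CB, CD, JA are placeholder author codes).
actual = ["CB", "CB", "CD", "JA", "CD", "JA"]
predicted = ["CB", "CD", "CD", "JA", "CB", "JA"]

print(classification_metrics(actual, predicted, positive="CB"))
```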
This WordStat Group Dendrogram also suggested that the author of the mystery book (anonymised_test_case) was most likely CharlotteBronte.
This WordStat Correspondence Analysis chart also suggested that the author of the mystery case (anonymised_test_case) was most likely CharlotteBronte.
CONCLUSIONS
This short blog showcased two Machine Learning stylometric methods, one implemented in Python and the other in the Provalis packages, QDA Miner and WordStat.
- Both methods correctly identified Charlotte Brontë as the author of Jane Eyre. Each method yielded useful and complementary information.
- We note that:
  - Python can handle large corpora but is programmatically more challenging than the Provalis packages.
  - The Provalis packages can just as easily handle several million words and a billion tokens. The trick is to have fast computer processors when using them to process large documents / corpora.
In future blogs:
- Python and the Provalis packages, QDA Miner and WordStat, will be used to undertake more complex Machine Learning of unstructured texts.
- We may also showcase the mapping and data visualization packages (namely, Tibco Spotfire, Tableau and Power BI) which we currently use synergistically with other advanced data analytics tools (such as Stata and RapidMiner).
ACKNOWLEDGEMENTS
We thank the owner of the Early British Fictionists GitHub resource, A_Small_Collection_of_British_Fiction.
We thank Dr Normand Peladeau for QDA Miner and WordStat and his associated webinars:
- Supervised and Unsupervised Machine Learning Features.
- Webinar on the New Features of WordStat 8 – Content.
We thank Dr François Dominic Laramée for sharing the Python Jupyter Notebook used in his blog: François Dominic Laramée, “Introduction to stylometry with Python,” The Programming Historian 7 (2018),
https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python.
We thank Survey Design and Analysis Services: https://surveydesign.com.au/ , vendor of QDA Miner, WordStat and Stata.
We thank Anaconda, distributors of Python and Jupyter Notebook.
BIBLIOGRAPHY
Dr Jan Rygl (2016), “PA153: Stylometric analysis of texts using machine learning techniques,” NLP Centre, Fac. Informatics, Masaryk University.
Dr Normand Peladeau’s webinars on QDA Miner and WordStat:
- Supervised and Unsupervised Machine Learning Features
- Webinar on the New Features of WordStat 8 – Content
Dr François Dominic Laramée, “Introduction to stylometry with Python,” The Programming Historian 7 (2018), https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python
WordStat / QDA Miner Users Manuals: https://provalisresearch.com/