Using R and RapidMiner Auto Model to rapidly and reliably choose a great red from 40,000 Kaggle wine review texts.

 


By Dr Gwinyai Nyakuengama

(2 January 2019)

 

KEY WORDS

Kaggle Amazon wine reviews; R; word2vec; h2o; RapidMiner Auto Model; Automatic Feature Engineering; Supervised Machine Learning Models; Naive Bayes; Generalized Linear Model; Logistic Regression; Deep Learning; Random Forest; Gradient Boosted Trees; Support Vector Machine; Model performance; Receiver Operating Characteristic (ROC) Curve; Confusion Matrix


Stylometry – Authorship Attribution (Early British Fictionists)

Slide1.gif

by Dr John Gwinyai Nyakuengama

(2 December 2018)

 

KEY WORDS

Early British Fictionists; Jane Eyre; Stylometry; Unsupervised Machine Learning; QDA Miner and WordStat; Python; k-Nearest Neighbour learning method with leave-one-out validation

CONTEXT

Stylometry is the study of linguistic style, often used to characterise an author’s unique “writeprint” (Rygl, 2016). The steps in Machine Learning stylometry comprise data acquisition, feature extraction, machine learning through training and testing classifiers, and interpretation of the results.

In this experiment, we anonymised the book Jane Eyre by Charlotte Brontë.

We used Python and two Provalis packages, QDA Miner and WordStat, to undertake ML stylometry, that is, to identify the correct author of this “mystery”/disputed book from corpora (books) written by early British fictionists, including Charlotte Brontë.

Charlotte Brontë lived from 1816 to 1855. Jane Eyre appeared in 1847 and was followed by Shirley (1849) and Villette (1853). The Professor was published posthumously in 1857.

METHODS

Authorship Attribution using Python

We followed the method of Dr François Dominic Laramée (2018):

  • Created individual text files for 23 British fictionists / authors.
  • Adapted his Python codebook and attributed authorship using three methods (minimal Python sketches of each appear below):
    • Mendenhall’s Characteristic Curves of Composition
    • John Burrows’ Delta Method
    • Kilgarriff’s Chi-Squared Method
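
As a flavour of the first method, here is a minimal Python sketch, assuming each candidate author’s works have been concatenated into plain-text files (the file names are hypothetical). Mendenhall’s approach simply compares the distribution of word lengths in the disputed text against each candidate corpus:

```python
# Minimal sketch of Mendenhall's Characteristic Curve of Composition.
# File names below are hypothetical placeholders.
import re
from collections import Counter

def word_length_distribution(text, max_len=15):
    """Proportion of words at each length 1..max_len (longer words capped)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(min(len(w), max_len) for w in words)
    total = sum(counts.values())
    return [counts.get(n, 0) / total for n in range(1, max_len + 1)]

def curve_distance(dist_a, dist_b):
    """Sum of absolute differences between two word-length distributions."""
    return sum(abs(a - b) for a, b in zip(dist_a, dist_b))

disputed = word_length_distribution(
    open("anonymised_test_case.txt", encoding="utf-8").read())

for author, path in {"CBronte": "cbronte.txt", "CDickens": "cdickens.txt"}.items():
    candidate = word_length_distribution(open(path, encoding="utf-8").read())
    print(author, round(curve_distance(disputed, candidate), 4))  # smaller = closer
```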

 

Authorship Attribution using QDA Miner/WordStat

We followed the method of Dr Normand Peladeau and the QDA Miner and WordStat Users Guides:

  • Created a QDA Miner project with the individual files of the Early British Fictionists.
  • Created and saved a WordStat classification model (*.wclas) for the potential authors.
  • Created a QDA Miner project for the mystery/disputed work.
  • Classified the mystery/disputed corpus using the WordStat model.

 

RESULTS

Stylometry – Authorship Attribution using Python

Slide7.gif

These graphs show word counts versus samples for the British fictionists. Clearly, the profile of the “disputed”/mystery author most resembled that of CBronte (Charlotte Brontë).

Slide8.gif

These results show that the Delta score of the “disputed”/mystery book was most similar to that of CBronte.
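
For readers who want to experiment outside the notebook, a minimal sketch of Burrows’ Delta follows, again with hypothetical file names: z-score the relative frequencies of the most frequent words, then take the mean absolute difference in z-scores between the disputed text and each candidate; the lowest Delta indicates the closest stylistic match.

```python
# Minimal sketch of John Burrows' Delta. File names are hypothetical.
import re
from collections import Counter
from statistics import mean, stdev

def rel_freqs(text, vocab):
    """Relative frequency of each vocabulary word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {w: counts[w] / len(words) for w in vocab}

paths = {"CBronte": "cbronte.txt", "CDickens": "cdickens.txt", "JAusten": "jausten.txt"}
author_texts = {a: open(p, encoding="utf-8").read() for a, p in paths.items()}
disputed_text = open("anonymised_test_case.txt", encoding="utf-8").read()

# Vocabulary: the most frequent words across all candidate corpora.
combined = Counter(re.findall(r"[a-z']+", " ".join(author_texts.values()).lower()))
vocab = [w for w, _ in combined.most_common(30)]

author_freqs = {a: rel_freqs(t, vocab) for a, t in author_texts.items()}
mu = {w: mean(f[w] for f in author_freqs.values()) for w in vocab}
sigma = {w: stdev(f[w] for f in author_freqs.values()) for w in vocab}

def zscores(freqs):
    return {w: (freqs[w] - mu[w]) / sigma[w] for w in vocab if sigma[w] > 0}

disputed_z = zscores(rel_freqs(disputed_text, vocab))
for author, freqs in author_freqs.items():
    author_z = zscores(freqs)
    delta = mean(abs(disputed_z[w] - author_z[w]) for w in disputed_z)
    print(author, round(delta, 3))  # lowest Delta = closest match
```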

 

Slide9

The chi-squared statistic for CBronte was much smaller than that for CDickens, another possible writer of the “disputed”/mystery corpus.
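
Kilgarriff’s method can be sketched in a similarly hypothetical fashion: for the most frequent words of the combined (disputed plus candidate) corpus, compare the observed word counts in each half with the counts expected from the halves’ relative sizes; the smaller the chi-squared statistic, the more alike the two texts.

```python
# Minimal sketch of Kilgarriff's chi-squared method. File names are hypothetical.
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def chi_squared(disputed, candidate, n_words=500):
    d, c = tokens(disputed), tokens(candidate)
    d_counts, c_counts = Counter(d), Counter(c)
    joint = d_counts + c_counts
    d_share = len(d) / (len(d) + len(c))  # expected share of each joint count
    stat = 0.0
    for word, joint_count in joint.most_common(n_words):
        expected_d = joint_count * d_share
        expected_c = joint_count * (1 - d_share)
        stat += (d_counts[word] - expected_d) ** 2 / expected_d
        stat += (c_counts[word] - expected_c) ** 2 / expected_c
    return stat

disputed = open("anonymised_test_case.txt", encoding="utf-8").read()
for author, path in {"CBronte": "cbronte.txt", "CDickens": "cdickens.txt"}.items():
    candidate = open(path, encoding="utf-8").read()
    print(author, round(chi_squared(disputed, candidate), 1))  # smaller = closer
```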

 

 

Stylometry – Authorship Attribution using QDA Miner/WordStat

Slide11.gif

This image shows the QDA Miner project containing works by three British fictionists (CharlesDickens, CharlotteBronte and JaneAusten), which was subsequently used to create a Machine Learning classification model in WordStat.

Slide13.png

This image shows the creation of a Machine Learning classification model in WordStat. The k-Nearest Neighbour learning method was used, with leave-one-out validation.

Results also show good model performance in terms of precision, accuracy and recall. Definitions of these Machine Learning model performance terms were given previously (see Nyakuengama 2018: https://dat-analytics.net/2018/07/28/use-of-rapidminer-auto-model-to-predict-customer-churn/).
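
WordStat’s classifier is proprietary, but an analogous pipeline can be sketched in Python with scikit-learn: TF-IDF features, a k-Nearest Neighbour learner and leave-one-out validation, reporting accuracy, precision and recall. The file names and the chunking choice below are illustrative assumptions, not WordStat’s internals.

```python
# Hypothetical scikit-learn analogue of the WordStat k-NN / leave-one-out setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import classification_report

# Split each candidate corpus into 5,000-word chunks so that
# leave-one-out validation has many folds to work with.
texts, labels = [], []
paths = {"CharlesDickens": "cdickens.txt",
         "CharlotteBronte": "cbronte.txt",
         "JaneAusten": "jausten.txt"}
for author, path in paths.items():
    words = open(path, encoding="utf-8").read().split()
    for i in range(0, len(words) - 5000, 5000):
        texts.append(" ".join(words[i:i + 5000]))
        labels.append(author)

model = make_pipeline(TfidfVectorizer(sublinear_tf=True),
                      KNeighborsClassifier(n_neighbors=3))

# Leave-one-out: train on all chunks but one, predict the held-out chunk.
predictions = cross_val_predict(model, texts, labels, cv=LeaveOneOut())
print(classification_report(labels, predictions))  # accuracy, precision, recall

# Refit on everything, then classify the anonymised mystery text.
model.fit(texts, labels)
print(model.predict([open("anonymised_test_case.txt", encoding="utf-8").read()]))
```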

  

Slide14

This image shows a step in QDA Miner just before the Machine Learning classification model was applied to the “disputed”/mystery corpus.

Slide15

This image was captured after the WordStat Machine Learning classification model had been applied in QDA Miner. Most importantly, it shows that the model correctly picked Charlotte Brontë as the mystery author of Jane Eyre.

Slide16.gif

This WordStat Group Dendrogram also suggested that the author of the mystery book (anonymised_test_case) was most likely CharlotteBronte.

Slide17.gif

This WordStat Correspondence Analysis chart also suggested that the author of the mystery case (anonymised_test_case) was most likely CharlotteBronte.

 

CONCLUSIONS

This short blog showcased two Machine Learning stylometric approaches, implemented using Python and the Provalis packages, QDA Miner and WordStat.

  • Both methods correctly identified Charlotte Brontë as the author of Jane Eyre. Each method yielded useful and complementary information.
  • We note that:
    • Python can handle large corpora but is programmatically more challenging than the Provalis packages.
    • The Provalis packages can just as easily handle several million words and billions of tokens; the trick is to use fast processors when processing large documents / corpora.

 

In our future blog:

  • Python and the Provalis packages, QDA Miner and WordStat, will be used to undertake more complex Machine Learning of unstructured texts.
  • We may also showcase the mapping and data visualization packages (namely Tibco Spotfire, Tableau and Power BI) which we currently use synergistically with other advanced data analytics tools (such as Stata and RapidMiner).

 

 

ACKNOWLEDGEMENTS

We thank the owner of the Early British Fictionists GitHub resource: A_Small_Collection_of_British_Fiction.

We thank Dr Normand Peladeau for QDA Miner and WordStat and his associated webinars:

  • Supervised and Unsupervised Machine Learning Features.
  • Webinar on the New Features of WordStat 8 – Content.

 

We thank Dr François Dominic Laramée for sharing the Python Jupyter Notebook used in his lesson: François Dominic Laramée, “Introduction to stylometry with Python,” The Programming Historian 7 (2018), https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python.

 

We thank Survey Design and Analysis Services (https://surveydesign.com.au/), vendor of QDA Miner, WordStat and Stata.

We thank Anaconda, distributors of Python and Jupyter Notebook.

 

 

BIBLIOGRAPHY

Dr Jan Rygl (2016): “PA153: Stylometric analysis of texts using machine learning techniques”, NLP Centre, Faculty of Informatics, Masaryk University.

Dr Normand Peladeau’s webinars on QDA Miner and WordStat:

  • Supervised and Unsupervised Machine Learning Features
  • Webinar on the New Features of WordStat 8 – Content

Dr François Dominic Laramée, “Introduction to stylometry with Python,” The Programming Historian 7 (2018), https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python

WordStat / QDA Miner Users Manuals: https://provalisresearch.com/


BIG DATA ANALYTICS AND VISUALIZATION OF CHICAGO DIVVY RIDES (From 2014 to 2017)

By Dr John Gwinyai Nyakuengama

(20 October 2018)

Slide1c

KEY WORDS

Chicago City; Divvy Bicycles (Divvy Rides) big dataset from 2014 to 2017; Big data analytics, visualization and mapping; Stata; R; RapidMiner Turbo Prep; Tibco Spotfire; Power BI; Google Maps

 

ABSTRACT

This study analysed the Chicago Divvy rides transactional big dataset collected between 2014 and 2017.

It found that over 13.5 million trips were taken during that period. In 2017 alone, over 590 Divvy stations operated more than 6,240 individual bikes.

The Chicago Divvy rides users (customers and subscribers) showed two distinct usage patterns, in terms of:

  • the number of Divvy rides and median trip duration, as well as year-on-year growth patterns;
  • the time of access to Divvy rides (by day of week and by time of day); and
  • the Divvy stations from which and to which users travelled.

The current study identified some big data merits and challenges in the Chicago Divvy rides dataset and showcased a number of big data analysis, visualization and mapping tools.

 

Slide2

Slide3c.gif

Slide4b.gif

 

Slide5

 

 

Slide6

Slide7a

Slide8

Slide9

Slide10

 

Slide12

Slide13

Slide14a

Slide15

 

 

Slide16c

 

Slide17b

 

 

 

Slide18

Slide19

Slide20

Slide21

Slide22

 

 

 

Slide23

Slide24

Slide25

Between 2014 and 2017, about 3.5 million Divvy rides were taken by customers and 9.7 million by subscribers. This means that customers and subscribers accounted for about a quarter and three quarters of all Divvy rides, respectively.

Slide26.gif

From 2014 to 2017, the number of Divvy ride customers steadily decreased. In contrast, the number of subscribers grew, albeit at a decreasing rate. In terms of Year-on-Year (YoY) changes in Divvy rides, the 2017 growth rate in subscribers was half that observed in 2016, and about a third of that in 2015.

Slide27.gif

Between 2014 and 2017, usage of Divvy rides by both customers and subscribers was seasonal, typically increasing markedly in the warmer summer months and steadily decreasing with the approach of winter. Nonetheless, the number of subscribers vastly outstripped that of customers in every month of the year. Also, it is noticeable that only subscriber numbers grew over the four-year period.

Slide28.gif

Day-of-week usage of Divvy rides was essentially reversed between customers and subscribers during the four years: customer usage was highest at the weekend and dropped to its lowest by mid-week, while the converse was true of subscribers.

Slide29.gif

The hour-of-day Divvy ride usage profiles of customers and subscribers were very different during 2014–2017:

  • Customer usage distribution was uni-modal, peaking in the afternoon (around 14H00 to 15H00, or 2 to 3 PM).
  • In contrast, subscriber usage distribution was bi-modal, with peaks during the morning rush hour (6H00 to 8H00, or 6 to 8 AM) and the evening rush hour (16H00 to 18H00, or 4 to 6 PM).

Also noticeable in the hourly Divvy ride usage profiles (reproducible with the pandas sketch below) are:

  • the steady, upward growth in subscriber numbers between 2014 and 2017; and
  • the jump in customer usage between 10H00 and 17H00 (10 AM and 5 PM) from 2014 to 2015; however, customer usage dropped off from 2016, particularly after 14H00 (2 PM).
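
For readers who want to reproduce these temporal profiles, a minimal pandas sketch follows. The file name is illustrative, and the column names follow the public Divvy trip schema (which varies slightly between years, one of the veracity issues noted later):

```python
# Minimal sketch of the hour-of-day usage profile by user type.
import pandas as pd

trips = pd.read_csv("Divvy_Trips_2017.csv", parse_dates=["starttime"])

trips["hour"] = trips["starttime"].dt.hour
profile = trips.groupby(["usertype", "hour"]).size().unstack("usertype")
print(profile)  # Customers: single afternoon peak; Subscribers: two rush-hour peaks

# Median trip duration in minutes by user type (tripduration is in seconds).
print(trips.groupby("usertype")["tripduration"].median() / 60)
```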

 

Slide30

Slide31.gif

During the study period, the median trip duration of customers was more than twice that of subscribers.

Slide32.gif

Generally, Divvy ride subscribers’ median ride duration increased during the warmer spring and summer months, then fell off sharply from autumn in the face of the approaching winter. By contrast, customers’ median ride duration was not as sharply seasonal, particularly in 2017.

Slide33.gif

The day-of-week profiles of customers’ median ride duration mirror those described previously for the number of rides by day of week. Of note, their median ride duration tended to increase between 2014 and 2015, but not beyond.

Median ride duration also increased significantly during weekends among subscribers.

Slide34.gif

Among Divvy ride customers, the median trip duration was highest between 8H00 and 15H00. Among subscribers, it was highest during the morning rush hour (7H00 to 9H00) and the afternoon rush hour (15H00 to 17H00).

Over the years, there was far less variability in median trip duration by hour of day among subscribers than among customers. Among customers, there was a substantial yearly increase in the duration of rides taken before 8H00, while the increase in median trip duration after 8H00 observed since 2014 had petered out by 2016.

Slide35

Slide36

The five busiest dates in 2017 among Divvy ride customers coincided with American public holidays, as shown above.

Slide37

The five busiest days of the week in 2017 among Divvy ride customers were Mondays, as shown above.

Slide38

The five busiest Divvy ride trip start times in 2017 among customers were in the afternoons around the Independence Day holiday, as shown above.

Slide39

Slide40

The five busiest morning rush hours among Divvy ride subscribers in 2017 were on the work dates shown above.

Slide41

Tuesday was the busiest day of the week in 2017 among Divvy ride subscribers, as shown above.

Slide42

Slide43

The five dates in 2017 with the busiest workday afternoons among Divvy ride subscribers are shown above.

Slide44

The five busiest afternoon rush hours among Divvy ride subscribers in 2017 were on the work dates shown above.

Slide45

Slide46

This map shows that 592 Divvy ride stations in Chicago were active in mid-2018.

Slide47

In 2017, most customers in Chicago took rides from and to the Divvy stations shown above.

Slide48

 

Slide49

In 2017, most subscribers took rides from and to the Chicago Divvy stations shown above during the morning rush hour.

Slide50

Slide51

In 2017, most subscribers took rides during the afternoon rush hour from and to the Chicago Divvy stations shown above.

 

Slide52

This study used a number of high-end, state-of-the-art big data tools at various stages to undertake data extraction, preparation, loading, analysis, exploration, visualization and mapping.

Below are screenshots from these tools:

Slide53

 

Slide Stata_final

 

Slide55

Slide56

Slide57

Slide58

Take home messages – a user-centric view

Divvy Rides rules, such as the requirement for regular bike check-ins depending on the purchased plan (e.g. annual membership, single ride, explorer pass, etc.), shape the bike usage trends reflected in the Divvy Rides transactional data.

 

Divvy rides dataset:

  • Is a great source of information and insights.
  • Reveals two distinct user types, and therefore two unique niches / market segments:
    • Customer: leisure / families / visitors
    • Subscriber: workers / business personnel
  • Shows that the two user types have distinct characteristics:
    • When they ride – temporal separation (different peak times and shapes)
    • How much they ride – rhythmic separation (number of rides and median ride duration)
    • From and to which Divvy stations they ride – geo-spatial separation (recreational vs business)
  • Is an invaluable information source for eco-friendly transportation in Chicago.

 

Take home messages – a data-centric view

Demerits – Big data problems:

  • Volume: lots of unit-level data
  • Velocity: rapid growth, particularly in the subscriber segment
  • Veracity:
    • Dummy codes used in demographic characteristics (e.g. 1900 as year of birth), to protect user privacy
    • Some inconsistent data variable names and geo-coding between the years

 

Merits  – Big data attractions and opportunity for expansion:

  • Big data – large volumes of unit-level data; a rich data source for data analytics pedagogy
  • Variety – Good infrastructure to capture real-time transactional data with both geographic and temporal attributes
  • Quantitative insights from rides usage, by type…and to a limited extent user type demographics
  • Invaluable information source for planning – eco-friendly transportation

 

 

Slide61

Slide62b4

Slide63

 

Slide64


Time Series Prediction of Daily Total Female Births in California – January, 1960

By Dr Gwinyai Nyakuengama

(3 October 2018)

 

KEY WORDS

ARFIMA; Time Series; Daily female births in California; Stata; R package – Prophet

 

ACKNOWLEDGEMENT

We gratefully acknowledge:

  • StataCorp; Survey Design and Analysis Services (https://surveydesign.com.au/); Dr Becketti (2013); and the authors of the R package Prophet, for their core macros in Stata and R;
  • datamarket.com for their data; and
  • some anonymous colleagues.

Collectively, these parties not only inspired but underpinned this blog.

 

OBJECTIVE

To produce a 30-day forecast of the daily total female births in California for January 1960.

 

METHOD

In this study:

  • Daily total female births for California reported in 1959 were accessed from datamarket.com.
  • Stata was used to test for stationarity in this time series.
  • Stata was used to fit an Auto-Regressive Fractionally Integrated Moving Average (ARFIMA) model and to predict the daily female births for January 1960.
  • Also, the R package Prophet was used to fit a time-series model with additive seasonality, meaning the effect of the seasonality is added to the trend to get the forecast.

 

RESULTS

Cali_daily_birth_Slide3

This Stata plot of the daily female births in California for 1959 showed that the series is highly volatile.

This was suggestive of:

  • a non-stationary time series and, most importantly, the existence of long-memory volatility in the series; and
  • an ARFIMA modelling solution to predict the daily female births for January 1960.

 

Cali_daily_birth_Slide4

These Stata auto-correlation and partial auto-correlation plots also suggested the presence of serial correlation in the female daily birth time series.

 

Cali_daily_birth_Slide5

Based on these Stata Dickey-Fuller test results, we failed to reject the null hypothesis of a random walk with a possible drift in the female daily births.
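
The same stationarity check can be reproduced outside Stata. Below is a minimal sketch using the adfuller function from Python’s statsmodels; the file and column names are illustrative:

```python
# Augmented Dickey-Fuller test on the 1959 daily female births series,
# mirroring Stata's dfuller. File and column names are illustrative.
import pandas as pd
from statsmodels.tsa.stattools import adfuller

births = pd.read_csv("daily-total-female-births-1959.csv",
                     parse_dates=["date"], index_col="date")["births"]

stat, pvalue, usedlag, nobs, crit, icbest = adfuller(births, regression="c")
print(f"ADF statistic: {stat:.3f}, p-value: {pvalue:.3f}")
# A large p-value means we fail to reject the unit-root (random walk) null.
```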

 

Cali_daily_birth_Slide6

In Stata, the commonly used criteria for choosing appropriate time series lags are Schwarz’s Bayesian information criterion (SBIC), Akaike’s information criterion (AIC), the Final Prediction Error (FPE) and the Hannan and Quinn information criterion (HQIC). It turns out that AIC works well on monthly data.

The above results from Stata’s vector auto-regression selection-order (varsoc) command indicate that the second lag (ar2) was picked by most decision criteria (i.e. FPE, AIC and HQIC). However, a lag of one period (ar1) was selected by the SBIC criterion.

 

Cali_daily_birth_Slide7

The dfgls command (a Stata module to compute the Dickey-Fuller/GLS unit root test):

  • calculated the optimal lag length as 6, 1 and 7, respectively, using a sequential t-test (Ng and Perron, 1995), the Schwert criterion (SC) and the “modified AIC” (MAIC); and
  • controls for a linear time trend by default, unlike the Stata dfuller or pperron commands.

Based on these results:

  • we failed to reject the null hypothesis of a random walk with drift in the daily female birth series; and
  • the daily female births were accurately estimated, judging by the relatively low root-mean-square error (rmse) of around seven daily female births, considering the high volatility of this time series.

 

Cali_daily_birth_Slide8

The above Stata ARFIMA regression results suggested:

  • a significant model fit and, more importantly,
  • that d, the fractional differencing parameter of the predicted series, reflected a significant fractionally integrated process with 0 < d < ½ (see the notation below); and
  • that the L1.ar and L1.ma terms were both significant.
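
For reference, the ARFIMA(p, d, q) model generalises ARIMA by allowing a fractional differencing order d; in standard notation:

```latex
% ARFIMA(p, d, q): AR and MA lag polynomials with fractional differencing
\Phi(L)\,(1 - L)^{d}\,(y_t - \mu) = \Theta(L)\,\varepsilon_t,
\qquad
(1 - L)^{d} = \sum_{k=0}^{\infty} \binom{d}{k} (-L)^{k}
```

With 0 < d < ½ the process is stationary but exhibits long memory: its autocorrelations decay hyperbolically rather than geometrically, which is exactly the behaviour suggested by the volatile births series.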

 

Cali_daily_birth_Slide9

This Stata plot shows:

  • that the dynamic forecast (xb prediction) obtained using the ARFIMA model faithfully tracked the observed daily female births throughout 1959; and
  • the 30-day, daily female birth prediction for January 1960, with the 90 per cent prediction intervals around the mean.

Cali_daily_birth_Slide10

Just focusing on the 30-day prediction from the Stata ARFIMA model:

  • the predicted daily female births in January 1960 averaged around 43 births, contained within the 90% CI bands (see next figure); and
  • on average, this prediction had a root-mean-square error (rmse) of 7 daily female births (see next figure).

 

Cali_daily_birth_Slide11

The Stata ARFIMA model’s 30-day predictions for January 1960 show:

  • around 43 daily female births, with a
  • root-mean-square error (rmse) of around 7 daily female births.
  • Note that this figure agrees with the estimate shown earlier, which was obtained using the Stata command dfgls.

 

We also predicted the births using the R package Prophet, with the prediction intervals tuned to a 90% CI, the same as in Stata.
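
The blog used Prophet’s R interface; Prophet exposes the same API in Python, so a minimal sketch of this step (with an illustrative file name) looks like:

```python
# Prophet forecast with a 90% prediction interval, matching the Stata bands.
# The package is "prophet" on PyPI (formerly "fbprophet").
import pandas as pd
from prophet import Prophet

df = pd.read_csv("daily-total-female-births-1959.csv")
df.columns = ["ds", "y"]  # Prophet expects these two column names

model = Prophet(interval_width=0.90)  # 90% CI, the same as in Stata
model.fit(df)

future = model.make_future_dataframe(periods=30)  # extend through January 1960
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(30))

model.plot(forecast)             # fitted series plus the 30-day forecast
model.plot_components(forecast)  # trend and weekly seasonality
```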

Cali_daily_birth_Slide14

This plot from the R package – Prophet shows:

  • periodicity in the daily female births; and
  • presence of outliers – dots outside the shaded blue area (90% CI).

The average root-mean-square error (rmse) from R was also around seven daily female births (7.2, to be exact).

 

Cali_daily_birth_Slide15

Just focusing on the 30-day prediction in January 1960, these two plots from the R package – Prophet show:

  • strong weekly periodicity in the daily female births; with
  • peaks every Tuesday and Wednesday and troughs every Sunday.

 

CONCLUSION

  • The Stata ARFIMA model was an excellent fit to the highly volatile daily female births in California for 1959.
  • On average, the Stata ARFIMA model predicted 43 daily female births (± seven births) for the month of January, 1960.
  • Pleasingly, both Stata and the R package – Prophet gave consistent and complementary results.
  • Additionally, the R package – Prophet picked up some strong weekly periodicity in the data – with most births occurring on Tuesdays and Wednesdays and the least births occurring on Sundays.

 

BIBLIOGRAPHY

Becketti, S. (2013): Introduction to Time Series Using Stata, 1st Edition, Stata Press. https://www.amazon.com/Introduction-Time-Using-Stata-Becketti/dp/1597181323

Ivanov, V. and Kilian, L. 2001. ‘A Practitioner’s Guide to Lag-Order Selection for Vector Autoregressions’. CEPR Discussion Paper no. 2685. London, Centre for Economic Policy Research. http://www.cepr.org/pubs/dps/DP2685.asp

Prophet: https://facebook.github.io/prophet/docs/quick_start.html

Prophet R package: June 15, 2018 https://cran.r-project.org/web/packages/prophet/prophet.pdf

StataCorp 2013: Stata Time-Series Reference Manual https://www.stata.com/manuals/ts.pdf.