By Dr Gwinyai Nyakuengama
(3 October 2018)
ARFIMA; Time Series; Daily female births in California; Stata; R package – Prophet
We gratefully acknowledge:
- StataCorp; Survey Design and Analysis Services (https://surveydesign.com.au/); Dr Becketti (2013); and the authors of the R package – Prophet, for their core macros in Stata and R;
- datamarket.com for their data; and
- some anonymous colleagues.
Collectively, these parties not only inspired but underpinned this blog.
The aim was to predict the daily total female births in California for 30 days in January 1960.
In this study:
- Daily total female births for California, reported in 1959, were accessed from datamarket.com.
- Stata was used to test for stationarity in this time series data.
- Stata was used to fit an Auto-Regressive Fractionally Integrated Moving Average (ARFIMA) model and to predict the daily female births for the month of January 1960.
- The R package – Prophet was also used to fit a time-series model with additive seasonality, meaning that the seasonal effect is added to the trend to obtain the forecast.
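The additive structure mentioned in the last point can be illustrated with a minimal Python sketch. This is not Prophet's implementation: the trend level and the weekday effects below are made-up numbers, and the function name is our own, chosen only to show how the two components combine.

```python
# Hypothetical components, for illustration only (not fitted values).
TREND = 42.0  # assumed flat trend, in births per day
WEEKDAY_EFFECT = {  # assumed additive weekly seasonality; the effects sum to zero
    "Mon": 0.5, "Tue": 2.0, "Wed": 2.5, "Thu": 1.0,
    "Fri": 0.0, "Sat": -2.0, "Sun": -4.0,
}


def additive_forecast(trend, seasonal_effect):
    """Additive model: the forecast is the trend plus the seasonal effect."""
    return trend + seasonal_effect


forecasts = {day: additive_forecast(TREND, effect)
             for day, effect in WEEKDAY_EFFECT.items()}

for day, value in forecasts.items():
    print(day, value)
```

In a multiplicative model the seasonal effect would scale the trend instead of being added to it; the additive form keeps the weekly swing the same size whatever the level of the trend.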
This Stata plot of the daily female births in California for 1959 showed that the data were highly volatile.
This was suggestive of:
- a non-stationary time series and, most importantly, long-memory volatility in the series; and
- the suitability of an ARFIMA model for predicting the daily female births for the month of January 1960.
These Stata auto-correlation and partial auto-correlation plots also suggested the presence of serial correlation in the female daily birth time series.
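For readers who want to see what an auto-correlation plot is built from, the following is a minimal Python sketch of the sample autocorrelation function. It is not the Stata routine; the function name and the toy series are our own, and the toy values are made up.

```python
def sample_acf(series, max_lag):
    """Return the sample autocorrelations r_0 .. r_max_lag of a numeric sequence."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares, used to normalise
    if c0 == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


# Made-up series with a cycle of length 8, for illustration only.
toy = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 4, 3, 2]
acf = sample_acf(toy, 4)
print([round(r, 3) for r in acf])
```

Autocorrelations that stay well away from zero over many lags are the kind of pattern that points to serial correlation in a series.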
Based on these Stata Dickey-Fuller test results, we failed to reject the null hypothesis of a random walk with a possible drift in the female daily births.
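The regression behind a basic Dickey-Fuller test can be sketched in a few lines of Python. This is a simplified illustration with a constant and no lagged differences, not Stata's dfuller command; the function name is our own, and the simulated series (mean 40, standard deviation 7) is an assumption used only to exercise the function, not the births data.

```python
import math
import random


def dickey_fuller_t(y):
    """t-statistic on beta in: y[t] - y[t-1] = alpha + beta * y[t-1] + e[t]."""
    x = y[:-1]                                   # lagged levels
    dy = [b - a for a, b in zip(y[:-1], y[1:])]  # first differences
    n = len(x)
    mean_x, mean_dy = sum(x) / n, sum(dy) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    sxy = sum((v - mean_x) * (d - mean_dy) for v, d in zip(x, dy))
    beta = sxy / sxx
    alpha = mean_dy - beta * mean_x
    rss = sum((d - alpha - beta * v) ** 2 for v, d in zip(x, dy))
    se_beta = math.sqrt(rss / (n - 2) / sxx)
    return beta / se_beta


random.seed(0)
noise = [random.gauss(40, 7) for _ in range(365)]  # simulated stationary series
t_stat = dickey_fuller_t(noise)
print(round(t_stat, 2))
```

The statistic is compared with Dickey-Fuller critical values rather than the usual t table (about -2.86 at the 5% level for the constant-only case in large samples). A value above the critical value, as reported for the births series, means the unit-root null is not rejected.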
In Stata, the commonly used criteria for choosing the lag order of a time series model are Schwarz’s Bayesian information criterion (SBIC), Akaike’s information criterion (AIC), the final prediction error (FPE) and the Hannan and Quinn information criterion (HQIC). AIC has been found to work well on monthly data (Ivanov and Kilian, 2001).
The above results from Stata’s vector autoregression lag-order selection command (varsoc) indicate that a lag order of two (ar2) was picked by most criteria (i.e. FPE, AIC and HQIC), whereas a lag order of one (ar1) was selected by SBIC.
The Stata dfgls command, which computes the Dickey-Fuller GLS unit root test:
- calculated the optimal lag length as 6, 1 and 7 using the sequential t-test (Ng and Perron, 1995), the Schwarz criterion (SC) and the “modified AIC” (MAIC), respectively; and
- controlled for a linear time trend by default, unlike the Stata dfuller or pperron commands.
Based on these results:
- we failed to reject the null hypothesis of a random walk with drift in the daily female births series; and
- the daily female births were accurately estimated, judging by the relatively low root-mean-square error (rmse) of around seven daily girl births, considering the high volatility of this time series.
The above Stata ARFIMA regression results suggested:
- a significant model fit;
- more importantly, that d, the fractional differencing parameter, was significant and reflected a fractionally integrated process with 0 < d < ½; and
- that the L1.ar and L1.ma terms were both significant.
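To see what a fractional d means in practice, here is a minimal Python sketch of the weights in the expansion of (1 - L)^d, where L is the lag operator. It is not the Stata arfima estimator. The value d = 0.3 is an assumed example inside the 0 < d < ½ range, not the fitted estimate, and the function names are our own.

```python
def frac_diff_weights(d, n_weights):
    """Coefficients w_k in (1 - L)^d = sum_k w_k L^k.

    Uses the recursion w_0 = 1, w_k = w_{k-1} * (k - 1 - d) / k.
    """
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


def frac_diff(series, d):
    """Apply fractional differencing, truncated at the start of the sample."""
    w = frac_diff_weights(d, len(series))
    return [
        sum(w[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]


w_long = frac_diff_weights(0.3, 6)  # assumed d in (0, 1/2): slowly decaying weights
w_unit = frac_diff_weights(1.0, 6)  # d = 1: ordinary first differencing

print([round(w, 4) for w in w_long])
print(w_unit)
```

With d = 1 the weights collapse to first differencing (1, -1, then zeros). For 0 < d < ½ they shrink slowly instead of cutting off, which is how the model lets distant observations keep some influence on the present.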
This Stata plot shows:
- that the dynamic forecast (xb prediction) obtained using the ARFIMA model faithfully tracked the observed daily female births throughout 1959; and
- the 30-day, daily female birth prediction for January 1960, with the 90 per cent prediction intervals around the mean.
Just focusing on the 30-day prediction from the Stata ARFIMA model:
- daily female births in January 1960 were predicted at around 43 per day, contained within the 90% CI bands (see next figure); and
- on average, this prediction had a root-mean-square error (rmse) of 7 daily female births (see next figure).
The Stata ARFIMA model’s 30-day predictions for January 1960 show:
- around 43 daily female births; and
- a root-mean-square error (rmse) of around 7 daily female births, which agrees with the estimate obtained earlier using the Stata dfgls command.
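As a rough guide to how these two numbers relate to a 90% band, the Python sketch below computes an rmse and a normal-approximation interval. The normality assumption, the function names and the toy observations are ours; Stata's own prediction intervals come from the fitted model and need not match this approximation.

```python
import math
from statistics import NormalDist


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def normal_interval(point, std_error, level=0.90):
    """Two-sided interval assuming normally distributed forecast errors."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for a 90% level
    return point - z * std_error, point + z * std_error


# Made-up values, for illustration only.
toy_observed = [40, 50, 36, 46]
toy_predicted = [43, 43, 43, 43]
toy_rmse = rmse(toy_observed, toy_predicted)

# Rounded figures reported above: about 43 births per day, rmse of about 7.
low, high = normal_interval(43, 7)
print(round(toy_rmse, 3), round(low, 1), round(high, 1))
```

Under this approximation, a forecast of 43 with an rmse of 7 corresponds to a 90% band roughly 11.5 births either side of the mean.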
We also predicted the births using the R package – Prophet, with the prediction intervals set to 90%, the same as in Stata.
This plot from the R package – Prophet shows:
- periodicity in the daily female births; and
- presence of outliers – dots outside the shaded blue area (90% CI).
The average root-mean-square error (rmse) from R was also around seven daily female births (7.2, to be precise).
Just focusing on the 30-day prediction in January 1960, these two plots from the R package – Prophet show:
- strong weekly periodicity in the daily female births; and
- peaks every Tuesday and Wednesday and troughs every Sunday.
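A simple way to check for a weekly pattern like this by hand is to average the series by day of the week. The Python sketch below does that on a made-up two-week series; it is not how Prophet estimates its weekly component, and the function name and values are our own. The only real fact used is that 1 January 1959 fell on a Thursday.

```python
from collections import defaultdict
from datetime import date, timedelta

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_means(start, values):
    """Average a daily series by day of week; `start` is the date of values[0]."""
    buckets = defaultdict(list)
    for offset, value in enumerate(values):
        day = start + timedelta(days=offset)
        buckets[DAY_NAMES[day.weekday()]].append(value)  # Monday is index 0
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


# Made-up two-week series starting on Thursday 1 January 1959.
toy = [41, 40, 38, 35, 42, 45, 46, 43, 40, 36, 33, 42, 47, 44]
means = weekday_means(date(1959, 1, 1), toy)
peak = max(means, key=means.get)
trough = min(means, key=means.get)
print(peak, trough)
```

Applied to the full 1959 series, weekday averages of this kind give a quick sanity check on the weekly component that Prophet reports.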
- The Stata ARFIMA model was an excellent fit of the highly volatile, daily female births in California for 1959.
- On average, the Stata ARFIMA model predicted 43 daily female births (+/- seven births) for the month of January 1960.
- Pleasingly, both Stata and the R package – Prophet gave consistent and complementary results.
- Additionally, the R package – Prophet picked up strong weekly periodicity in the data, with the most births occurring on Tuesdays and Wednesdays and the fewest on Sundays.
Becketti, S. (2013). Introduction to Time Series Using Stata, 1st Edition. Stata Press. https://www.amazon.com/Introduction-Time-Using-Stata-Becketti/dp/1597181323
Ivanov, V. and Kilian, L. (2001). ‘A Practitioner’s Guide to Lag-Order Selection for Vector Autoregressions’. CEPR Discussion Paper no. 2685. London, Centre for Economic Policy Research. http://www.cepr.org/pubs/dps/DP2685.asp
Prophet R package (June 15, 2018). https://cran.r-project.org/web/packages/prophet/prophet.pdf
StataCorp (2013). Stata Time-Series Reference Manual. https://www.stata.com/manuals/ts.pdf