By Dr Gwinyai Nyakuengama
(1 October 2018)
TITLE: Most frequent words in the e-book “The Murray River – Being a Journal of the Voyage of the Lady Augusta Steamer from the Goolwa, in South Australia, to Gannewarra, above Swan Hill, Victoria. A distance from the sea mouth of 1400 miles”
E-book Author: Arthur Kinloch
Release Date: August 1, 2018 [EBook #57618]
Language: English
ACKNOWLEDGEMENTS: The text was sourced, with thanks, from Project Gutenberg.
KEYWORDS
R; Word-cloud; The Murray River
INTRODUCTION
Welcome to our R blog!
In this short blog post, we will use R to visualize the most frequent words in an e-book.
PROCEDURE
We ran the following R code to create two word clouds: one from the original text, and one with the Project Gutenberg and copyright terms removed:
########### start of program #####################
# turn warnings off
options(warn=-1)
# install the required R packages (uncomment on first run)
#install.packages("tm") # for text mining
#install.packages("SnowballC") # for text stemming
#install.packages("wordcloud") # word-cloud generator
#install.packages("RColorBrewer") # color palettes
#install.packages("tidyverse") # tidyverse
# load the required R packages
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(tidyverse)
# load the text
filePath <- "http://www.gutenberg.org/files/57618/57618-0.txt"
text <- readLines(filePath)
# load text as a corpus
docs <- Corpus(VectorSource(text))
#inspect(docs)
# transform text: replace selected special characters with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# clean up text
# convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# remove numbers
docs <- tm_map(docs, removeNumbers)
# remove common English stop-words
docs <- tm_map(docs, removeWords, stopwords("english"))
# remove our own stop-words, specified as a character vector
docs <- tm_map(docs, removeWords, c("project", "gutenbergtm", "gutenberg", "one"))
# remove punctuation
docs <- tm_map(docs, removePunctuation)
# eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# create a term document matrix
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
# show the ten most frequent words
head(d, 10)
# create word-cloud
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
########### end of program #####################
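The program above suppresses the Gutenberg terms by adding them to the stop-word list. A different way to get a cloud "without the Gutenberg Project and copyright information" is to drop the licence header and footer lines before building the corpus. The sketch below is not part of the original program. It assumes the file contains the usual "*** START OF ..." and "*** END OF ..." marker lines, and the helper name strip_gutenberg and the toy input are made up for illustration.

```r
# Hypothetical helper (not in the original program): keep only the lines
# between the Project Gutenberg start and end marker lines.
strip_gutenberg <- function(lines) {
  start <- grep("^\\*\\*\\* ?START OF", lines)
  # No start marker found: return the text unchanged rather than guess
  if (length(start) == 0) return(lines)
  end <- grep("^\\*\\*\\* ?END OF", lines)
  end <- end[end > start[1]]
  # No end marker after the start marker: return the text unchanged
  if (length(end) == 0) return(lines)
  # Markers are adjacent, so there is no body text between them
  if (end[1] - start[1] < 2) return(character(0))
  lines[(start[1] + 1):(end[1] - 1)]
}

# Toy input standing in for readLines(filePath); the sentences are invented
sample_lines <- c(
  "The Project Gutenberg EBook of The Murray River",
  "*** START OF THIS PROJECT GUTENBERG EBOOK THE MURRAY RIVER ***",
  "The Lady Augusta left the Goolwa.",
  "The river was high.",
  "*** END OF THIS PROJECT GUTENBERG EBOOK THE MURRAY RIVER ***",
  "Licence text follows."
)
body <- strip_gutenberg(sample_lines)
print(body)
```

In the program, this could be applied as text <- strip_gutenberg(readLines(filePath)) before the corpus is created. If the markers are worded differently in a given file, the function returns the text unchanged.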
RESULTS
[Figure: word cloud of the most frequent words in The Murray River e-book, original text]
[Figure: word cloud of the most frequent words in The Murray River e-book, minus the Gutenberg Project and copyright information]
INTERPRETATION
The most frequent words in this e-book relate to the Murray River in South Australia. Words about steam navigation, excellency, the waters, the considerable travel distance (miles) and time, the countryside, stations, towns and the goods transported on the Lady Augusta steamer (e.g. sheep and wool) are also abundant.
We could refine the word cloud further by removing more ‘filler words’ (such as even, still, also, ever and within).
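As a minimal sketch of that refinement, the filler words can be filtered out of the frequency table before plotting. The data frame below is a toy stand-in with the same columns as d in the program; its counts are invented for illustration and are not results from the e-book. The names filler_words and d_clean are also assumptions.

```r
# Toy frequency table with the same columns as `d` in the program;
# the counts are made up for illustration only.
d <- data.frame(
  word = c("river", "even", "murray", "still", "miles", "also"),
  freq = c(40, 12, 35, 9, 20, 8),
  stringsAsFactors = FALSE
)

# Filler words named in the text
filler_words <- c("even", "still", "also", "ever", "within")

# Keep only the rows whose word is not in the filler list
d_clean <- d[!(d$word %in% filler_words), ]
print(d_clean)
```

The filtered table can then be passed to wordcloud() in place of d. An equivalent route is to add the same words to the custom stop-word vector given to removeWords in the program, which removes them before the term-document matrix is built.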