Most frequent words in the e-book “The Murray River”

By Dr Gwinyai Nyakuengama

(1 October 2018)

TITLE: Most frequent words in the e-book “The Murray River – Being a Journal of the Voyage of the Lady Augusta Steamer from the Goolwa, in South Australia, to Gannewarra, above Swan Hill, Victoria. A distance from the sea mouth of 1400 miles”

E-book Author: Arthur Kinloch

Release Date: August 1, 2018 [EBook #57618]

Language: English

ACKNOWLEDGEMENTS: The text was sourced, with thanks, from Project Gutenberg.

KEYWORDS

R; Word-cloud; The Murray River

INTRODUCTION

Welcome to our R blog!

In this short blog post, we will use R to visualize the most frequent words in an e-book.

PROCEDURE

We executed the following R code to create the word-clouds, first with and then without the Project Gutenberg and copyright information:

########### start of program #####################
#turn warnings off
options(warn=-1)

#install required R packages

#install.packages("tm") # for text mining
#install.packages("SnowballC") # for text stemming
#install.packages("wordcloud") # word-cloud generator
#install.packages("RColorBrewer") # color palettes
#install.packages("tidyverse") # tidyverse

# load the required R packages
# library() stops with an error if a package is missing, unlike require()
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(tidyverse)

# load the text
filePath <- "http://www.gutenberg.org/files/57618/57618-0.txt"
text <- readLines(filePath)

# load text as a corpus
docs <- Corpus(VectorSource(text))

#inspect(docs)

# transform text
# replace special characters with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# clean up text
# convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# remove numbers
docs <- tm_map(docs, removeNumbers)
# remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# remove own stop-words,
# specified as a character vector
docs <- tm_map(docs, removeWords, c("project", "gutenbergtm", "gutenberg", "one"))
# remove punctuations
docs <- tm_map(docs, removePunctuation)
# eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)

# create a term document matrix
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
# show the ten most frequent words
head(d, 10)

# create word-cloud
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

########### end of program #####################
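The program above filters out the Project Gutenberg wording by adding custom stop-words. As an alternative, and only as a minimal sketch, the licence header and footer could be cut away before building the corpus. The sketch assumes the file contains the usual "*** START OF ..." and "*** END OF ..." marker lines; that marker wording is an assumption, and strip_gutenberg() is a hypothetical helper, not a function from tm. A toy character vector stands in for the output of readLines().

```r
# Toy stand-in for the lines returned by readLines(); the marker wording is
# an assumption about the usual Project Gutenberg plain-text layout.
text <- c(
  "The Project Gutenberg EBook of The Murray River",
  "*** START OF THIS PROJECT GUTENBERG EBOOK THE MURRAY RIVER ***",
  "The steamer left the Goolwa at noon.",
  "We passed several sheep stations.",
  "*** END OF THIS PROJECT GUTENBERG EBOOK THE MURRAY RIVER ***",
  "This eBook is for the use of anyone anywhere."
)

# Hypothetical helper: keep only the lines between the two marker lines
strip_gutenberg <- function(lines) {
  start <- grep("^\\*\\*\\* ?START OF", lines)
  end <- grep("^\\*\\*\\* ?END OF", lines)
  # fall back to the full text if either marker is missing
  if (length(start) == 0 || length(end) == 0) return(lines)
  first <- start[1] + 1
  last <- end[1] - 1
  # nothing lies between the markers
  if (first > last) return(character(0))
  lines[first:last]
}

body_text <- strip_gutenberg(text)
print(body_text)
```

The result could then be passed to Corpus(VectorSource(body_text)) in place of the full text, which would leave the custom stop-word list for genuine filler words only.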

RESULTS

The most frequent words in The Murray River e-book, original text

[Word-cloud image: original text]

The most frequent words in The Murray River e-book, minus the Project Gutenberg and copyright information

[Word-cloud image: modified text]

INTERPRETATION

The most frequent words in this e-book refer to the Murray River itself and to South Australia. Also abundant are words about steam navigation, excellency, the waters, the considerable travel distance (miles) and time, the countryside, stations and towns, and the goods transported on the Lady Augusta steamer (e.g. sheep and wool).

We could refine the word-cloud further by removing more filler words (such as even, still, also, ever and within).
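One way this could be done, sketched under stated assumptions, is to drop the filler words from the frequency table before plotting. The data frame below is a toy stand-in with the same columns as d in the program; its counts are invented for illustration and are not results from the e-book. The name filler_words is also introduced here only for the example.

```r
# Toy frequency table with the same columns as `d` in the program;
# the counts are made up for illustration only.
d <- data.frame(
  word = c("river", "murray", "even", "miles", "still", "also"),
  freq = c(12, 10, 7, 6, 5, 4),
  stringsAsFactors = FALSE
)

# filler words named in the text above
filler_words <- c("even", "still", "also", "ever", "within")

# keep only the rows whose word is not a filler word
d_clean <- d[!(d$word %in% filler_words), ]
print(d_clean)
```

The same effect could probably be had earlier in the pipeline by adding these words to the character vector passed to removeWords in the program.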
