My first work with R

This was the very first work I did with the help of R, on 16th June 2017!

A word cloud is a visual representation of text data in which the most frequently used words stand out. Word clouds are simple and easy to understand.

The packages used to generate the word cloud are:
tm - for text mining
SnowballC - for text stemming
wordcloud - word cloud generator
RColorBrewer - for color palettes

Step 1 - Install and Load the required packages
The packages can be installed using the code install.packages("Package Name")
For example, install.packages("tm")

To load the package in our program, the code is library("Package Name")
For example, library("SnowballC")
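Putting Step 1 together for the four packages listed above (install once, then load in every session):

install.packages(c("tm", "SnowballC", "wordcloud", "RColorBrewer"))  # one-time install

library(tm)            # text mining
library(SnowballC)     # text stemming
library(wordcloud)     # word cloud generator
library(RColorBrewer)  # color palettes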

Step 2 - Text Mining
Set the file path to fetch the text document
filePath <- "~/leader.txt"

Read the text document using the function readLines()
text <- readLines(filePath)

Load the data as Corpus
docs <- Corpus(VectorSource(text))

Inspect the content
inspect(docs)

Step 3 - Text Transformation
content_transformer() wraps a function so that it can modify the content of the text. Here we define toSpace, a transformer that replaces a given pattern with a space:
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

Step 4 - Text Cleaning
tm_map() applies a transformation to every document in the corpus. It is used here to remove symbols like /, @ and | using toSpace, convert the text to lower case using tolower, remove numbers using removeNumbers, remove common English stopwords like "the" and "we" using removeWords with stopwords("english"), remove punctuation using removePunctuation, and strip unnecessary white space using stripWhitespace. Text stemming with stemDocument reduces each word to its root form by removing prefixes and suffixes; for example, arranging, arranged, arrangement and arranges are all reduced to the common stem arrang. We can also remove our own stopwords by passing removeWords a vector of words, as shown after the cleaning code below.
docs <- tm_map(docs, toSpace, "/")                       # replace "/" with a space
docs <- tm_map(docs, toSpace, "@")                       # replace "@" with a space
docs <- tm_map(docs, toSpace, "\\|")                     # replace "|" with a space
docs <- tm_map(docs, content_transformer(tolower))       # convert to lower case
docs <- tm_map(docs, removeNumbers)                      # remove numbers
docs <- tm_map(docs, removeWords, stopwords("english"))  # remove English stopwords
docs <- tm_map(docs, removePunctuation)                  # remove punctuation
docs <- tm_map(docs, stripWhitespace)                    # strip extra white space
docs <- tm_map(docs, stemDocument)                       # stem each word to its root
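One caveat worth noting (an addition to the original walkthrough): if you also want to drop your own stopwords, such as the example words mentioned above, give them in lower case, since the corpus has already been converted by the tolower step:

docs <- tm_map(docs, removeWords, c("meena", "data", "science"))  # custom stopwords, lower-cased to match the corpus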

Step 5 - Building the Term-Document Matrix
A term-document matrix is a table recording the frequency of each word in each document. The function TermDocumentMatrix() from the tm package builds it.
dtm <- TermDocumentMatrix(docs)
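As with the corpus earlier, you can inspect the matrix before converting it; inspect() prints a summary along with a sample of terms and their counts:

inspect(dtm)  # summary and sample of the term-document matrix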

Converting the result into matrix format using the function as.matrix()
m <- as.matrix(dtm)

The words are sorted in descending order of frequency.
v <- sort(rowSums(m),decreasing=TRUE)

The result is converted to a data frame, and the first 10 words are displayed.
d <- data.frame(word = names(v),freq=v)
head(d, 10)

Step 6 - Word Cloud Generation
The function wordcloud() is used, with the attributes words, freq, min.freq, max.words, random.order, rot.per and colors.
To know more about the wordcloud() function, have a look at its help page with ?wordcloud.

set.seed(1234)  # fix the random seed so the layout is reproducible
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

Step 7 - Explore frequent terms
To find the words that appear at least four times, the function findFreqTerms() is used.
findFreqTerms(dtm, lowfreq = 4)

To find the terms associated with a given term (here "freedom") above a correlation limit, the findAssocs() function is used.
findAssocs(dtm, terms = "freedom", corlimit = 0.3)

To plot the word frequencies, barplot() is used:
barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")


This is the output: the word cloud for the given text document, with the most prominent words being leader, team, dhoni and leadership.
This is the barplot of the words with the highest frequencies.



This is the text document, on the topic "Leadership Lessons from MSD", for which the word cloud was generated.
