My first work with R

This was the very first work I did with the help of R, on 16th June 2017!

A word cloud is a visual representation of text data in which the most frequently used words stand out. Word clouds are simple and easy to understand.

The packages used to generate the word cloud are:
tm - for text mining
SnowballC - for text stemming
wordcloud - word cloud generator
RColorBrewer - for color palettes

Step 1 - Install and Load the required packages
The packages can be installed using the code install.packages("Package Name")
For example, install.packages("tm")

To load the package in our program, the code is library("Package Name")
For example, library("SnowballC")
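Putting Step 1 together for the four packages listed above (install once, then load in every session):

install.packages(c("tm", "SnowballC", "wordcloud", "RColorBrewer"))  # one-time install

library(tm)            # text mining
library(SnowballC)     # text stemming
library(wordcloud)     # word cloud generator
library(RColorBrewer)  # color palettes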

Step 2 - Text Mining
Set the file path to fetch the text document
filePath <- "~/leader.txt"

Read the text document using the function readLines()
text <- readLines(filePath)

Load the data as Corpus
docs <- Corpus(VectorSource(text))

Inspect the content
inspect(docs)

Step 3 - Text Transformation
content_transformer() wraps a function so that it can modify the content of the text. Here we define toSpace, a transformer that replaces a given pattern with a space:
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

Step 4 - Text Cleaning
tm_map() applies a transformation to every document in the corpus. It is used here to remove symbols like /, @ and | using toSpace, convert the text to lower case using tolower, remove numbers using removeNumbers, remove common English stopwords like "the" and "we" using removeWords with stopwords("english"), remove punctuation using removePunctuation, and strip unnecessary white space using stripWhitespace. Text stemming with stemDocument reduces each word to its root form by removing prefixes and suffixes; for example, arranging, arranged, arrangement and arranges are all reduced to the common stem arrang. We can also remove our own stopwords by passing removeWords a vector of words, as shown after the cleaning code below.
docs <- tm_map(docs, toSpace, "/")                       # replace "/" with a space
docs <- tm_map(docs, toSpace, "@")                       # replace "@" with a space
docs <- tm_map(docs, toSpace, "\\|")                     # replace "|" with a space
docs <- tm_map(docs, content_transformer(tolower))       # convert to lower case
docs <- tm_map(docs, removeNumbers)                      # remove numbers
docs <- tm_map(docs, removeWords, stopwords("english"))  # remove English stopwords
docs <- tm_map(docs, removePunctuation)                  # remove punctuation
docs <- tm_map(docs, stripWhitespace)                    # strip extra white space
docs <- tm_map(docs, stemDocument)                       # stem each word to its root
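One caveat worth noting (an addition to the original walkthrough): if you also want to drop your own stopwords, such as the example words mentioned above, give them in lower case, since the corpus has already been converted by the tolower step:

docs <- tm_map(docs, removeWords, c("meena", "data", "science"))  # custom stopwords, lower-cased to match the corpus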

Step 5 - Building the Term-Document Matrix
A term-document matrix is a table recording the frequency of each word in each document. The function TermDocumentMatrix() from the tm package builds it.
dtm <- TermDocumentMatrix(docs)
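As with the corpus earlier, you can inspect the matrix before converting it; inspect() prints a summary along with a sample of terms and their counts:

inspect(dtm)  # summary and sample of the term-document matrix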

Converting the result into matrix format using the function as.matrix()
m <- as.matrix(dtm)

The words are sorted in descending order of frequency.
v <- sort(rowSums(m),decreasing=TRUE)

The result is converted to a data frame, and the first 10 words are displayed.
d <- data.frame(word = names(v),freq=v)
head(d, 10)

Step 6 - Word Cloud Generation
The function wordcloud() is used, with the attributes words, freq, min.freq, max.words, random.order, rot.per and colors.
To know more about the wordcloud() function, have a look at its help page with ?wordcloud.

set.seed(1234)  # fix the random seed so the layout is reproducible
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

Step 7 - Explore frequent terms
To find the words that appear at least four times, the function findFreqTerms() is used.
findFreqTerms(dtm, lowfreq = 4)

To find the terms associated with a given term (here "freedom") above a correlation limit, the findAssocs() function is used.
findAssocs(dtm, terms = "freedom", corlimit = 0.3)

To plot the word frequencies, barplot() is used:
barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")


This is the output: the word cloud for the given text document, with the most prominent words being leader, team, dhoni and leadership.
This is the barplot of the words with the highest frequencies.



This is the text document, on the topic "Leadership Lessons from MSD", for which the word cloud was generated.
