This document is a brief intro to basic text analysis in R, going from raw texts to the document-term matrix.
R has many built-in and packaged data sources that you can use to play with.
For example, you can download books from the Project Gutenberg library of public domain works:
library(gutenbergr)
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
tail(books)
Many packages provide data sets. For example, corpustools has the State of the Union addresses of Bush and Obama:
data("sotu_texts", package="corpustools")
View(sotu_texts)
And quanteda has the inaugural addresses of all US presidents:
library(quanteda)
head(docvars(data_corpus_inaugural))
Of course, R probably doesn’t have your data neatly packaged for you. If your data is a csv file with a text column and some metadata, you can use read.csv directly.
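For illustration, a minimal sketch that writes and then reads such a csv (the file and the id/text column names are made up for this example):

```r
# Hypothetical csv with an 'id' column and a 'text' column
csv_file = tempfile(fileext = ".csv")
writeLines(c("id,text",
             "1,This is the first document",
             "2,And this is the second"), csv_file)

# read.csv gives a data.frame with one row per document
d = read.csv(csv_file, stringsAsFactors = FALSE)
nrow(d)     # 2
d$text[1]   # "This is the first document"
```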
If you have text files, the versatile readtext package can read csv files but also (zipped) folders containing text, Word, or PDF documents, from file or directly from the internet:
library(readtext)
d = readtext("/home/wva/research/texts")
url = "http://bit.ly/2uhqjJE?.csv"
d2 = readtext(url, text_field = "texts")
A term-document matrix is a matrix with documents in the rows, terms (words) in the columns, and the frequency of each term in each document in the cells.
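To make the idea concrete, here is a minimal sketch that builds such a matrix by hand with base R (toy documents, no package needed):

```r
# Two toy documents
docs = c(doc1 = "this is a text", doc2 = "this is this more texts")

# Tokenize on spaces and collect the vocabulary
tokens = strsplit(docs, " ")
terms = sort(unique(unlist(tokens)))

# Count each term per document: documents in rows, terms in columns
tdm = t(sapply(tokens, function(x) table(factor(x, levels = terms))))
tdm["doc2", "this"]  # 2: "this" occurs twice in doc2
```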
We can use the quanteda package to easily create a dfm from raw text. Note that quanteda uses the term document-feature matrix, since the columns can also contain other features, such as word pairs; hence the function is called dfm:
library(quanteda)
dfm(c("This is a text!", "this, is this more texts?"))
## Document-feature matrix of: 2 documents, 9 features (38.9% sparse).
## 2 x 9 sparse Matrix of class "dfm"
## features
## docs this is a text ! , more texts ?
## text1 1 1 1 1 1 0 0 0 0
## text2 2 1 0 0 0 1 1 1 1
As you can see, the word “this” occurs once in text1, and twice in text2.
Note: Because most words don’t occur in most documents, a dtm is often 99% zeroes (sparse). Internally, the dtm is stored in a sparse format, which means that the zeroes are not stored, so you can create a dtm of many documents and many words without running into memory problems.
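As an illustration of why sparse storage matters, here is a sketch using the Matrix package (which quanteda builds on): only the non-zero cells take up space.

```r
library(Matrix)

# A 2 x 1000 matrix with only two non-zero cells
m = sparseMatrix(i = c(1, 2), j = c(1, 3), x = c(1, 2), dims = c(2, 1000))

length(m@x)                                 # 2: only the non-zeroes are stored
object.size(m) < object.size(as.matrix(m))  # TRUE: far smaller than the dense version
```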
We might wish to remove features that we don’t consider relevant, such as punctuation, stopwords, and the difference between plural and singular forms (removed by stemming):
dfm(c("This is a text!", "this, is this more texts?"), stem=T, remove=stopwords("english"), remove_punct=T)
## Document-feature matrix of: 2 documents, 1 feature (0% sparse).
## 2 x 1 sparse Matrix of class "dfm"
## features
## docs text
## text1 1
## text2 1
As you can see, the only non-stopword left is text, present in both documents (the plural texts is stemmed to text).
The dfm function is like a Swiss army knife: it has many options for cleaning and preprocessing, and creates the dfm in a single call. It can be instructive to go through the steps one by one to have greater control over what happens:
library(corpustools)
tokens = tokens(sotu_texts$text, remove_punct = T)
tokens = tokens_tolower(tokens)
tokens = tokens_remove(tokens, c(stopwords("english")))
tokens = tokens_wordstem(tokens, "english")
dfm = dfm(tokens)
dfm = dfm_trim(dfm, min_docfreq = 5)
dfm
## Document-feature matrix of: 1,090 documents, 1,418 features (97.9% sparse).
You can also easily create a word cloud (who doesn’t like word clouds?):
textplot_wordcloud(dfm, max_words = 100)
Another useful feature is corpus comparison. By comparing a subset of documents to the rest, we can see which words are overrepresented in that subset. Here, let’s compare Bush to Obama:
target = which(sotu_texts$party=="Republicans")
cmp = textstat_keyness(dfm, target)
head(cmp)
| feature | chi2 | p | n_target | n_reference |
|---|---|---|---|---|
| must | 72.01472 | 0 | 188 | 54 |
| terrorist | 69.19246 | 0 | 104 | 13 |
| freedom | 63.33807 | 0 | 86 | 8 |
| iraqi | 59.30715 | 0 | 69 | 3 |
| iraq | 54.94805 | 0 | 93 | 15 |
| terror | 54.42684 | 0 | 64 | 3 |
Quanteda also has a nice barplot to plot this ‘keyness’:
textplot_keyness(cmp)