Basic text analysis in R

This document is a brief introduction to basic text analysis in R, going from raw text to the document-term matrix.

Getting data

Built-in data sources

R has many built-in data sources that you can use to play around with.

For example, you can download books from the Project Gutenberg library of public-domain works using the gutenbergr package:

library(gutenbergr)
books <- gutenberg_download(c(768, 1260), meta_fields = "title")

Many packages also provide data sets. For example, corpustools has the State of the Union addresses of Bush and Obama:

data("sotu_texts", package="corpustools")

And quanteda has the inaugural addresses of all US presidents:

summary(quanteda::data_corpus_inaugural)

Reading your own data

Of course, R probably doesn’t have your data neatly packaged for you. If your data is a csv file with a text column and some metadata, you can read it directly with read.csv.

If you have text files, the versatile readtext package can read csv files, but also (zipped) folders containing plain text, Word, or PDF documents, either from file or directly from the internet:

library(readtext)
d = readtext("/home/wva/research/texts")
url = ""
d2 = readtext(url, text_field = "texts")

Creating the document-term matrix

A document-term matrix (dtm) is a matrix with documents in the rows, terms (words) in the columns, and the frequency of each term in each document in the cells.

We can use the quanteda package to easily create a dfm from raw text. Note that quanteda uses the term document-feature matrix, since the columns can also hold features other than words, such as word pairs; hence the function is called dfm:

dfm(c("This is a text!", "this, is this more texts?"))
## Document-feature matrix of: 2 documents, 9 features (38.9% sparse).
## 2 x 9 sparse Matrix of class "dfm"
##        features
## docs    this is a text ! , more texts ?
##   text1    1  1 1    1 1 0    0     0 0
##   text2    2  1 0    0 0 1    1     1 1

As you can see, the word “this” occurs once in text1, and twice in text2.

Note: Because most words don’t occur in most documents, a dtm is often 99% zeroes (sparse). Internally, the dtm is stored in a sparse format, which means that the zeroes are not stored, so you can create a dtm of many documents and many words without running into memory problems.
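As an illustration, quanteda’s sparsity() function reports the proportion of zero cells directly. A small sketch using the same two toy documents as above (assuming a quanteda version where dfm() accepts raw text, as in the examples here):

```r
library(quanteda)

# The same two toy documents as above
d = dfm(c("This is a text!", "this, is this more texts?"))

# Proportion of cells in the dfm that are zero
sparsity(d)
## 0.3888889  (the "38.9% sparse" reported when printing the dfm)
```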

We might wish to remove features that we don’t consider relevant, such as punctuation, stopwords, and the difference between plural and singular forms (which stemming removes):

dfm(c("This is a text!", "this, is this more texts?"), stem=TRUE, remove=stopwords("english"), remove_punct=TRUE)
## Document-feature matrix of: 2 documents, 1 feature (0% sparse).
## 2 x 1 sparse Matrix of class "dfm"
##        features
## docs    text
##   text1    1
##   text2    1

As you can see, the only non-stopword left is “text”, present in both documents (the plural “texts” is stemmed to “text”).

Quanteda step-by-step

The dfm function is like a Swiss army knife with many options: it does cleaning and preprocessing and creates the dfm in a single call. It can be instructive to go through these steps one by one to have greater control over what happens:

tokens = tokens(sotu_texts$text, remove_punct = TRUE)
tokens = tokens_tolower(tokens)
tokens = tokens_remove(tokens, c(stopwords("english")))
tokens = tokens_wordstem(tokens, "english")
dfm = dfm(tokens)
dfm = dfm_trim(dfm, min_docfreq = 5)
## Document-feature matrix of: 1,090 documents, 1,418 features (97.9% sparse).
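To get a feel for what is left after these steps, you can list the most frequent terms in the dfm created above with topfeatures():

```r
# The 10 most frequent (lowercased, stemmed) terms in the trimmed dfm
topfeatures(dfm, n = 10)
```

quanteda’s textstat_frequency() gives similar information as a data frame, including document frequencies.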

You can also easily create a word cloud (who doesn’t like word clouds?):

textplot_wordcloud(dfm, max.words=100)

Corpus comparison

Another useful feature is corpus comparison. By comparing a subset of documents to the rest, we can see which words are overrepresented in that subset. Here, let’s compare Bush to Obama:

target = which(sotu_texts$party=="Republicans")
cmp = textstat_keyness(dfm, target)
head(cmp)
##               chi2 p n_target n_reference
## must      72.01472 0      188          54
## terrorist 69.19246 0      104          13
## freedom   63.33807 0       86           8
## iraqi     59.30715 0       69           3
## iraq      54.94805 0       93          15
## terror    54.42684 0       64           3

Quanteda also has a nice bar plot to visualize this ‘keyness’:
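A minimal sketch using the keyness statistics computed above (textplot_keyness() lives in the quanteda package in version 1; in quanteda 3 it moved to the quanteda.textplots package):

```r
# Bar plot of the terms most over- (and under-) represented
# in the target set (the Republican speeches) relative to the rest
textplot_keyness(cmp, n = 20)
```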