Fitting LDA models in R is technically quite simple: just call the LDA function from the topicmodels package. First, let's create a document-term matrix from the inaugural speeches in quanteda, at the paragraph level, since each paragraph can be expected to be mostly about a single topic:

library(quanteda)
# reshape the corpus so that each paragraph is a separate document
texts = corpus_reshape(data_corpus_inaugural, to = "paragraphs")
dfm = dfm(texts, remove_punct = TRUE, remove = stopwords("english"))
# drop terms that occur in fewer than 5 documents
dfm = dfm_trim(dfm, min_docfreq = 5)
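
Before fitting a model, it can be useful to quickly inspect the result (dim and topfeatures are standard quanteda functions):

dim(dfm)             # number of documents and features
topfeatures(dfm, 10) # most frequent remaining terms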

To run LDA on a dfm, first convert it to the topicmodels format and then call LDA. Note the use of set.seed() to make sure that the analysis is reproducible.

library(topicmodels)
dtm = convert(dfm, to = "topicmodels")
set.seed(1)
# a low alpha encourages documents to load on only a few topics
m = LDA(dtm, method = "Gibbs", k = 10, control = list(alpha = 0.1))
m
m
## A LDA_Gibbs topic model with 10 topics.
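
The number of topics k and the hyperparameter alpha are choices you have to make. As a rough quantitative check, you can compare the log-likelihood of models fitted with different values of k; a minimal sketch (not part of the analysis below), which should always be complemented by qualitative inspection of the topics:

for (k in c(5, 10, 20)) {
  set.seed(1)
  m_k = LDA(dtm, method = "Gibbs", k = k, control = list(alpha = 0.1))
  cat("k =", k, "logLik:", logLik(m_k), "\n")
}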

Inspecting LDA results

We can use the terms function to look at the top terms per topic:

terms(m, 5)
Topic 1  Topic 2  Topic 3     Topic 4    Topic 5  Topic 6  Topic 7     Topic 8  Topic 9  Topic 10
great    upon     government  us         can      nations  government  us       world    government
years    shall    people      god        every    peace    public      let      peace    shall
states   duties   states      day        america  war      business    can      freedom  congress
now      country  union       president  must     foreign  revenue     new      people   may
upon     people   every       new        country  united   can         must     free     law

The posterior function gives the posterior distributions of words over topics and of topics over documents. We can use the word distribution to plot a word cloud of a topic's terms, sized proportionally to their probability:

topic = 6
words = posterior(m)$terms[topic, ]
topwords = head(sort(words, decreasing = TRUE), n = 50)
head(topwords)
##    nations      peace        war    foreign     united     states 
## 0.02084942 0.01905361 0.01779653 0.01582114 0.01564156 0.01312741

Now we can plot these words:

library(wordcloud)
wordcloud(names(topwords), topwords)
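
wordcloud accepts additional arguments if you want more control over the plot, for example its standard scale, random.order, and colors arguments:

library(RColorBrewer)
wordcloud(names(topwords), topwords, scale = c(4, 0.5),
          random.order = FALSE, colors = brewer.pal(6, "Dark2"))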

We can also look at the topics per document, to find the top documents per topic:

topic.docs = posterior(m)$topics[, topic]
topic.docs = sort(topic.docs, decreasing = TRUE)
head(topic.docs)
##   1949-Truman.43 1805-Jefferson.3  1901-McKinley.2   1949-Truman.25 
##        0.9181818        0.9093023        0.9050000        0.8882353 
##   1813-Madison.7 1889-Harrison.22 
##        0.8714286        0.8627907

And we can find this document in the original texts by looking up its document id in the document variables (docvars):

docs = docvars(dfm)
topdoc = names(topic.docs)[1]             # id of the top document for this topic
docid = which(rownames(docs) == topdoc)
texts[docid]
##                                                                                                                                      1949-Truman.43 
## "In addition, we will provide military advice and equipment to free nations which will cooperate with us in the maintenance of peace and security."

Finally, we can see which presidents preferred which topics:

# keep only the documents that are also in the topic model
docs = docs[rownames(docs) %in% rownames(dtm), ]
tpp = aggregate(posterior(m)$topics, by = docs["President"], mean)
rownames(tpp) = tpp$President
heatmap(as.matrix(tpp[-1]))

As you can see, the topics form a sort of 'block' distribution, with modern and older presidents using quite different topics. So either the role of presidents changed, or language use changed, or (probably) both.
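
Since the docvars also contain the year of each speech, we can make this temporal pattern explicit by aggregating the topic proportions per year; a minimal sketch for the topic selected above:

tpy = aggregate(posterior(m)$topics, by = docs["Year"], mean)
# column 1 of tpy is Year, so topic i is column i + 1
plot(tpy$Year, tpy[[topic + 1]], type = "l",
     xlab = "Year", ylab = paste("Proportion of topic", topic))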

To get a better fit of such temporal dynamics, see the session on structural topic models, which allow you to condition topic proportions and/or contents on metadata covariates such as source or date.
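
As a preview, the workflow is similar; a minimal sketch with the stm package (assumed to be installed), conditioning topic prevalence on the year of the speech:

library(stm)
out = convert(dfm, to = "stm")  # the documents, vocab, and meta components stm needs
stm_m = stm(out$documents, out$vocab, K = 10,
            prevalence = ~ s(Year), data = out$meta)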

LDA from corpustools

Corpustools is a text analysis package that is based on a token list or tcorpus rather than a dtm: basically, a list of words per document. This is less efficient than a dtm, but it preserves word order, which allows you to do things that are impossible with a dtm, such as sliding-window associations (semantic networks), proximity conditions in dictionaries, and keyword-in-context lists.

It also allows you to reconstruct the original text from the tokens, which is useful for LDA since then we can view which words belonged to which topic in the actual documents.

First, we create a tcorpus, do some cleaning, and run lda_fit directly on the tcorpus. By running it on the tcorpus rather than on a dtm, the topic assigned to each token is stored internally:

library(corpustools)
tc = create_tcorpus(sotu_texts, doc_column = 'id')
# clean the tokens and store the result in a new 'feature' column
tc$preprocess('token', 'feature', min_docfreq = 5, remove_stopwords = TRUE)
set.seed(1)
m = tc$lda_fit('feature', create_feature = 'topic', K = 10, alpha = 0.01)

Now we can create an LDA browser using the tokenbrowser package. You need to install this from GitHub, as it hasn't been added to CRAN yet.

devtools::install_github("kasperwelbers/tokenbrowser")
library(tokenbrowser)
d = tc$get() # get the token information, including the assigned topic
url = categorical_browser(d, category = d$topic)
browseURL(url)

To find an interesting document to inspect, you can look at the top documents for a topic as above:

terms(m, 5)
Topic 1  Topic 2     Topic 3  Topic 4  Topic 5  Topic 6     Topic 7    Topic 8     Topic 9   Topic 10
energy   security    years    us       tax      jobs        schools    will        congress  health
will     will        year     people   will     new         education  iraq        will      care
years    social      next     can      year     economy     children   america     law       insurance
new      retirement  last     will     now      businesses  school     people      must      will
clean    budget      now      one      get      now         students   terrorists  ask       coverage
tail(sort(posterior(m)$topics[,8]), n=10)
## 111552534 111552613 111552705 111552717 111552512 111552706 111542855 
## 0.9823529 0.9873239 0.9936170 0.9936170 0.9944099 0.9944099 0.9947368 
## 111552707 111542967 111542943 
## 0.9947368 0.9955224 0.9957346

So document 111542943 is the document most strongly related to topic 8 (Iraq/terrorism).
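
To read this document, we can reconstruct its text from the token data d retrieved above (a quick sketch; this assumes the doc ids are stored as character values in the doc_id column):

paste(d$token[d$doc_id == "111542943"], collapse = " ")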

You can also find documents which contain mixtures of topics. First, let's see how the topics are correlated:

c = cor(posterior(m)$topics)
diag(c) = 0  # zero out the self-correlations so they don't dominate the heatmap
heatmap(c)

As you can see, topics 6 (jobs/economy) and 7 (education) are quite similar in their document distributions, so they should co-occur relatively frequently.

Let's find documents that contain both by calculating the minimum of their loadings and sorting on that:

p = as.data.frame(posterior(m)$topics)
# a high minimum means the document loads substantially on both topics
p$min = apply(p[, c(6, 7)], 1, min)
p = p[order(-p$min), ]
head(p)
                  1         2         3         4         5         6         7         8         9        10       min
111551947 0.0003215 0.0003215 0.0003215 0.0003215 0.0003215 0.4826367 0.5147910 0.0003215 0.0003215 0.0003215 0.4826367
111552161 0.1575916 0.0005236 0.0005236 0.0005236 0.0005236 0.4193717 0.4193717 0.0005236 0.0005236 0.0005236 0.4193717
111552255 0.0004975 0.0004975 0.0004975 0.0004975 0.0004975 0.5975124 0.3985075 0.0004975 0.0004975 0.0004975 0.3985075
111552549 0.2138790 0.0003559 0.0003559 0.0003559 0.0003559 0.3918149 0.3918149 0.0003559 0.0003559 0.0003559 0.3918149
111551870 0.0001783 0.0001783 0.0001783 0.0001783 0.0001783 0.3745098 0.6240642 0.0001783 0.0001783 0.0001783 0.3745098
111551939 0.0001783 0.0001783 0.0001783 0.0001783 0.0001783 0.4636364 0.3745098 0.0180036 0.1427807 0.0001783 0.3745098

So document 111551947 is 48% topic 6 (jobs/economy) and 51% topic 7 (education). You can look up this document in the topic browser opened above to see which words in the document belong to each topic.
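
You can also inspect this directly in R from the same token data; a small sketch, assuming as above that the token data contains doc_id, token, and the topic column created by lda_fit:

head(subset(as.data.frame(d), doc_id == "111551947", select = c(token, topic)), 20)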

You could also spot possibly 'problematic' documents by looking for documents with very dispersed loadings, i.e. documents whose highest loading is relatively low:

p$max = apply(p[, 1:10], 1, max)
p = p[order(p$max), ]
head(p)
                  1         2         3         4         5         6         7         8         9        10       min       max
111551914 0.0002494 0.0002494 0.1748130 0.2246883 0.1997506 0.1748130 0.1498753 0.0002494 0.0002494 0.0750623 0.1498753 0.2246883
111542063 0.0002320 0.0002320 0.2322506 0.0002320 0.1626450 0.1858469 0.2090487 0.0234339 0.0002320 0.1858469 0.1858469 0.2322506
111551859 0.2179673 0.1635209 0.2361162 0.0001815 0.0001815 0.1816697 0.0001815 0.0001815 0.1998185 0.0001815 0.0001815 0.2361162
111552103 0.0002770 0.2772853 0.2218837 0.2495845 0.2495845 0.0002770 0.0002770 0.0002770 0.0002770 0.0002770 0.0002770 0.2772853
111551832 0.1626450 0.0002320 0.2554524 0.2786543 0.0002320 0.2554524 0.0002320 0.0002320 0.0002320 0.0466357 0.0002320 0.2786543
111552658 0.0001751 0.0001751 0.1052539 0.1928196 0.0001751 0.2803853 0.2628722 0.0001751 0.1577933 0.0001751 0.2628722 0.2803853

Why is this document so ‘strange’? Is that a fault of LDA, or rather a strength?
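
If you want a single number to quantify such dispersion, the entropy of the topic distribution is a common choice; a minimal sketch (higher entropy means the loadings are spread more evenly over the topics):

p$entropy = -rowSums(p[, 1:10] * log(p[, 1:10]))
head(p[order(-p$entropy), c("min", "max", "entropy")])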