Fitting LDA models in R is technically quite simple: just call the LDA function from the topicmodels package. First, let's create a document-term matrix from the inaugural speeches in quanteda. We do this at the paragraph level, since we can expect each paragraph to be mostly about a single topic:
library(quanteda)
# Reshape the inaugural speeches corpus into paragraph-level documents
texts = corpus_reshape(data_corpus_inaugural, to = "paragraphs")
# Create a dfm without punctuation and English stopwords
# (in recent quanteda versions, tokenize with tokens() first and use tokens_remove())
dfm = dfm(texts, remove_punct = TRUE, remove = stopwords("english"))
# Keep only terms occurring in at least 5 documents
dfm = dfm_trim(dfm, min_docfreq = 5)
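As a quick sanity check, we can inspect the size of the resulting matrix (ndoc and nfeat are quanteda's accessors for the number of documents and features):
ndoc(dfm)   # number of paragraph-level documents
nfeat(dfm)  # number of word types kept after trimming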
To run LDA from a dfm, first convert it to the topicmodels format, and then run LDA. Note the use of set.seed(.) to make sure that the analysis is reproducible.
library(topicmodels)
# Convert the quanteda dfm to the sparse matrix format used by topicmodels
dtm = convert(dfm, to = "topicmodels")
# Fix the random seed so the (stochastic) Gibbs sampling gives the same result each run
set.seed(1)
m = LDA(dtm, method = "Gibbs", k = 10, control = list(alpha = 0.1))
m
## A LDA_Gibbs topic model with 10 topics.
We can use the terms function to look at the top terms per topic:
terms(m, 5)
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 |
---|---|---|---|---|---|---|---|---|---|
great | upon | government | us | can | nations | government | us | world | government |
years | shall | people | god | every | peace | public | let | peace | shall |
states | duties | states | day | america | war | business | can | freedom | congress |
now | country | union | president | must | foreign | revenue | new | people | may |
upon | people | every | new | country | united | can | must | free | law |
The posterior function gives the posterior distribution of terms per topic and of topics per document. We can use the term distribution, for example, to plot a word cloud in which terms are sized proportional to their probability within a topic:
topic = 6
# Posterior probability of each term within topic 6
words = posterior(m)$terms[topic, ]
# Take the 50 most probable terms for this topic
topwords = head(sort(words, decreasing = TRUE), n = 50)
head(topwords)
## nations peace war foreign united states
## 0.02084942 0.01905361 0.01779653 0.01582114 0.01564156 0.01312741
Now we can plot these words:
library(wordcloud)
wordcloud(names(topwords), topwords)
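wordcloud accepts various optional arguments to tune the plot; for example, to fix the word placement and set the size range (just an illustration of its scale and random.order parameters):
wordcloud(names(topwords), topwords, scale = c(4, 0.5), random.order = FALSE)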
We can also look at the topics per document, to find the top documents per topic:
# Posterior probability of topic 6 in each document
topic.docs = posterior(m)$topics[, topic]
topic.docs = sort(topic.docs, decreasing = TRUE)
head(topic.docs)
## 1949-Truman.43 1805-Jefferson.3 1901-McKinley.2 1949-Truman.25
## 0.9181818 0.9093023 0.9050000 0.8882353
## 1813-Madison.7 1889-Harrison.22
## 0.8714286 0.8627907
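As a sidenote, if you only need the single most likely topic per document, topicmodels also offers topics as the counterpart of terms:
head(topics(m))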
And we can find the top document for topic 6 in the original corpus by looking up its document id in the document variables (docvars):
docs = docvars(dfm)
# Name of the document with the highest loading on this topic
topdoc = names(topic.docs)[1]
docid = which(rownames(docs) == topdoc)
texts[docid]
## 1949-Truman.43
## "In addition, we will provide military advice and equipment to free nations which will cooperate with us in the maintenance of peace and security."
Finally, we can see which president preferred which topics:
# Keep only the documents that made it into the dtm
docs = docs[rownames(docs) %in% rownames(dtm), ]
# Average topic proportions per president
tpp = aggregate(posterior(m)$topics, by = docs["President"], mean)
rownames(tpp) = tpp$President
heatmap(as.matrix(tpp[-1]))
As you can see, the topics form a sort of 'block' distribution, with more recent presidents and older presidents using quite different topics. So either the role of presidents changed, or language use changed, or (probably) both.
To get a better fit of such temporal dynamics, see the session on structural topic models, which allow you to condition topic proportions and/or contents on metadata covariates such as source or date.
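As a small preview, here is a minimal sketch of such a model with the stm package (assuming it is installed; the spline on the Year docvar is just an illustration of a prevalence covariate):
library(stm)
stm_input = convert(dfm, to = "stm")  # yields $documents, $vocab and $meta
m_stm = stm(stm_input$documents, stm_input$vocab, K = 10,
            prevalence = ~ s(Year), data = stm_input$meta)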
Corpustools is a text analysis package based on a token list or tcorpus rather than a dtm: essentially a list of the words in each document. This is less efficient than a dtm, but it preserves word order, which allows you to do some things that are impossible with a dtm, such as sliding-window associations (semantic networks), proximity conditions in dictionaries, and keyword-in-context lists.
It also allows you to reconstruct the original text from the tokens, which is useful for LDA because we can then view which words belonged to which topic in the actual documents.
First, we create a tcorpus, do some cleaning, and run lda_fit directly on the tcorpus. By running it on the tcorpus rather than on a dtm, the topic assigned to each token is stored in the tcorpus itself:
library(corpustools)
tc = create_tcorpus(sotu_texts, doc_column = 'id')
# Preprocess tokens into a 'feature' column, dropping stopwords and rare words
tc$preprocess('token', 'feature', min_docfreq = 5, remove_stopwords = TRUE)
set.seed(1)
# Fit LDA and store each token's topic assignment as a new 'topic' feature
m = tc$lda_fit('feature', create_feature = 'topic', K = 10, alpha = 0.01)
Now we can create an LDA browser using the tokenbrowser package. You need to install this from GitHub, as it hasn't been added to CRAN yet.
devtools::install_github("kasperwelbers/tokenbrowser")
library(tokenbrowser)
# Get the token information from the tcorpus, including the topic per token
d = tc$get()
# Create an HTML reader in which tokens are colored by topic, and open it
url = categorical_reader(d, category = d$topic)
browseURL(url)
To find an interesting document to inspect, you can look at the top documents for a topic as above:
terms(m, 5)
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 |
---|---|---|---|---|---|---|---|---|---|
energy | security | years | us | tax | jobs | schools | will | congress | health |
will | will | year | people | will | new | education | iraq | will | care |
years | social | next | can | year | economy | children | america | law | insurance |
new | retirement | last | will | now | businesses | school | people | must | will |
clean | budget | now | one | get | now | students | terrorists | ask | coverage |
# The 10 documents with the highest loading on topic 8
tail(sort(posterior(m)$topics[, 8]), n = 10)
## 111552534 111552613 111552705 111552717 111552512 111552706 111542855
## 0.9823529 0.9873239 0.9936170 0.9936170 0.9944099 0.9944099 0.9947368
## 111552707 111542967 111542943
## 0.9947368 0.9955224 0.9957346
So document 111542943 is the document most strongly related to topic 8 (Iraq/terrorism).
You can also find documents which contain mixtures of topics. First, let's see how our topics are correlated:
# Correlation between the topic loadings across documents
topic_cor = cor(posterior(m)$topics)
# Zero out the diagonal (self-correlations) so it doesn't dominate the heatmap
diag(topic_cor) = 0
heatmap(topic_cor)
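Rather than eyeballing the heatmap, you can also extract the most strongly correlated topic pair directly:
# Row/column indices of the highest off-diagonal correlation (the diagonal was zeroed above)
which(topic_cor == max(topic_cor), arr.ind = TRUE)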
So topics 6 (economy) and 7 (education) are quite similar in their document distribution, which means they should co-occur relatively frequently.
Let's find documents that contain both by calculating the minimum of their two loadings and sorting on that:
p = as.data.frame(posterior(m)$topics)
# A document loads highly on both topics only if the lower of the two loadings is high
p$min = apply(p[, c(6, 7)], 1, min)
p = p[order(-p$min), ]
head(p)
doc_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | min
---|---|---|---|---|---|---|---|---|---|---|---
111551947 | 0.0003215 | 0.0003215 | 0.0003215 | 0.0003215 | 0.0003215 | 0.4826367 | 0.5147910 | 0.0003215 | 0.0003215 | 0.0003215 | 0.4826367 |
111552161 | 0.1575916 | 0.0005236 | 0.0005236 | 0.0005236 | 0.0005236 | 0.4193717 | 0.4193717 | 0.0005236 | 0.0005236 | 0.0005236 | 0.4193717 |
111552255 | 0.0004975 | 0.0004975 | 0.0004975 | 0.0004975 | 0.0004975 | 0.5975124 | 0.3985075 | 0.0004975 | 0.0004975 | 0.0004975 | 0.3985075 |
111552549 | 0.2138790 | 0.0003559 | 0.0003559 | 0.0003559 | 0.0003559 | 0.3918149 | 0.3918149 | 0.0003559 | 0.0003559 | 0.0003559 | 0.3918149 |
111551870 | 0.0001783 | 0.0001783 | 0.0001783 | 0.0001783 | 0.0001783 | 0.3745098 | 0.6240642 | 0.0001783 | 0.0001783 | 0.0001783 | 0.3745098 |
111551939 | 0.0001783 | 0.0001783 | 0.0001783 | 0.0001783 | 0.0001783 | 0.4636364 | 0.3745098 | 0.0180036 | 0.1427807 | 0.0001783 | 0.3745098 |
So document 111551947 is 48% jobs and 51% education. You can look up this document in the topic browser opened above to see which words in that document belong to each topic.
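Since the tcorpus stored the topic per token, you can also list the words of this document with their assigned topics directly in R, using the token data d retrieved above with tc$get():
# Tokens of the top document and the topic each token was assigned to
d[d$doc_id == "111551947", c("token", "topic")]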
You could also spot possibly 'problematic' documents by looking for documents with very dispersed loadings, i.e. documents without a single dominant topic:
# A document's highest loading; low values mean the loadings are spread over many topics
p$max = apply(p[, 1:10], 1, max)
p = p[order(p$max), ]
head(p)
doc_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | min | max
---|---|---|---|---|---|---|---|---|---|---|---|---
111551914 | 0.0002494 | 0.0002494 | 0.1748130 | 0.2246883 | 0.1997506 | 0.1748130 | 0.1498753 | 0.0002494 | 0.0002494 | 0.0750623 | 0.1498753 | 0.2246883 |
111542063 | 0.0002320 | 0.0002320 | 0.2322506 | 0.0002320 | 0.1626450 | 0.1858469 | 0.2090487 | 0.0234339 | 0.0002320 | 0.1858469 | 0.1858469 | 0.2322506 |
111551859 | 0.2179673 | 0.1635209 | 0.2361162 | 0.0001815 | 0.0001815 | 0.1816697 | 0.0001815 | 0.0001815 | 0.1998185 | 0.0001815 | 0.0001815 | 0.2361162 |
111552103 | 0.0002770 | 0.2772853 | 0.2218837 | 0.2495845 | 0.2495845 | 0.0002770 | 0.0002770 | 0.0002770 | 0.0002770 | 0.0002770 | 0.0002770 | 0.2772853 |
111551832 | 0.1626450 | 0.0002320 | 0.2554524 | 0.2786543 | 0.0002320 | 0.2554524 | 0.0002320 | 0.0002320 | 0.0002320 | 0.0466357 | 0.0002320 | 0.2786543 |
111552658 | 0.0001751 | 0.0001751 | 0.1052539 | 0.1928196 | 0.0001751 | 0.2803853 | 0.2628722 | 0.0001751 | 0.1577933 | 0.0001751 | 0.2628722 | 0.2803853 |
Why is this document so ‘strange’? Is that a fault of LDA, or rather a strength?
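An alternative way to quantify how dispersed a document's loadings are is the entropy of its topic distribution; a minimal sketch (higher entropy means the loadings are spread more evenly over the topics):
# Shannon entropy of each document's topic distribution
p$entropy = apply(p[, 1:10], 1, function(x) -sum(x * log(x)))
head(p[order(-p$entropy), ])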