Fitting LDA models in R is technically quite simple: just call the LDA function from the topicmodels package. First, let's create a document-term matrix from the inaugural speeches in quanteda. We do this at the paragraph level, since we can expect each paragraph to be mostly about a single topic:
library(quanteda)
# Reshape the inaugural speeches corpus into paragraph-level documents
texts = corpus_reshape(data_corpus_inaugural, to = "paragraphs")
# Create a dfm without punctuation and English stopwords
# (in recent quanteda versions, tokenize with tokens() first and use tokens_remove())
dfm = dfm(texts, remove_punct = TRUE, remove = stopwords("english"))
# Keep only terms occurring in at least 5 documents
dfm = dfm_trim(dfm, min_docfreq = 5)
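As a quick sanity check, we can inspect the size of the resulting matrix (ndoc and nfeat are quanteda's accessors for the number of documents and features):
ndoc(dfm)   # number of paragraph-level documents
nfeat(dfm)  # number of word types kept after trimming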
To run LDA from a dfm, first convert it to the topicmodels format, and then run LDA. Note the use of set.seed(.) to make sure that the analysis is reproducible.
library(topicmodels)
# Convert the quanteda dfm to the sparse matrix format used by topicmodels
dtm = convert(dfm, to = "topicmodels")
# Fix the random seed so the (stochastic) Gibbs sampling gives the same result each run
set.seed(1)
m = LDA(dtm, method = "Gibbs", k = 10, control = list(alpha = 0.1))
m
## A LDA_Gibbs topic model with 10 topics.
We can use the terms function to look at the top terms per topic:
terms(m, 5)
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 |
---|---|---|---|---|---|---|---|---|---|
great | upon | government | us | can | nations | government | us | world | government |
years | shall | people | god | every | peace | public | let | peace | shall |
states | duties | states | day | america | war | business | can | freedom | congress |
now | country | union | president | must | foreign | revenue | new | people | may |
upon | people | every | new | country | united | can | must | free | law |
The posterior function gives the posterior distribution of terms per topic and of topics per document. We can use the term distribution, for example, to plot a word cloud in which terms are sized proportional to their probability within a topic:
topic = 6
# Posterior probability of each term within topic 6
words = posterior(m)$terms[topic, ]
# Take the 50 most probable terms for this topic
topwords = head(sort(words, decreasing = TRUE), n = 50)
head(topwords)
## nations peace war foreign united states
## 0.02084942 0.01905361 0.01779653 0.01582114 0.01564156 0.01312741
Now we can plot these words:
library(wordcloud)
wordcloud(names(topwords), topwords)
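wordcloud accepts various optional arguments to tune the plot; for example, to fix the word placement and set the size range (just an illustration of its scale and random.order parameters):
wordcloud(names(topwords), topwords, scale = c(4, 0.5), random.order = FALSE)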
We can also look at the topics per document, to find the top documents per topic:
# Posterior probability of topic 6 in each document
topic.docs = posterior(m)$topics[, topic]
topic.docs = sort(topic.docs, decreasing = TRUE)
head(topic.docs)
## 1949-Truman.43 1805-Jefferson.3 1901-McKinley.2 1949-Truman.25
## 0.9181818 0.9093023 0.9050000 0.8882353
## 1813-Madison.7 1889-Harrison.22
## 0.8714286 0.8627907
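As a sidenote, if you only need the single most likely topic per document, topicmodels also offers topics as the counterpart of terms:
head(topics(m))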
And we can find the top document for topic 6 in the original corpus by looking up its document id in the document variables (docvars):
docs = docvars(dfm)
# Name of the document with the highest loading on this topic
topdoc = names(topic.docs)[1]
docid = which(rownames(docs) == topdoc)
texts[docid]
## 1949-Truman.43
## "In addition, we will provide military advice and equipment to free nations which will cooperate with us in the maintenance of peace and security."
Finally, we can see which president preferred which topics:
# Keep only the documents that made it into the dtm
docs = docs[rownames(docs) %in% rownames(dtm), ]
# Average topic proportions per president
tpp = aggregate(posterior(m)$topics, by = docs["President"], mean)
rownames(tpp) = tpp$President
heatmap(as.matrix(tpp[-1]))
As you can see, the topics form a sort of 'block' distribution, with more recent presidents and older presidents using quite different topics. So either the role of presidents changed, or language use changed, or (probably) both.
To get a better fit of such temporal dynamics, see the session on structural topic models, which allow you to condition topic proportions and/or contents on metadata covariates such as source or date.
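As a small preview, here is a minimal sketch of such a model with the stm package (assuming it is installed; the spline on the Year docvar is just an illustration of a prevalence covariate):
library(stm)
stm_input = convert(dfm, to = "stm")  # yields $documents, $vocab and $meta
m_stm = stm(stm_input$documents, stm_input$vocab, K = 10,
            prevalence = ~ s(Year), data = stm_input$meta)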
Corpustools is a text analysis package based on a token list or tcorpus rather than a dtm: essentially a list of the words in each document. This is less efficient than a dtm, but it preserves word order, which allows you to do some things that are impossible with a dtm, such as sliding-window associations (semantic networks), proximity conditions in dictionaries, and keyword-in-context lists.
It also allows you to reconstruct the original text from the tokens, which is useful for LDA because we can then view which words belonged to which topic in the actual documents.
First, we create a tcorpus, do some cleaning, and run lda_fit directly on the tcorpus. By running it on the tcorpus rather than on a dtm, the topic assigned to each token is stored in the tcorpus itself:
library(corpustools)
tc = create_tcorpus(sotu_texts, doc_column = 'id')
# Preprocess tokens into a 'feature' column, dropping stopwords and rare words
tc$preprocess('token', 'feature', min_docfreq = 5, remove_stopwords = TRUE)
set.seed(1)
# Fit LDA and store each token's topic assignment as a new 'topic' feature
m = tc$lda_fit('feature', create_feature = 'topic', K = 10, alpha = 0.01)
Now we can create an LDA browser using the tokenbrowser package. You need to install this from GitHub, as it hasn't been added to CRAN yet.
devtools::install_github("kasperwelbers/tokenbrowser")
library(tokenbrowser)
# Get the token information from the tcorpus, including the topic per token
d = tc$get()
# Create an HTML reader in which tokens are colored by topic, and open it
url = categorical_reader(d, category = d$topic)
browseURL(url)
To find an interesting document to inspect, you can look at the top documents for a topic as above:
terms(m, 5)
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 |
---|---|---|---|---|---|---|---|---|---|
energy | security | years | us | tax | jobs | schools | will | congress | health |
will | will | year | people | will | new | education | iraq | will | care |
years | social | next | can | year | economy | children | america | law | insurance |
new | retirement | last | will | now | businesses | school | people | must | will |
clean | budget | now | one | get | now | students | terrorists | ask | coverage |
# The 10 documents with the highest loading on topic 8
tail(sort(posterior(m)$topics[, 8]), n = 10)
## 111552534 111552613 111552705 111552717 111552512 111552706 111542855
## 0.9823529 0.9873239 0.9936170 0.9936170 0.9944099 0.9944099 0.9947368
## 111552707 111542967 111542943
## 0.9947368 0.9955224 0.9957346
So document 111542943 is the document most strongly related to topic 8 (Iraq/terrorism).
You can also find documents which contain mixtures of topics. First, let's see how our topics are correlated:
# Correlation between the topic loadings across documents
topic_cor = cor(posterior(m)$topics)
# Zero out the diagonal (self-correlations) so it doesn't dominate the heatmap
diag(topic_cor) = 0
heatmap(topic_cor)
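Rather than eyeballing the heatmap, you can also extract the most strongly correlated topic pair directly:
# Row/column indices of the highest off-diagonal correlation (the diagonal was zeroed above)
which(topic_cor == max(topic_cor), arr.ind = TRUE)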
So topics 6 (economy) and 7 (education) are quite similar in their document distribution, which means they should co-occur relatively frequently.
Let's find documents that contain both by calculating the minimum of their two loadings and sorting on that:
p = as.data.frame(posterior(m)$topics)
# A document loads highly on both topics only if the lower of the two loadings is high
p$min = apply(p[, c(6, 7)], 1, min)
p = p[order(-p$min), ]
head(p)
doc_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | min
---|---|---|---|---|---|---|---|---|---|---|---
111551947 | 0.0003215 | 0.0003215 | 0.0003215 | 0.0003215 | 0.0003215 | 0.4826367 | 0.5147910 | 0.0003215 | 0.0003215 | 0.0003215 | 0.4826367 |
111552161 | 0.1575916 | 0.0005236 | 0.0005236 | 0.0005236 | 0.0005236 | 0.4193717 | 0.4193717 | 0.0005236 | 0.0005236 | 0.0005236 | 0.4193717 |
111552255 | 0.0004975 | 0.0004975 | 0.0004975 | 0.0004975 | 0.0004975 | 0.5975124 | 0.3985075 | 0.0004975 | 0.0004975 | 0.0004975 | 0.3985075 |
111552549 | 0.2138790 | 0.0003559 | 0.0003559 | 0.0003559 | 0.0003559 | 0.3918149 | 0.3918149 | 0.0003559 | 0.0003559 | 0.0003559 | 0.3918149 |
111551870 | 0.0001783 | 0.0001783 | 0.0001783 | 0.0001783 | 0.0001783 | 0.3745098 | 0.6240642 | 0.0001783 | 0.0001783 | 0.0001783 | 0.3745098 |
111551939 | 0.0001783 | 0.0001783 | 0.0001783 | 0.0001783 | 0.0001783 | 0.4636364 | 0.3745098 | 0.0180036 | 0.1427807 | 0.0001783 | 0.3745098 |
So document 111551947 is 48% jobs and 51% education. You can look up this document in the topic browser opened above to see which words in that document belong to each topic.
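Since the tcorpus stored the topic per token, you can also list the words of this document with their assigned topics directly in R, using the token data d retrieved above with tc$get():
# Tokens of the top document and the topic each token was assigned to
d[d$doc_id == "111551947", c("token", "topic")]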
You could also spot possibly 'problematic' documents by looking for documents with very dispersed loadings, i.e. documents without a single dominant topic:
# A document's highest loading; low values mean the loadings are spread over many topics
p$max = apply(p[, 1:10], 1, max)
p = p[order(p$max), ]
head(p)
doc_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | min | max
---|---|---|---|---|---|---|---|---|---|---|---|---
111551914 | 0.0002494 | 0.0002494 | 0.1748130 | 0.2246883 | 0.1997506 | 0.1748130 | 0.1498753 | 0.0002494 | 0.0002494 | 0.0750623 | 0.1498753 | 0.2246883 |
111542063 | 0.0002320 | 0.0002320 | 0.2322506 | 0.0002320 | 0.1626450 | 0.1858469 | 0.2090487 | 0.0234339 | 0.0002320 | 0.1858469 | 0.1858469 | 0.2322506 |
111551859 | 0.2179673 | 0.1635209 | 0.2361162 | 0.0001815 | 0.0001815 | 0.1816697 | 0.0001815 | 0.0001815 | 0.1998185 | 0.0001815 | 0.0001815 | 0.2361162 |
111552103 | 0.0002770 | 0.2772853 | 0.2218837 | 0.2495845 | 0.2495845 | 0.0002770 | 0.0002770 | 0.0002770 | 0.0002770 | 0.0002770 | 0.0002770 | 0.2772853 |
111551832 | 0.1626450 | 0.0002320 | 0.2554524 | 0.2786543 | 0.0002320 | 0.2554524 | 0.0002320 | 0.0002320 | 0.0002320 | 0.0466357 | 0.0002320 | 0.2786543 |
111552658 | 0.0001751 | 0.0001751 | 0.1052539 | 0.1928196 | 0.0001751 | 0.2803853 | 0.2628722 | 0.0001751 | 0.1577933 | 0.0001751 | 0.2628722 | 0.2803853 |
Why is this document so ‘strange’? Is that a fault of LDA, or rather a strength?
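An alternative way to quantify how dispersed a document's loadings are is the entropy of its topic distribution; a minimal sketch (higher entropy means the loadings are spread more evenly over the topics):
# Shannon entropy of each document's topic distribution
p$entropy = apply(p[, 1:10], 1, function(x) -sum(x * log(x)))
head(p[order(-p$entropy), ])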