Structural Topic Models

Structural topic models (STM) are an extension of LDA that allow us to explicitly model text metadata, such as date or author, as covariates of the topic prevalence and/or the topic word distributions. The stm package provides excellent support for these models in R (see http://structuraltopicmodel.com), although fitting them can be somewhat slower than fitting regular LDA models.

In this document we will model the State of the Union speeches of Bush and Obama, using the year and President as covariates. First, we read the data into a (quanteda) corpus to make sure the metadata is kept with the text, and then we create a dfm:

library(quanteda)
data("sotu_texts", package="corpustools")
sotu_texts$id = as.character(sotu_texts$id)
sotu_texts$year = as.numeric(format(sotu_texts$date, "%Y"))
sotu = corpus(sotu_texts, docid_field="id", text_field="text")
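# recent quanteda versions expect tokens as input for dfm(), and dfm_trim() uses min_termfreq rather than min_count
sotu_toks = tokens(sotu, remove_punct=TRUE)
sotu_dfm = dfm_remove(dfm(sotu_toks), stopwords("english"))
sotu_dfm = dfm_trim(sotu_dfm, min_termfreq = 2)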
head(docvars(sotu_dfm))
##                 date       party      president year
## 111552549 2001-02-27 Republicans George W. Bush 2001
## 111552556 2001-02-27 Republicans George W. Bush 2001
## 111552570 2001-02-27 Republicans George W. Bush 2001
## 111552599 2001-02-27 Republicans George W. Bush 2001
## 111542780 2001-02-27 Republicans George W. Bush 2001
## 111542800 2001-02-27 Republicans George W. Bush 2001
table(docvars(sotu_dfm)$president)
## 
##   Barack Obama George W. Bush 
##            554            536

Now, we can use the stm function to fit a topic model. First, let’s fit one without any covariates:

library(stm)
m = stm(sotu_dfm, K = 10, max.em.its = 100, control=list(alpha=1))

This works similarly to a regular topic model, but it also models correlations between topics. To inspect topic model results, we can use functions from the stm package:

plot(m, type="summary")

labelTopics(m, topic=9)
## Topic 9 Top Words:
##       Highest Prob: energy, new, jobs, years, america, clean, economy 
##       FREX: house, solar, environment, energy, cleaner, clean, oil 
##       Lift: 2017, 75, 80, acres, all-of-the-above, answers, automobiles 
##       Score: house, energy, clean, oil, jobs, solar, renewable

By default, stm gives multiple ways of inspecting top words: simple highest probability (the first row) and three ways of finding the most ‘typical’ words (FREX, Lift, and Score), which focus on words that are less common in other topics.
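To get a feel for why these rankings differ, here is a small base R sketch on made-up numbers. It is a simplified illustration rather than stm's exact implementation: the `beta` matrix and `wordcounts` vector are hypothetical, lift is taken as the topic probability divided by the overall relative word frequency, and FREX as a harmonic mean of within-topic frequency and exclusivity ranks with equal weights.

```r
# Hypothetical topic-word probabilities: 2 topics (rows) x 4 words (columns)
beta = matrix(c(0.40, 0.30, 0.20, 0.10,
                0.40, 0.10, 0.10, 0.40),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("topic1", "topic2"),
                              c("america", "energy", "solar", "tax")))
# Hypothetical word counts in the whole corpus
wordcounts = c(america = 400, energy = 200, solar = 150, tax = 250)

# Highest probability: simply sort the words within the topic
top_prob = colnames(beta)[order(beta["topic1", ], decreasing = TRUE)]

# Lift: topic probability divided by the overall relative frequency of the word
lift = sweep(beta, 2, wordcounts / sum(wordcounts), "/")
top_lift = colnames(beta)[order(lift["topic1", ], decreasing = TRUE)]

# Exclusivity: share of a word's probability mass that belongs to each topic
excl = sweep(beta, 2, colSums(beta), "/")

# FREX (simplified): harmonic mean of frequency and exclusivity rank scores
freq_score = t(apply(beta, 1, rank)) / ncol(beta)
excl_score = t(apply(excl, 1, rank)) / ncol(beta)
frex = 1 / (0.5 / freq_score + 0.5 / excl_score)
top_frex = colnames(beta)[order(frex["topic1", ], decreasing = TRUE)]

top_prob
top_lift
top_frex
```

In this toy example ‘america’ has the highest probability in topic 1, but because it is equally likely in topic 2, both lift and FREX rank ‘energy’ first.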

We can also plot the words per topic and the words ‘between’ two topics:

cloud(m, topic=9)

plot(m, type="perspectives", topics=c(4,5))

Prevalence covariate

Now, let’s model year as a covariate of the topic prevalence:

m2 = stm(sotu_dfm, K = 10, prevalence =~ year, max.em.its = 100)

Besides the functions above, we can now also model the effect of year using the estimateEffect function:

prep <- estimateEffect(1:10 ~ year, stmobj = m2, meta = docvars(sotu_dfm))
summary(prep, topics=1)
## 
## Call:
## estimateEffect(formula = 1:10 ~ year, stmobj = m2, metadata = docvars(sotu_dfm))
## 
## 
## Topic 1:
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.191420   3.250115  -4.366 1.38e-05 ***
## year          0.007116   0.001619   4.396 1.21e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
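To make the size of the year coefficient concrete, we can plug the printed estimates into the linear equation. This is only a rough sketch: the numbers are copied (as rounded) from the output above, and the years 2001 and 2016 are assumed to be the first and last years in the data.

```r
# Coefficients copied from the summary output above (rounded as printed)
intercept = -14.191420
slope = 0.007116

# Assumed first and last year of the speeches
years = c(2001, 2016)

# Expected proportion of topic 1 in a document from each year
predicted = intercept + slope * years
round(predicted, 3)
```

The slope of about 0.007 means that the expected share of topic 1 rises by roughly 0.7 percentage points per year, from about 0.05 to about 0.15 over this assumed period.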

We can also plot the topic prevalences over time:

plot(prep, "year", method = "continuous", topics = c(1,3), model = m2)

Content covariates

Finally, let’s also add President as a content covariate, and model year in a non-linear fashion.

For modeling year, we use the s() (splines) function to allow for a non-linear effect of year. The spline function creates a number of sub-functions that each capture a ‘period’ of the data, creating different ‘dummy’ variables that can later be used to reconstruct the effect per year:

library(ggplot2)
splines = cbind(data.frame(year=1990:2000), s(1990:2000, 4))
splines = reshape2::melt(splines, id.var="year")
ggplot(splines, aes(x=year, y=value, color=variable, group=variable)) + geom_line()

We put the spline for year in the prevalence formula and add President using the content=~ argument:

m3 = stm(sotu_dfm, K = 10, prevalence =~ s(year, 4), content =~ president, max.em.its = 100)

Now, we can estimate the effect of year in a non-linear fashion:

prep <- estimateEffect(1:10 ~ s(year, 4), stmobj = m3, meta = docvars(sotu_dfm))
plot(prep, "year", method = "continuous", topics = 2, model = m3)

Moreover, if we ask for the top terms, we get the terms per President as well as per topic:

labelTopics(m3)

We can also plot the word use per president as a ‘perspective plot’:

plot(m3, type="perspectives", topics=2)

Correlation structure

We can also plot the correlation structure of the model:

corr = topicCorr(m)
plot(corr)
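To see where such correlations can come from, the following base R sketch simulates topic proportions from correlated latent scores passed through a softmax (a logistic-normal distribution). This is a simplified illustration of the general idea, not stm's estimation procedure, and the covariance matrix `Sigma` is made up for the example.

```r
set.seed(1)
K = 3

# Hypothetical covariance of the latent topic scores:
# topics 1 and 2 tend to occur together, topic 3 is unrelated
Sigma = matrix(c(1.0, 0.8, 0.0,
                 0.8, 1.0, 0.0,
                 0.0, 0.0, 1.0), nrow = K)

# Draw latent scores for 1000 documents from N(0, Sigma) using the Cholesky factor
eta = matrix(rnorm(1000 * K), ncol = K) %*% chol(Sigma)

# Softmax turns the scores into topic proportions that sum to 1 per document
theta = exp(eta) / rowSums(exp(eta))

# Correlations between the simulated topic proportions
round(cor(theta), 2)
```

A Dirichlet prior, as used in regular LDA, cannot express that two specific topics tend to occur together, which is why this kind of correlation plot is more informative for models such as STM.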

And more!

STM contains many more useful functions, for example for selecting the best model or K, and determining coherence. See the vignette (from http://www.structuraltopicmodel.com/) and the help files for the stm package!