Structural Topic Models

Structural topic models (STM) are an extension of LDA that allow us to explicitly model text metadata such as date or author as covariates of the topic prevalence and/or the topic word distributions. STM has excellent support in R through the stm package (see http://structuraltopicmodel.com), although fitting these models can be a bit slower than fitting regular LDA models.

In this document we will model the State of the Union speeches of Bush and Obama, using the year and President as covariates. First, we read the data into a (quanteda) corpus to make sure the metadata is kept with the text, and then we create a dfm:

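library(quanteda)
data("sotu_texts", package="corpustools")
sotu_texts$id = as.character(sotu_texts$id)
sotu_texts$year = as.numeric(format(sotu_texts$date, "%Y"))
sotu = corpus(sotu_texts, docid_field="id", text_field="text")
# Tokenize first: recent quanteda versions no longer build a dfm directly from a corpus
sotu_tokens = tokens(sotu, remove_punct=TRUE)
sotu_tokens = tokens_remove(sotu_tokens, stopwords("english"))
sotu_dfm = dfm(sotu_tokens)
# min_termfreq replaces the older min_count argument
sotu_dfm = dfm_trim(sotu_dfm, min_termfreq = 2)
head(docvars(sotu_dfm))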

	date	party	president	year
111552549	2001-02-27	Republicans	George W. Bush	2001
111552556	2001-02-27	Republicans	George W. Bush	2001
111552570	2001-02-27	Republicans	George W. Bush	2001
111552599	2001-02-27	Republicans	George W. Bush	2001
111542780	2001-02-27	Republicans	George W. Bush	2001
111542800	2001-02-27	Republicans	George W. Bush	2001

table(docvars(sotu_dfm)$president)

Barack Obama	George W. Bush
554	536

Now, we can use the stm function to fit a topic model. First, let’s fit one without any covariates:

library(stm)
m = stm(sotu_dfm, K = 10, max.em.its = 100, control=list(alpha=1))

This works similarly to a regular topic model, but it also models correlations between topics. To inspect the topic model results, we can use functions from the stm package:

plot(m, type="summary")

labelTopics(m, topic=9)

## Topic 9 Top Words:
##       Highest Prob: energy, new, jobs, years, america, clean, economy 
##       FREX: house, solar, environment, energy, cleaner, clean, oil 
##       Lift: 2017, 75, 80, acres, all-of-the-above, answers, automobiles 
##       Score: house, energy, clean, oil, jobs, solar, renewable

By default, stm gives multiple ways of inspecting top words: simple highest probability (the first row) and three ways of finding the most ‘typical’ words by focusing on words that are less common in other topics.
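
To make the difference between these rankings concrete, here is a minimal base-R sketch of the idea behind Lift: a word's probability within a topic divided by its overall frequency in the corpus. The matrix `beta` and the vector `word_freq` are made-up toy values, not output from the model above, and the implementation in stm may differ in detail.

```r
# Toy topic-word probabilities (rows = topics, columns = words); values are invented
beta <- rbind(
  topic1 = c(energy = 0.40, jobs = 0.35, america = 0.20, solar = 0.05),
  topic2 = c(energy = 0.05, jobs = 0.40, america = 0.54, solar = 0.01)
)
# Toy overall word frequencies in the corpus (sum to 1); also invented
word_freq <- c(energy = 0.20, jobs = 0.38, america = 0.39, solar = 0.03)

# Highest probability: rank words by their probability within the topic
top_prob <- names(sort(beta["topic1", ], decreasing = TRUE))

# Lift: divide each word's topic probability by its overall frequency
lift <- sweep(beta, 2, word_freq, "/")
top_lift <- names(sort(lift["topic1", ], decreasing = TRUE))

print(top_prob)
print(top_lift)
```

In this toy example ‘solar’ moves up in the Lift ranking for topic 1: it is rare overall, but comparatively likely within that topic.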

We can also plot the words per topic and the words ‘between’ two topics:

cloud(m, topic=9)

plot(m, type="perspectives", topics=c(4,5))

Prevalence covariate

Now, let’s model year as a covariate of the topic prevalence:

m2 = stm(sotu_dfm, K = 10, prevalence =~ year, max.em.its = 100)

Besides the functions above, we can now also model the effect of year using the estimateEffect function:

prep <- estimateEffect(1:10 ~ year, stmobj = m2, metadata = docvars(sotu_dfm))
summary(prep, topics=1)

## 
## Call:
## estimateEffect(formula = 1:10 ~ year, stmobj = m2, metadata = docvars(sotu_dfm))
## 
## 
## Topic 1:
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.191420   3.250115  -4.366 1.38e-05 ***
## year          0.007116   0.001619   4.396 1.21e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And plot the topic prevalences over time:

plot(prep, "year", method = "continuous", topics = c(1,3), model = m2)
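
As a rough illustration of what such an effect estimate means, the sketch below regresses a simulated topic proportion on year with ordinary least squares. The vectors `year` and `theta1` are invented stand-ins for the document metadata and the per-document proportion of one topic. estimateEffect additionally propagates the uncertainty in the estimated topic proportions, which this sketch ignores.

```r
set.seed(1)  # make the simulated toy data reproducible

# Simulated stand-ins: one value per document
n <- 200
year <- sample(2001:2014, n, replace = TRUE)
# Invented topic proportion that rises slightly with year, plus noise
theta1 <- 0.05 + 0.007 * (year - 2001) + rnorm(n, sd = 0.02)

# Ordinary least squares regression of the topic proportion on year
fit <- lm(theta1 ~ year)
print(coef(summary(fit)))
```

With a fitted model, the point estimates of the topic proportions are stored in the theta element of the stm object (for example m2$theta), with one row per document and one column per topic.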

Content covariates

Finally, let’s also add President as a content covariate, and model year in a non-linear fashion.

For modeling year, we use the s() (splines) function to allow for a non-linear effect of year. The spline function creates a number of sub-functions that each capture a ‘period’ of the data, creating different ‘dummy’ variables that can later be used to reconstruct the effect per year:

library(ggplot2)
splines = cbind(data.frame(year=1990:2000), s(1990:2000, 4))
splines = reshape2::melt(splines, id.var="year")
ggplot(splines, aes(x=year, y=value, color=variable, group=variable)) + geom_line()

We put the spline of year in the prevalence formula and add President as a content covariate using the content=~ argument:

m3 = stm(sotu_dfm, K = 10, prevalence =~ s(year, 4), content =~ president, max.em.its = 100)

Now, we can estimate the effect of year in a non-linear fashion:

prep <- estimateEffect(1:10 ~ s(year, 4), stmobj = m3, metadata = docvars(sotu_dfm))
plot(prep, "year", method = "continuous", topics = 2, model = m3)

Moreover, if we ask for the top terms, we get the terms per President as well as per topic:

labelTopics(m3)

We can also plot the word use per president as a ‘perspective plot’:

plot(m3, type="perspectives", topics=2)

Correlation structure

Finally, we can plot the correlation structure of the model:

corr = topicCorr(m)
plot(corr)
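
The idea behind such a correlation graph can be sketched in base R: correlate the per-document topic proportions and keep only positive correlations above a cutoff as edges. The matrix `theta` below is an invented toy example, and the cutoff of 0.01 is assumed here to match the default of topicCorr; treat this as an approximation of the default method, not its exact implementation.

```r
# Toy document-topic proportions (rows = documents, columns = topics); values are invented
theta <- rbind(
  c(0.45, 0.40, 0.10, 0.05),
  c(0.40, 0.45, 0.05, 0.10),
  c(0.10, 0.05, 0.45, 0.40),
  c(0.05, 0.10, 0.40, 0.45),
  c(0.50, 0.35, 0.10, 0.05),
  c(0.05, 0.10, 0.50, 0.35)
)
colnames(theta) <- paste0("topic", 1:4)

# Correlation between topics across documents
topic_cor <- cor(theta)

# Keep only positive correlations above the cutoff as edges of the graph
cutoff <- 0.01  # assumed default cutoff
adjacency <- (topic_cor > cutoff) * 1
diag(adjacency) <- 0  # drop self-loops

print(round(topic_cor, 2))
print(adjacency)
```

In this toy data topics 1 and 2 tend to occur in the same documents, as do topics 3 and 4, so only those two pairs end up connected.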

And more!

STM contains many more useful functions, for example for selecting the best model or K, and determining coherence. See the vignette (from http://www.structuraltopicmodel.com/) and the help files for the stm package!

Fitting LDA Models in R

Wouter van Atteveldt

February 13, 2018
