Structural Topic Models

Structural topic models (STM) are an extension of LDA that allow us to explicitly model text metadata, such as date or author, as covariates of the topic prevalence and/or the topic word distributions. They have excellent support in R through the stm package, although fitting the models can be a bit slower than fitting regular LDA models.
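To make the idea of a prevalence covariate concrete, the toy sketch below shows how a covariate can shift expected topic proportions through a softmax link. It is a simplified illustration in base R, not how stm estimates anything: the coefficient matrix gamma, the scaled year variable, and the choice of three topics are all made-up assumptions, and the document-level noise that the real model adds is left out.

```r
# Toy illustration (assumed values, not estimated from data)
K = 3
years = c(2001, 2014)
# Design matrix: intercept plus year rescaled to the range 0..1
X = cbind(intercept = 1, year = (years - 2001) / 13)
# Hypothetical coefficients for topics 1 and 2; topic 3 is the reference topic
gamma = cbind(topic1 = c(0, 0),   # no year effect
              topic2 = c(0, 2))   # prevalence grows with year
# Linear predictor per document, with the reference topic fixed at 0
eta = cbind(X %*% gamma, topic3 = 0)
# Softmax turns the linear predictor into proportions that sum to 1
theta = exp(eta) / rowSums(exp(eta))
rownames(theta) = years
round(theta, 2)
```

In this sketch all topics are equally likely in the first year, while topic 2 dominates in the last year because of its positive year coefficient.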

In this document we will model the State of the Union speeches of Bush and Obama, using the year and President as covariates. First, we read the data into a (quanteda) corpus to make sure the metadata is kept with the text, and then we create a dfm:

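library(quanteda)
library(stm)
data("sotu_texts", package="corpustools")
sotu_texts$id = as.character(sotu_texts$id)
sotu_texts$year = as.numeric(format(sotu_texts$date, "%Y"))
sotu = corpus(sotu_texts, docid_field="id", text_field="text")
# Tokenize first: recent quanteda versions no longer accept remove_punct/remove in dfm()
sotu_toks = tokens_remove(tokens(sotu, remove_punct=TRUE), stopwords("english"))
sotu_dfm = dfm(sotu_toks)
# Keep only terms that occur at least twice (min_count was renamed to min_termfreq)
sotu_dfm = dfm_trim(sotu_dfm, min_termfreq = 2)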
head(docvars(sotu_dfm))
##                 date       party      president year
## 111552549 2001-02-27 Republicans George W. Bush 2001
## 111552556 2001-02-27 Republicans George W. Bush 2001
## 111552570 2001-02-27 Republicans George W. Bush 2001
## 111552599 2001-02-27 Republicans George W. Bush 2001
## 111542780 2001-02-27 Republicans George W. Bush 2001
## 111542800 2001-02-27 Republicans George W. Bush 2001

table(docvars(sotu_dfm)$president)
##   Barack Obama George W. Bush 
##            554            536

Now, we can use the stm function to fit a topic model. First, let’s fit one without any covariates:

m = stm(sotu_dfm, K = 10, max.em.its = 100, control=list(alpha=1))

This works similarly to a regular topic model, but it also models correlations between topics. To inspect the results, we can use functions from the stm package:

plot(m, type="summary")

labelTopics(m, topics=9)
## Topic 9 Top Words:
##       Highest Prob: energy, new, jobs, years, america, clean, economy 
##       FREX: house, solar, environment, energy, cleaner, clean, oil 
##       Lift: 2017, 75, 80, acres, all-of-the-above, answers, automobiles 
##       Score: house, energy, clean, oil, jobs, solar, renewable

By default, stm gives multiple ways of inspecting top words: simple highest probability (the first row) and three measures (FREX, Lift, and Score) that find the most ‘typical’ words by favoring words that are less common in other topics.
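As an illustration of why these rankings differ, the sketch below computes a FREX-style score on a made-up topic-word matrix: the weighted harmonic mean of a word's frequency rank and its exclusivity rank within a topic. The matrix beta and the helper names frex and ecdf_rows are hypothetical, and the stm implementation may differ in its details, so treat this as a way to build intuition rather than a reproduction of labelTopics.

```r
# Made-up topic-word probabilities: 2 topics (rows) x 4 words (columns)
beta = rbind(
  topic1 = c(the = 0.50, energy = 0.30, oil = 0.15, tax = 0.05),
  topic2 = c(the = 0.50, energy = 0.05, oil = 0.05, tax = 0.40)
)

# Within each row, replace values by their empirical CDF rank in (0, 1]
ecdf_rows = function(m) {
  out = t(apply(m, 1, function(x) ecdf(x)(x)))
  dimnames(out) = dimnames(m)
  out
}

# FREX-style score: harmonic mean of exclusivity rank and frequency rank
frex = function(beta, w = 0.5) {
  # Exclusivity: share of a word's total probability that belongs to each topic
  excl = sweep(beta, 2, colSums(beta), "/")
  1 / (w / ecdf_rows(excl) + (1 - w) / ecdf_rows(beta))
}

scores = frex(beta)
# Most probable word in topic 1 is the shared word "the" ...
names(which.max(beta["topic1", ]))
# ... while the FREX-style score prefers "energy", which is frequent and exclusive
names(which.max(scores["topic1", ]))
```

The weight w plays the same role as the frexweight argument of labelTopics: higher values put more emphasis on exclusivity.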

We can also plot the words per topic and the words ‘between’ two topics:

cloud(m, topic=9)

plot(m, type="perspectives", topics=c(4,5))

Prevalence covariate

Now, let’s model year as a covariate of the prevalence:

m2 = stm(sotu_dfm, K = 10, prevalence =~ year, max.em.its = 100)

Besides the functions above, we can now also model the effect of year using the estimateEffect function:

prep <- estimateEffect(1:10 ~ year, stmobj = m2, metadata = docvars(sotu_dfm))
summary(prep, topics=1)
## Call:
## estimateEffect(formula = 1:10 ~ year, stmobj = m2, metadata = docvars(sotu_dfm))
## Topic 1:
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.191420   3.250115  -4.366 1.38e-05 ***
## year          0.007116   0.001619   4.396 1.21e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
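To read these coefficients, recall that the regression is linear in the topic proportion, so the expected prevalence of topic 1 in a given year is the intercept plus the slope times the year. The sketch below plugs the printed estimates into that formula. The helper name expected_prevalence and the two years are chosen purely for illustration, and because estimateEffect relies on simulation, the coefficients from your own run will likely differ slightly.

```r
# Coefficients copied from the summary output above (topic 1)
intercept = -14.191420
slope = 0.007116

# Hypothetical helper: expected proportion of topic 1 for a given year
expected_prevalence = function(year) intercept + slope * year

# Roughly 0.048 in 2001 and 0.119 in 2011
expected_prevalence(c(2001, 2011))

# Change in expected prevalence over one decade: 10 * slope, about 0.071
expected_prevalence(2011) - expected_prevalence(2001)
```

In other words, the small-looking slope of 0.007 per year corresponds to an increase of about seven percentage points in topic prevalence per decade.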

We can also plot the topic prevalences over time:

plot(prep, "year", method = "continuous", topics = c(1,3), model = m2)