Introduction

 
In this tutorial we estimate some (very crude and simple) topic models on the basis of the texts around climate change references in the United Nations General Assembly that we have identified earlier.

Topic models are a set of ‘bag of words’ algorithms originally intended to structure and to search large digital text collections. They rest on the assumption that authors first decide on the (mix of) topics a text should cover and then choose an appropriate distribution of words for the text. The algorithms reverse-engineer this assumed process, optimizing a function that maximizes the likelihood of the observed term distribution under a fixed number of latent topics. They build inductively on the observed co-occurrences of individual terms and summarize the statistically optimal patterns, which - in principle - represent interpretable categorizations of individual documents.

The basic idea of identifying latent and possibly overlapping topics in text along these lines initially seems very promising for political scientists as well: topic models may offer a rather efficient way of tracking political attention to something like ‘topics’ across time or actor types, for example. However, as discussed extensively during the seminar, topic models come with very high ex-post interpretation demands and - as an essentially inductive method - are very sensitive to the input the researcher provides. To make you further aware of these issues and to enable you to implement topic models in R (and to tweak them further!), the following pages provide exemplary (yet hardly very valid!) approaches…
 
 

Prepare the R session

 
Again we load the packages needed for the subsequent steps (and again you might need to install some of them first via install.packages("package name")), tell R about our working directory, and load the UNGD corpus.
 

# Packages
library(tidyverse) # data management tools, includes stringr for text manipulation and ggplot for plotting
library(tidytext) # a text mining infrastructure exploiting the tidy data principles
library(quanteda) # efficient implementations of many text-as-data functions
library(topicmodels) # implementation of standard topic model algorithms
library(knitr) # for neat visual output
library(reshape2) # for reshaping data frames

# Working directory
# setwd("Your/Path")

# Load UNGD text data
load("UNGD-corpus.Rdata")

 
As earlier, we then extract the 100-term window around climate change / global warming references and store it in objects for use with the quanteda package.
 

# A quanteda corpus with year (column 1), ISO country code (7) and country name (9) as document variables
qungd <- corpus(ungd$text, docvars = ungd[ , c(1,7,9)]) 

# Focus on climate change and global warming - 100 term windows around our markers
climate_kw <- kwic(qungd, phrase(c("global warming", "climate change", "global-warming", "climate-change")), valuetype = "fixed", case_insensitive = T, window = 50)

# Clean resulting texts
climate <- data.frame(climate_kw)
climate$text <- paste(" ", climate$pre, " ", climate$post, " ", sep = "") # Combining the text around the markers (without markers themselves)
climate$text <- str_replace_all(climate$text, "(global warming)|(climate change)", " ") # Remove remaining markers (KWIC overlap cases)
climate$pre <- climate$post <- climate$keyword <- climate$from <- climate$to <- NULL

# Add document variables
docids <- docvars(qungd) # From underlying corpus
docids$docname <- row.names(docids) # Document identifiers in corpus
climate <- merge(climate, docids, by = "docname", all.x = T) # Merge
climate$docname <- NULL

 
We also clean and aggregate the data a bit … note that these steps affect our results later and might make no sense for a given substantive research question. Take such choices consciously in a real project.
 

# Remove country names (often: Self references)
counnames <- unique(climate$country)
counnames <- gsub(".", "", counnames, fixed = T) # remove dots
counnames <- gsub("&", "", counnames, fixed = T) # Remove & symbol
counnames <- gsub("   ", " ", counnames, fixed = T) # Multiple whitespaces to one
counnames <- gsub(" ", "|", counnames, fixed = T) # Split country names with multiple words
counnames <- paste(counnames, collapse = "|") # to regex with or
counnames <- gsub("||", "|", counnames, fixed = T)

# Replace country names in text with whitespace (takes some time)
climate$text <- gsub(counnames, " ", climate$text, fixed = F) 

climate$text <- gsub("\u00E2", " ", climate$text, fixed = T) # Strange symbol â (encoding issue)
climate$text <- gsub("\u0153", "", climate$text, fixed = T) # Strange symbol œ (encoding issue)

# Aggregate to year and country
climate2 <- aggregate(text ~ year+country+iso3c, paste, collapse = " ", data = climate)

# Turn this into a separate quanteda corpus
climatecorp <- corpus(climate2$text, docvars = climate2[, c(1:3)])

 
 

A first ‘hip shot’ topic model

 
As ‘bag of words’ approaches, topic models are estimated on the basis of a document-feature matrix (DFM). We build one from our corpus here with the respective quanteda function. Note that there is some mild, nevertheless probably relevant pre-processing involved (stopword and number removal, e.g.).
 

climatedfm <- dfm(climatecorp, remove = stopwords("english"), 
                  tolower = TRUE, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, 
                  stem = FALSE)

 
quanteda itself doesn’t feature topic model estimation but offers a convenient conversion function to transform the DFM into the document-term matrix format expected by the topicmodels package, which provides the estimation function we will use below.

climatedfm.tm <- convert(climatedfm, to = "topicmodels")

 
Now we can estimate the basic Latent Dirichlet Allocation (LDA) topic model by using the respective function from the topicmodels package.
We - arbitrarily - assume that our climate-change related texts contain five different topics. Note that topic models rest on maximum likelihood estimation: the results might change slightly depending on the starting values of the algorithm. To get reproducible results, I fix a corresponding seed parameter directly in the function call. Note that the LDA estimation may take some time …

ldafit <- LDA(climatedfm.tm, k = 5, control = list(seed = 29042019))

 
Now we can look at the top ten terms in each topic via terms(ldafit, 10). To aid interpretation further, however, we also want to have a look at the estimated probabilities that a term is generated from each of the latent topics. To visualise this, I borrow the approach from Julia Silge and David Robinson’s tidytext book (which might be of interest to you!).
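Before that, for a quick tabular overview, the terms() call just mentioned is enough (it returns a matrix with the most probable terms per topic):

terms(ldafit, 10) # Ten most probable terms for each of the five topics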

First we extract the probabilities for each term in each topic (the so-called betas) from the fitted LDA object.

topics <- tidy(ldafit, matrix = "beta") 

 
Then we extract the top-15 terms with the highest beta per topic (if you want to go into detail, the code exploits functions from the dplyr package, which is part of the tidyverse suite we have attached above).

top_terms <- topics %>%
  group_by(topic) %>% # Group by each of the five topics
  top_n(15, beta) %>% # Extract the top-15 terms with the highest beta per topic
  ungroup() %>% # Ungroup
  arrange(topic, -beta) # Order by topic and beta values

# Have a look at the data structure (first six rows)
kable(head(top_terms))
topic   term          beta
1       development   0.0159371
1       countries     0.0158269
1       climate       0.0145704
1       nations       0.0133312
1       convention    0.0108202
1       framework     0.0106055

 
 
Then we plot these top-15 terms (again exploiting dplyr for sorting the data and ggplot for the visualisation).

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 5) +
  labs(y="Beta\n(Probability that a term is generated from this latent topic)",
       x= "")+
  coord_flip()

# ggsave(file = "ClimateTopics_5.png", width = 26, height = 12, units = "cm") # If you want to export the plot

 
And only at this stage does topic modelling demand the analyst’s interpretation. Can you make sense of these patterns? Can you spot ‘topics’ in these word co-occurrence patterns that would make substantive sense in the context of climate change talk in the UN General Assembly?
 
 
Admittedly, I have a hard time interpreting these patterns substantively (which might also be due to a lack of contextual knowledge, of course). However, one might suspect that topic 1 has something to do with the Kyoto Protocol, while the terms that ‘load’ high on topic 2 have something to do with the Paris Agreement on Climate Change. (And as in the other tutorials, the island states figure in topic 3 again.)

[Note that this observation points to another problem of interpreting topic model outputs: what exactly makes a topic a topic in substantive terms? One might argue, for example, that international agreements are a substantively relevant ‘topic’ in international discourse, but the way we have set up the model, it results in one topic per agreement. The change of topic-specific language over time that this example suggests is an issue that so-called Dynamic Topic Models aim to address (see, e.g., Greene and Cross (2017)).]

In any case, interpreting ‘topics’ requires justification and ideally validation. With regard to our hunch that we have a ‘Kyoto’ and a ‘Paris’ topic, we would, for example, expect the prevalence of these topics to vary systematically with the negotiation of the respective agreements. To study this, we first use a function from the topicmodels package that predicts the most likely topic for each of our documents based on the terms it contains.
Since the order of documents has remained unchanged throughout the earlier steps, we can directly write the predicted topic into the data frame containing our raw climate change texts.

climate2$main.topic <- topics(ldafit) # Predict main topic for each text
climate2$topic1 <- as.numeric(climate2$main.topic == 1) # variable to store whether main predicted topic is topic 1 (Kyoto)
climate2$topic2 <- as.numeric(climate2$main.topic == 2) # variable to store whether main predicted topic is topic 2 (Paris)

 
Then we build a data set storing the occurrences of each of the two topics per year (in the long format preferred by the ggplot syntax) and visualise the absolute annual frequency of the two topics over time.

topicfreq <- climate2[,c("year", "topic1", "topic2")]
topicfreq <- melt(topicfreq, id.vars = "year")

topicfreq$variable <- as.character(topicfreq$variable)

topicfreq$variable[topicfreq$variable == "topic1"] <- "Topic 1 (Kyoto protocol?)"
topicfreq$variable[topicfreq$variable == "topic2"] <- "Topic 2 (Paris agreement?)"

ggplot(data= topicfreq, aes(x=year, y=value, colour = variable))+
  stat_summary(geom = "line", fun.y = "sum")+
  scale_x_continuous(breaks = seq(1970, 2015, 5), minor_breaks = seq(1970, 2017, 1))+
  labs(title = "Frequency of estimated topics 1 and 2",
       subtitle = "",
       y = "Number of climate change references with topic\n",
       x = "",
       color = "Estimated main topic: ")+
  theme_bw()+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 0, vjust = .5),
        text=element_text(family = "serif"),
        panel.border = element_blank(),
        axis.line = element_line(colour = "black"),
        plot.title =element_text(size=12, face='bold'))

# ggsave(file = "PL_CC-TopicsOverTime.png", width = 18, height = 12, units = "cm") # if you want to export the plot

There is some evidence to suggest that these topics tap into the political salience of the two agreements. The red line, indicating the prevalence of topic 1, has a local peak in 1997 when the Kyoto Protocol was signed, another one in 2007 just before the first commitment period started, and another one in 2013, after the first commitment period had expired and various states declined to take on targets for the second. The blue line, indicating topic 2 prevalence, peaks mainly in 2015 and 2016 when the Paris Agreement was negotiated and signed.
However, there are also a number of speeches with climate change references that are classified under either of these topics outside of these negotiation events. Whether this variation is substantively meaningful and interpretable should then be subject to further validation …
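One simple additional check - purely as a sketch - is to look at the per-document topic probabilities (the so-called gammas) rather than only the single most likely topic per document: if many documents have a rather flat gamma distribution, the ‘main topic’ assignments used above are less clear-cut than the plot suggests.

doc_topics <- tidy(ldafit, matrix = "gamma") # Per-document topic probabilities

doc_topics %>%
  group_by(document) %>%
  summarise(max_gamma = max(gamma)) %>% # Probability of the most likely topic per document
  ggplot(aes(max_gamma)) +
  geom_histogram(bins = 20) +
  labs(x = "Probability of the most likely topic per document",
       y = "Number of documents")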
 
 

Model sensitivity I : Specifying more topics

 

Interpreting the output of topic models is further complicated by the fact that topic models are extremely sensitive to the choices about the input parameters the researcher supplies prior to the estimation. To demonstrate this, we repeat the above steps but - again arbitrarily - let the LDA estimation optimize term distributions for ten topics instead of the five we have analyzed above.

ldafit <- LDA(climatedfm.tm, k = 10, control = list(seed = 29042019)) # Fit a 10-topic LDA model

topics <- tidy(ldafit, matrix = "beta") # Extract the probabilities for each term in each topic (betas) from the fitted LDA object

top_terms <- topics %>%
  group_by(topic) %>% # Group by each of the ten topics
  top_n(15, beta) %>% # Extract the top-15 terms with the highest beta per topic
  ungroup() %>% # Ungroup
  arrange(topic, -beta) # Order by topic and beta values

# And plot
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 5) +
  labs(y="Beta\n(Probability that a term is generated from this latent topic)",
       x= "")+
  coord_flip()

# ggsave(file = "ClimateTopics_10.png", width = 26, height = 24, units = "cm") # If you want to export the plot

 
Compared to the output of the five-topic model, this does not make the job of interpreting the term co-occurrences in substantive terms much easier. Topic 7, for example, initially looks like a more general ‘agreement’ or ‘treaty’ topic. But terms like ‘Paris’ or ‘agreement’ figure prominently in other ‘topics’ as well. Again we also see lots of term overlap across topics. For some research questions such overlap might be informative, but it may also suggest that UNGA representatives’ climate change talk does not offer as much substantive variation across topics as this model specification assumes.

Methodologically, the key message is that the number of topics specified can lead to different substantive interpretations. Thus, in a normal project the starting parameters have to be justified and the robustness of the results should be assessed empirically. While there is no hard qualitative criterion for specifying the ‘correct’ number of topics a priori, the recent literature has delved into statistical measures of model fit in this regard. If you are interested, the ldatuning and stm packages provide good starting points.
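If you want to experiment with this yourself, a data-driven search over candidate numbers of topics could look roughly like the following sketch using the ldatuning package (the topic range and the fit metrics chosen here are arbitrary assumptions for illustration, and estimating many candidate models takes a while):

library(ldatuning) # install.packages("ldatuning") if necessary

k_search <- FindTopicsNumber(climatedfm.tm, # The document-term matrix used above
                             topics = seq(5, 50, by = 5), # Candidate numbers of topics
                             metrics = c("CaoJuan2009", "Deveaud2014"), # Two common fit heuristics
                             method = "Gibbs",
                             control = list(seed = 29042019),
                             verbose = TRUE)

FindTopicsNumber_plot(k_search) # Compare the metrics across the candidate topic numbers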
 
 

Model sensitivity II: Feature selection

 
The output of topic models is furthermore highly sensitive to the text input the researcher supplies. Especially in light of the high term overlap observed above, one might think about selecting more informative terms (in machine learning more generally often called ‘features’). Being interested in politically selective topic emphasis, one might, for example, focus on terms that are used very frequently overall but appear in relatively few documents (and thus are used by relatively few speakers, cf. scaling).
To demonstrate such an approach, we ‘trim’ our original document-feature matrix and retain only terms from the top 5 per cent of the most frequent words in our texts that, at the same time, appear in at most 10 per cent of the documents (again, these thresholds are completely arbitrary and for demonstration purposes only).

climatedfm2 <- dfm_trim(climatedfm, min_termfreq = 0.95, termfreq_type = "quantile", # Trim DFM
                        max_docfreq = 0.1, docfreq_type = "prop")
climatedfm.tm <- convert(climatedfm2, to = "topicmodels") # Convert to tm object

 
Then we fit a 5-topic LDA model on this trimmed data and consider its term-topic-level output along the steps introduced above.

ldafit <- LDA(climatedfm.tm, k = 5, control = list(seed = 29042019)) # Fit the model on the trimmed DFM
topics <- tidy(ldafit, matrix = "beta") # Extract the probabilities for each term in each topic (betas) from the fitted LDA object

top_terms <- topics %>%
  group_by(topic) %>% # Group by each of the five topics
  top_n(15, beta) %>% # Extract the top-15 terms with the highest beta per topic
  ungroup() %>% # Ungroup
  arrange(topic, -beta) # Order by topic and beta values

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 5) +
  labs(y="Beta\n(Probability that a term is generated from this latent topic)",
       x= "")+
  coord_flip()

# ggsave(file = "ClimateTopics_5_FeatureSelection.png", width = 26, height = 12, units = "cm") # if you want to export the plot

 
This looks very different. The terms appear more specific, both in qualitative terms and across topics. Whether this is ‘correct’ and can or should be interpreted substantively ultimately depends on your research question. The key take-away here is that the output of a topic model - an essentially inductive exercise - is highly sensitive to the inputs (here: the selection of features) the researcher supplies.
If you want to know how such pre-processing steps affect the interpretations in your projects, the preText package has you covered!
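A minimal sketch of how such a check could look is given below; the function arguments are assumptions for illustration, so please consult the preText vignette for the exact interface (and note that comparing many pre-processing combinations can take quite some time):

library(preText) # install.packages("preText") if necessary

# Build DFMs under many combinations of common pre-processing choices
preprocessed <- factorial_preprocessing(climatecorp,
                                        use_ngrams = FALSE,
                                        infrequent_term_threshold = 0.2)

# Assess how unusual the document distances under each combination are
pretext_results <- preText(preprocessed,
                           dataset_name = "UNGA climate change references",
                           verbose = TRUE)

preText_score_plot(pretext_results) # preText scores per pre-processing specification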
 
So, go ahead and make it better!
 
 
p.s.: You might find further inspiration in Slava Mikhaylov’s interactive tool based on a 30-topic model of the full UNGA speeches 1970-2014…