Introduction

 
This tutorial provides an exemplary implementation of a text analysis method that aims to place documents on a scale (as discussed in Session 5 of the block seminar). In principle, such scaling procedures are extremely attractive for political science: locating the positions of actors (= authors, speakers, …) on specific conflict dimensions is core to most political analyses.

The text analysis procedures for scaling resemble other statistical methods of dimensionality reduction you may have encountered already (think of principal component, factor, cluster, or especially correspondence analysis). They treat variation in word usage as the key information for positioning documents relative to each other. As discussed in the seminar, whether text scaling methods uncover substantively relevant conflict dimensions hinges on a number of assumptions that need to be checked. So, again, don't take the analyses of climate change positions in the UNGA provided here at face value but treat them as an exemplary starting point to apply and adapt the provided code to your own material.

 
 

Prepare R session and data

 
Like in Tutorial 1 (if you haven’t seen it yet, start there!) we initially attach some add-on packages to our R session, specify the working directory, and load a cleaned version of the UNGD corpus.
 

# Packages
library(tidyverse) # data management tools, includes stringr for text manipulation and ggplot for plotting
library(quanteda) # efficient implementations of many text-as-data functions
library(coefplot) # to extract coefficients from regression models
library(knitr) # For neat visual output

# Working directory
# setwd("Your/Path")

# Load UNGD text data
load("UNGD-corpus.Rdata") 

 
And once more we use the power of the quanteda package. You know the steps by now: transform the raw texts of the UNGA speeches into a quanteda corpus (qungd), extract a 100-term window around each climate change or global warming reference with the kwic function, and finally store the resulting ‘documents’ in a separate corpus climatecorp.
 

# A quanteda corpus with year (column 1), ISO country code (7) and country name (9) as document variables
qungd <- corpus(ungd$text, docvars = ungd[ , c(1,7,9)]) 

# Focus on climate change and global warming - 100 term windows around our markers
climate_kw <- kwic(qungd, phrase(c("global warming", "climate change", "global-warming", "climate-change")), valuetype = "fixed", case_insensitive = T, window = 50)

# Clean resulting texts
climate <- data.frame(climate_kw)
climate$text <- paste(" ", climate$pre, " ", climate$post, " ", sep = "") # Combining the text around the markers (without markers themselves)
climate$text <- str_replace_all(climate$text, "(global warming)|(climate change)", " ") # Remove remaining markers (KWIC overlap cases)
climate$pre <- climate$post <- climate$keyword <- climate$from <- climate$to <- NULL

# Add document variables
docids <- docvars(qungd) # From underlying corpus
docids$docname <- row.names(docids) # Document identifiers in corpus
climate <- merge(climate, docids, by = "docname", all.x = T) # Merge
climate$docname <- NULL

# Remove country names (often: Self references)
counnames <- unique(climate$country)
counnames <- gsub(".", "", counnames, fixed = T) # remove dots
counnames <- gsub("&", "", counnames, fixed = T) # Remove & symbol
counnames <- gsub("   ", " ", counnames, fixed = T) # Multiple whitespaces to one
counnames <- gsub(" ", "|", counnames, fixed = T) # Split country names with multiple words
counnames <- paste(counnames, collapse = "|") # to regex with or
counnames <- gsub("||", "|", counnames, fixed = T)

climate$text <- gsub(counnames, " ", climate$text, fixed = F) # Replace country names in text with whitespace (takes some time)
climate$text <- gsub("\u00E2", " ", climate$text, fixed = T) # Strange symbol â (encoding issue)
climate$text <- gsub("\u0153", "", climate$text, fixed = T) # Strange symbol œ (encoding issue)


# Aggregate to year and country
climate2 <- aggregate(text ~ year+country+iso3c, paste, collapse = " ", data = climate)

# Turn this into a separate corpus
climatecorp <- corpus(climate2$text, docvars = climate2[, c(1:3)])

 
 

Wordfish: Unsupervised scaling

 
The Wordfish algorithm is a text scaling procedure originally introduced by Slapin and Proksch (2008) to study the positions of political parties on the basis of their manifestos or speeches.
It is 'unsupervised' in the sense that the analyst does not have to specify any anchors a priori. Rather, the procedure places the texts under analysis on one (and only one!) latent dimension, based on a statistical model of relative word frequencies (assuming a Poisson distribution of these frequencies - thus the name!). This model identifies words that discriminate most strongly among texts from different authors (parties, originally).
What the retrieved dimension actually represents has to be interpreted by the researcher ex post, primarily by contextualising the discriminatory weights of individual terms (captured in the 'beta' parameter) and the final positions of documents (captured in the 'theta' parameter).
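For reference, the statistical model behind these estimates can be written compactly in the parameter names quanteda reports:

y_ij ~ Poisson(lambda_ij),   log(lambda_ij) = alpha_i + psi_j + beta_j * theta_i

where y_ij is the count of word j in document i, alpha_i a document fixed effect (capturing differing text lengths), psi_j a word fixed effect (capturing how common word j is overall), beta_j the discriminatory word weight, and theta_i the latent position of document i.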
 
Given this basic idea, the approach should work best for documents that use very similar language (but with varying frequencies) and that are closely focused on the conflict dimension a researcher is interested in. Beyond party conflict, for example, Wordfish has been applied very successfully to study interest group influence by comparing the position papers these groups submitted on the same legislative initiative (Klüver 2008). In this setting it is very plausible to assume that the texts use a similar style and similar concepts, with varying relative emphasis.
It should not be taken for granted that our example on text around climate change references in UNGA speeches meets these demands of the model as well. But let’s play with this algorithm a little bit to see (and to enable you to adapt it in more focussed research projects!).

 

Text pre-processing

 
Wordfish is very sensitive to your pre-processing choices. On the one hand, the algorithm requires quite some computing power as it has to estimate word weights across the full frequency matrix (size: # of documents x # of unique terms). Any textual information that is not strictly informative is thus usually removed or reduced. Typical steps include:

  • Remove so-called 'stopwords', i.e. the most common words in the respective language that often have only grammatical functions but bear no substantive information (e.g. 'the', 'and', 'of'). quanteda conveniently provides respective lists for various languages.
  • Stem varying word inflections to a common root (e.g. "depletion"|"deplete"|"depleted" -> "deplet"). This reduces the size of the matrix drastically. But it also assumes that different inflections refer to the same concept.

On the other hand, since Wordfish is a purely inductive method, the results of the estimation will vary with the set of terms that you feed into it. So your choices in this regard are crucial. In particular, Wordfish is sensitive to outlier terms that are used only very rarely or by single authors only. Sometimes such outliers are thus removed (e.g. filtering out all terms that occur in less than 1% of documents; see the sketch below). However, depending on your research question, such outliers might be very informative! In our setting we keep them in.
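Should you want to filter out such rare terms in your own project, quanteda's dfm_trim function does this in one line; a minimal sketch (to be applied to the dfm constructed below, not used further in this tutorial):

# Sketch (not applied here): keep only word stems occurring in at least 1% of all documents
# climatedfm <- dfm_trim(climatedfm, min_docfreq = 0.01, docfreq_type = "prop")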
 
The following step constructs a document-feature matrix (dfm) from our climatecorp corpus and applies the pre-processing steps by using the respective arguments quanteda's dfm function provides.

quanteda_options(language_stemmer = "english")
climatedfm <- dfm(climatecorp, remove = stopwords("english"), tolower = TRUE, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, stem = TRUE, verbose = TRUE)
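Note that more recent quanteda versions (3.x and later) have moved most of these pre-processing arguments from dfm() to the tokens_* functions. If the call above throws warnings or errors on your installation, a roughly equivalent pipeline would look like this (a sketch, adapt to your installed version):

# Roughly equivalent pre-processing for newer quanteda versions (sketch)
climatetoks <- tokens(climatecorp, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
climatetoks <- tokens_tolower(climatetoks) # lower-case all terms
climatetoks <- tokens_remove(climatetoks, stopwords("english")) # drop English stopwords
climatetoks <- tokens_wordstem(climatetoks, language = "english") # stem inflections to common roots
climatedfm <- dfm(climatetoks) # turn into a document-feature matrix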

 
Because the scaling algorithm requires enough (overlapping) term frequencies and is computationally intensive, we also pool all text windows around climate change references into one document per country. This, however, is also a substantively relevant choice: it assumes that country positions on the underlying dimension are by and large fixed, and it makes it impossible to study changing country positions over time, for example.
The baseline message here is: your pre-processing choices matter for the results of a Wordfish estimation, and you should make them consciously in light of your specific theoretical needs and the structure of the texts you want to analyze.
 

climatedfm2 <- dfm_group(climatedfm, groups = docvars(climatedfm, "iso3c"))
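If changing positions over time are what you are after, you could instead pool by country-year combinations, at the price of a much sparser matrix and a longer estimation time. A minimal sketch (not used further here):

# Sketch (not applied here): pool the text windows by country-year instead of country only
# climatedfm_cy <- dfm_group(climatedfm, groups = paste(docvars(climatedfm, "iso3c"), docvars(climatedfm, "year"), sep = "_"))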

 

Apply and evaluate the algorithm

 

Now we apply the Wordfish algorithm as implemented in quanteda's textmodel_wordfish function (in more recent quanteda versions it is provided by the add-on package quanteda.textmodels) to this matrix and store the estimated model in the object wf. The dir argument merely fixes the orientation of the scale: it names a pair of documents of which the first is constrained to receive a lower theta value than the second.
 

wf <- textmodel_wordfish(climatedfm2, dir = c(1, 2))
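For a quick first overview of the estimates, the generic summary() function prints the estimated document positions together with a subset of the feature scores:

# Quick overview of estimated document positions and feature scores
# summary(wf)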

 
Before looking at the resulting country positions we should first try to understand the content of the dimension that differentiates the country documents from each other according to Wordfish. This interpretation is up to the researcher. But we can often learn from the term-level weights that pull a document towards either the 'left' or the 'right' of the latent dimension. So the following steps extract and analyse these weights from the stored wf model.
 

terms <- data.frame(wf$features, wf$beta, wf$psi, stringsAsFactors = F) # Extract term-level info from wf model
names(terms) <- sub("wf.", "", names(terms), fixed = T) # Clean names
terms <- terms[order(terms$beta), ] # Order by word weight
# view(terms) # Have a look at the list

# Top 25 most 'left' words
kable(head(terms, 25))
features beta psi
4839 injur -4.541910 -13.147215
3960 bolivarian -3.324486 -9.109812
3946 yasunã­ -2.774761 -6.841437
3927 itt -2.760789 -6.917320
3967 file -2.693335 -7.185891
3968 penguin -2.693335 -7.185891
3951 magic -2.680410 -8.063755
3926 yasuni -2.597361 -7.412858
3942 correa -2.597361 -7.412858
3970 spheniscus -2.597361 -7.412858
3971 mendiculus -2.597361 -7.412858
3972 galapago -2.597361 -7.412858
6422 5th -2.554970 -8.387276
6423 metaphys -2.554970 -8.387276
6424 reput -2.554970 -8.387276
6425 3rd -2.554970 -8.387276
6426 iglitz -2.554970 -8.387276
6427 batteri -2.554970 -8.387276
6428 version -2.554970 -8.387276
6429 prefabr -2.554970 -8.387276
6430 cortã -2.554970 -8.387276
6431 pizarro -2.554970 -8.387276
6432 edgar -2.554970 -8.387276
6433 morin -2.554970 -8.387276
6434 petrocarib -2.554970 -8.387276
# Top 25 most 'right' words
kable(tail(terms, 25))
features beta psi
6419 oam 3.186401 -12.45744
6420 turtl 3.186401 -12.45744
6421 tighten 3.186401 -12.45744
6347 apocalyps 3.771627 -13.68240
6354 buffet 3.771627 -13.68240
6362 contour 3.771627 -13.68240
6380 rudderless 3.771627 -13.68240
6385 painless 3.771627 -13.68240
6391 dither 3.771627 -13.68240
6392 bucket 3.771627 -13.68240
6393 bake 3.771627 -13.68240
6412 barefac 3.771627 -13.68240
6413 hoax 3.771627 -13.68240
6416 disavow 3.771627 -13.68240
6417 mint 3.771627 -13.68240
6348 banana 4.117946 -14.44545
6363 blameless 4.117946 -14.44545
6369 ruinous 4.117946 -14.44545
6381 vacuous 4.117946 -14.44545
6382 insincer 4.117946 -14.44545
6383 lip 4.117946 -14.44545
6414 clichã 4.117946 -14.44545
6415 truism 4.117946 -14.44545
6353 skelet 4.367658 -15.01112
6386 meander 4.564319 -15.46500

 
Mmmh … can you come up with a good interpretation of a dimension spanning between these word choices? Among the most extreme 'left' terms (or word stems, to be more precise) we find a number of terms pointing towards South America (bolivarian, yasuni, galapago, pizarro, petrocarib). On the 'right' side no immediately clear pattern emerges. We find some word stems that occur in political conflicts about climate change, such as 'hoax', 'truism', or 'apocalyps', but it is not immediately clear where they point.
 
Another way to analyse term-level information from a Wordfish model is the so-called 'Eiffel Tower' plot that can be produced with quanteda's textplot_scale1d function. It plots term weights against (estimated) word frequencies, which captures the Wordfish intuition that rare words used by only a few documents should be most informative for retrieving a conflict dimension. The following steps produce such a plot and highlight selected terms therein.
 

# Mark terms to highlight in plot
hl <- c("island", "predatori", "amazonian", "holocaust", "desert", "capitalist", "drought", "global", 
        "develop", "nation", "paradis", "nationalist", "continent", "glacier", "africa", "barrel", "oil", "coal",
        "sea", "temperatur", "reloc", "carbon", "economi", "disbelief", "pizarro", "yasunã", "pacific", "megastorm", 
        "tuvalu", "micronesia", "solomon", "coast", "truism", "apocalyps", "costal", "consumerist", "tsunami", "exposur", "coral", "anthropogen", 
        "scapegoat", "quota", "emiss", "exodus", "fossil", "scholarship")

# Plot the distribution of term-level weights and frequencies
textplot_scale1d(wf, margin = "features", highlighted = hl)+
  labs(x= "Estimated beta\n(Discriminatory word weights)",
       y= "Estimated psi\n(~ Word frequencies within and across documents)",
       title = "Wordfish: Estimated word stem positions in UNGA climate change talk")

ggsave(file = "PL_WordfishTermWeights.png", width = 28, height = 20, units = "cm") # Export plot

 
Here we initially see that terms (stemmed!) like ‘develop’, ‘nation’, ‘global’ but also ‘economi’ or ‘carbon’ bear little information for the Wordfish scaling. They occur frequently within and across climate-change related speech acts in the UNGA and thus do not help in differentiating documents.
In contrast, word stems such as 'capitalist' or 'nationalist', but also 'desert', 'oil', 'barrel', as well as 'amazonian' or 'pizarro' pull documents to the left of the latent dimension. Again this hints towards South America and especially towards oil-producing countries with left-populist governments from which we might expect such terminology.
Words that pull a document to the 'right' are 'island', 'sea' and 'coral', 'anthropogen', 'scholarship', but especially 'consumerist', 'costal', 'apocalyps', or 'truism'. These are words we would probably expect from delegates speaking about human-made climate change and the resulting threats to coastal regions and island states in particular.
 

Country positions according to Wordfish

 
With this information in mind, we now derive the document-level positions Wordfish has estimated (‘thetas’) and plot the 25 ‘right-most’ and the 25 ‘left-most’ countries on this dimension. The confidence intervals Wordfish provides are derived from bootstrapping across the word frequency distributions in each document.
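As a quicker built-in alternative, quanteda's textplot_scale1d function also plots all estimated document positions at once (a sketch; below we build a more customised plot of the 50 most extreme countries instead):

# Quick built-in overview of all document positions (sketch, not used for the figure below)
# textplot_scale1d(wf, margin = "documents")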
 

climate3 <- wf$x@docvars # The document variables (country, iso code)
climate3$theta <- as.numeric(wf$theta) # Estimated document position
climate3$se.theta <- as.numeric(wf$se.theta) # Standard error of the estimated document position
climate3$error <- qnorm(0.975)*climate3$se.theta # 95% confidence error
climate3$hi <- climate3$theta + climate3$error # 95% confidence upper bound
climate3$lo <- climate3$theta - climate3$error  # 95% confidence lower bound

climate3 <- climate3[order(climate3$theta), ] # Order by document position
low25 <- climate3[1:25, ]
low25$group <- "25 left-most countries"

climate3 <- climate3[order(-climate3$theta), ] # Order by document position (descending)
top25 <- climate3[1:25, ]
top25$group <- "25 right-most countries"

cases <- rbind(top25, low25)
cases$group <- factor(cases$group, levels = c("25 right-most countries", "25 left-most countries"))

ggplot(data = cases, aes(x = reorder(country, theta), y = theta))+
  geom_hline(yintercept = mean(climate3$theta), linetype = "dashed")+
  geom_pointrange(aes(ymin = lo, ymax = hi, colour = theta - mean(climate3$theta)))+
  scale_colour_gradient(low = "#0380b5", high = "#9e3173",
                        guide = "colourbar", aesthetics = "colour", name = "Distance from average climate change sentiment: ")+
  labs(title = "Wordfish positions of climate change related speeches in UN General Assembly",
       subtitle = "Wordfish theta estimates of 100-term windows around 'climate change' / 'global warming' pooled by country",
       x = "",
       y = "")+
  coord_flip()+
  facet_wrap(vars(group), scales="free_y", ncol = 1, strip.position="right")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 0, vjust = .5),
        text=element_text(family = "serif"),
        axis.line = element_line(colour = "black"),
        plot.title =element_text(size=12, face='bold'),
        axis.text.y = element_text(size = 10, face = "bold"),
        strip.text = element_text(face = "bold", size = 10),
        legend.position = "none")

ggsave(file = "PL_WordfishAcrossSelectedCountries.png", width = 24, height = 18, units = "cm") # Export to file

 
From this plot it becomes immediately clear that the 25 countries that Wordfish places furthest to the 'right' on the inductively derived dimension are exclusively island states.
The pattern among the most 'left' countries appears a bit more mixed at first sight. As expected from the term-level analysis we find South American states such as Venezuela, Ecuador, Colombia, Bolivia, and Paraguay. But Wordfish groups them together with many states from Northern and Central Africa or Central Asia. Beyond geographical position, the list seems to contain many medium-sized oil exporters.
 
To get additional information on what might be in this inductively derived dimension, the subsequent steps again employ the crude regression approach you know from the earlier tutorials already.
 

load("cmeans.Rdata") # Country data 1987-2017 averages, collected from World Bank, Wikipedia, Polity, and Freedom House (crude!)
countries <- merge(climate3, cmeans, by = "iso3c", all.x = T) # Merge with the estimated Wordfish positions
countries <- countries[complete.cases(countries), ] # Keep only countries with information on all variables

# Regression model
fit <- lm(scale(theta)~scale(gdpc) + scale(co2emiss) + scale(fuel.exp) + scale(democracy) + scale(eq.dist) + scale(below5) + scale(island), data = countries)
summary(fit) 
## 
## Call:
## lm(formula = scale(theta) ~ scale(gdpc) + scale(co2emiss) + scale(fuel.exp) + 
##     scale(democracy) + scale(eq.dist) + scale(below5) + scale(island), 
##     data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26225 -0.43558  0.08298  0.48275  1.96951 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.210e-16  5.304e-02   0.000  1.00000    
## scale(gdpc)      -1.535e-01  1.133e-01  -1.355  0.17715    
## scale(co2emiss)   1.505e-01  1.139e-01   1.321  0.18809    
## scale(fuel.exp)  -1.417e-01  6.496e-02  -2.182  0.03039 *  
## scale(democracy)  1.374e-01  6.569e-02   2.092  0.03780 *  
## scale(eq.dist)   -7.025e-02  6.482e-02  -1.084  0.27989    
## scale(below5)     1.867e-01  5.663e-02   3.296  0.00118 ** 
## scale(island)     4.996e-01  6.203e-02   8.054 9.96e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7331 on 183 degrees of freedom
## Multiple R-squared:  0.4824, Adjusted R-squared:  0.4626 
## F-statistic: 24.36 on 7 and 183 DF,  p-value: < 2.2e-16
# Coefficient data
reg.data <- coefplot(fit, plot = FALSE) # Get coefficient data
reg.data <- reg.data[reg.data$Coefficient != "(Intercept)", ] # Uninformative in a standardized model
reg.data$name <- c("GDP per capita", "CO2 emissions per capita", "Fossil fuel exports", "Democracy", "Distance to equator", "Area below 5m (%)", "Island state") # Human readable names
reg.data <- reg.data[order(reg.data$Value), ] # Sort in ascending effect size
reg.data$name <- factor(reg.data$name, levels = reg.data$name)

# And plot
ggplot(data = reg.data, aes(x=name, y = Value))+
  geom_hline(yintercept = 0, linetype = "dashed", colour = "red", size = 1)+
  geom_linerange(aes(ymin = LowOuter, ymax = HighOuter), size = .8, colour = "#0380b5")+
  geom_pointrange(aes(ymin = LowInner, ymax = HighInner), size = 1.5, colour = "#0380b5")+
  labs(title = "How do states position themselves on climate change?",
       subtitle = "Explaining the estimated Wordfish position around climate change references by some crude national-level variables",
       y = "Standardized regression coefficient\nwith 95 and 99% confidence intervals\n",
       x = "",
       caption = "Linear Model, n = 191 countries, Adj. R2: .46.")+
  coord_flip()+
  theme_bw()+
  theme(axis.text.x = element_text(size = 14, angle = 0, vjust = .5),
        axis.title = element_text(size = 14),
        axis.text.y = element_text(size = 14, face = "bold", vjust = .3),
        text=element_text(family = "serif"),
        axis.line = element_line(colour = "black"),
        plot.title =element_text(size=16, face='bold'),
        plot.subtitle = element_text(size=14),
        plot.caption = element_text(size = 14))

ggsave(file = "PL_WordfishOverCountries_Model.png", width = 30, height = 18, units = "cm") # Export to file

 
So, being an island state or having a large share of surface area below 5m means - with high statistical certainty - that you are further to the 'right' on the dimension Wordfish has found. This holds for democracies, too (though there is more variation here). Being an exporter of fossil fuels, in contrast, pulls you towards the 'left' of this dimension.
 
Together with the insights above we may thus cautiously conclude that the dimension uncovered by Wordfish spans from those states that are most severely hit by the consequences of climate change to those that would be most severely hit by climate change mitigation measures. But we have also seen that this result might partially be driven by the specific word usage of South American delegates, thus blending in other 'true' conflict dimensions.
 
Whether the uncovered dimension matters and whether the measure is ‘valid enough’ to be used in further analyses, then, depends primarily on your specific research question and theoretical demands!
 
If you are interested in other dimensions, you might want to apply a correspondence analysis to the matrix of document-term frequencies. This allows you to model several dimensions of clustered word usage at once and is conveniently implemented in quanteda’s textmodel_ca function.
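A minimal sketch of what that could look like on our grouped matrix (in newer quanteda versions, textmodel_ca is likewise provided by the quanteda.textmodels add-on package):

# Sketch: correspondence analysis with two dimensions on the grouped dfm
# camodel <- textmodel_ca(climatedfm2, nd = 2)
# head(camodel$rowcoord) # document coordinates on the estimated dimensions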
 
Or you head over to the next tutorial: There we implement Wordscores, a scaling method that is based on explicit anchors that the researcher supplies to identify an a priori defined dimension…