Introduction

 
This tutorial provides an exemplary implementation of a typical application of dictionary-based text analysis: retrieving the sentiment expressed in documents. The positivity or negativity that a speaker or author expresses in political communication can provide relevant evidence for many political science arguments.
 
So let us have a look at one way of implementing such analyses in R while also addressing some of the validity pitfalls mentioned during the seminar. We exploit our running example of climate change talk in the United Nations General Assembly. Again, these analyses should not be taken as final - rather, you are cordially invited to push them further on your own!

 
 

Prepare R session and data

 
As in the first tutorial (on session 2 - if you haven’t seen it yet, start there!), we initially attach some add-on packages to our R session, specify the working directory, and load a cleaned version of the UNGD corpus.
 

# Packages
library(tidyverse) # data management tools, includes stringr for text manipulation and ggplot for plotting
library(quanteda) # efficient implementations of many text-as-data functions
library(coefplot) # to extract coefficients from regression models

# Working directory
# setwd("Your/Path")

# Load UNGD text data
load("UNGD-corpus.Rdata") 

 
And again we heavily rely on the object classes and functions provided by the quanteda package. As in the first tutorial, the following steps transform the raw texts of the UNGA speeches into a quanteda corpus we call qungd, then extract a 100-term window (50 terms on each side) around each climate change or global warming reference via the kwic function, and finally store the resulting ‘documents’ in a separate corpus object called climatecorp.
 

# A quanteda corpus with year (column 1), ISO country code (7) and country name (9) as document variables
qungd <- corpus(ungd$text, docvars = ungd[ , c(1,7,9)]) 

# Focus on climate change and global warming - 100-term windows (50 terms on each side) around our markers
climate_kw <- kwic(qungd, phrase(c("global warming", "climate change", "global-warming", "climate-change")), valuetype = "fixed", case_insensitive = T, window = 50)

# Clean resulting texts
climate <- data.frame(climate_kw)
climate$text <- paste(" ", climate$pre, " ", climate$post, " ", sep = "") # Combining the text around the markers (without markers themselves)
climate$text <- str_replace_all(climate$text, regex("(global warming)|(climate change)", ignore_case = TRUE), " ") # Remove remaining markers (KWIC overlap cases), case-insensitively as in the kwic call above
climate$pre <- climate$post <- climate$keyword <- climate$from <- climate$to <- NULL

# Add document variables
docids <- docvars(qungd) # From underlying corpus
docids$docname <- docnames(qungd) # Document identifiers in corpus
climate <- merge(climate, docids, by = "docname", all.x = T) # Merge
climate$docname <- NULL

# Aggregate to year and country
climate2 <- aggregate(text ~ year+country+iso3c, paste, collapse = " ", data = climate)

# Turn this into a separate corpus
climatecorp <- corpus(climate2$text, docvars = climate2[, c(1:3)])

 
 

Sentiment analysis

 
Dictionary-based sentiment analyses derive the positivity/negativity conveyed by a text by counting and aggregating the occurrences of positively or negatively rated words from a pre-defined sentiment dictionary.
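
To make the counting logic concrete, here is a minimal toy sketch; the two-category dictionary and the example sentence are invented for illustration only and are not part of the UNGA analysis.

# Toy dictionary and toy text, purely for illustration
toy_dict <- dictionary(list(positive = c("good", "great"), negative = c("bad", "terribl*")))
toy_text <- "The negotiations went great, but the outcome was bad and terribly disappointing."
dfm(toy_text, dictionary = toy_dict, tolower = TRUE) # Should count 1 'positive' and 2 'negative' hits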

To the extent that we can assume that the texts under analysis focus primarily on the political object of interest (climate change in our running example), the resulting sentiment scores can be interpreted as the degree of support or opposition a text expresses towards that object. This assumption holds in some contexts (see e.g. here) but should not be taken for granted, as we will see below. In our application, for example, the choice of 100-term windows is somewhat arbitrary, and one could think of more focused approaches to unitizing (sentences, sentences in which climate change is the grammatical object, and so on; see the sketch below). But for the sake of simplicity, let’s assume that this window roughly captures what a national delegate has to say on or in the context of climate change (you are welcome to apply the code presented here to more refined windows, of course).
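
If you want to experiment with such sentence-level units, a rough sketch could look as follows; corpus_reshape() and corpus_subset() are standard quanteda functions, while the object names qungd_sent, hits, and climatesent are illustrative choices that are not used elsewhere in this tutorial.

# Rough sketch: sentence-level unitizing as an alternative to fixed 100-term windows
qungd_sent <- corpus_reshape(qungd, to = "sentences") # One 'document' per sentence
hits <- grepl("global warming|climate change", texts(qungd_sent), ignore.case = TRUE)
climatesent <- corpus_subset(qungd_sent, hits) # Keep only sentences that mention our markers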
 

The sentiment dictionary

 
The validity of a sentiment analysis stands or falls with the quality and suitability of the dictionary used. Words that convey positivity in some contexts might be irrelevant or even carry negative connotations in other contexts. So you should ensure that the dictionary you use matches the (type of) language in the political setting you want to analyse.
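
One simple (and admittedly partial) way to probe this is sketched below: list which dictionary terms actually drive the counts in your own material and read a few of them in context. The sketch uses the negative category of the LSD dictionary introduced in detail just below; the term picked for closer inspection ('conviction*', rated negative in the LSD although 'conviction' often expresses firm belief in diplomatic speech) is only an illustrative choice.

# Sketch of a simple suitability check: which 'negative' terms fire most often in our material?
neg_hits <- dfm(climatecorp, select = data_dictionary_LSD2015$negative, tolower = TRUE)
topfeatures(neg_hits, 20) # Most frequently matched 'negative' terms
head(kwic(climatecorp, "conviction*", window = 5)) # Read a potentially context-dependent term in context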
 
For our example, we resort to the Lexicoder Sentiment Dictionary (LSD) provided by Lori Young and Stuart Soroka (2011) because this particular dictionary has been extensively validated against how human readers assess political news in the English language (Proksch et al. have recently validated translated versions of this dictionary as well). Conveniently, the LSD dictionary is also shipped directly with the quanteda package. Let’s store this dictionary in a dedicated object and look at its structure and some sample terms.
 

lsd <- data_dictionary_LSD2015 # LSD as shipped with quanteda
str(lsd)
## Formal class 'dictionary2' [package "quanteda"] with 2 slots
##   ..@ .Data       :List of 4
##   .. ..$ :List of 1
##   .. .. ..$ : chr [1:2858] "a lie" "abandon*" "abas*" "abattoir*" ...
##   .. ..$ :List of 1
##   .. .. ..$ : chr [1:1709] "ability*" "abound*" "absolv*" "absorbent*" ...
##   .. ..$ :List of 1
##   .. .. ..$ : chr [1:1721] "best not" "better not" "no damag*" "no no" ...
##   .. ..$ :List of 1
##   .. .. ..$ : chr [1:2860] "not a lie" "not abandon*" "not abas*" "not abattoir*" ...
##   ..@ concatenator: chr " "
sample(lsd$positive, 10) # Random terms rated as positive
##  [1] "picturesqu*"  "kindness*"    "conservation" "suitably"    
##  [5] "educate*"     "assent*"      "tact"         "veritab*"    
##  [9] "sweetly"      "sweeten*"
sample(lsd$negative, 10) # Random terms rated as negative
##  [1] "lag"         "domineer*"   "robs"        "trespass*"   "smasher*"   
##  [6] "tamper*"     "browbeaten*" "conviction*" "illness*"    "indebted*"

 
Note that the asterisk provides a wildcard that - in the quanteda context with valuetype = 'glob', the default - matches zero or more characters, so that 'deplet*' would match both 'depleted' and 'depletion', for example.
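
A tiny sketch with a made-up sentence illustrates this matching behaviour:

# Toy illustration of glob matching, not part of the analysis pipeline
toks <- tokens("Resources were depleted and further depletion looms")
tokens_select(toks, "deplet*", valuetype = "glob") # Keeps both "depleted" and "depletion"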
 

Retrieve sentiment scores (and a reference value)

 
Now we want to count how often the positive and negative terms in the dictionary occur in our documents on climate change talk in the UNGA. To this end we turn the climatecorp corpus established above into a document-feature matrix whose features are the four LSD categories, i.e. the counts of matched dictionary terms aggregated per category. Then we store the respective counts in the climate2 data frame.
 

# Create a matrix with dictionary term counts aggregated to dictionary category
sent <- dfm(climatecorp, dictionary = lsd, tolower = TRUE) 

featnames(sent) # Look at the order of features (the dictionary categories) in the resulting matrix
## [1] "negative"     "positive"     "neg_positive" "neg_negative"
# Extract the counts and store them in the data frame with the climate texts
# Note: document order of data and dfm is retained
climate2$neg <- as.numeric(sent[ ,1]) # Number of negative LSD terms
climate2$pos <- as.numeric(sent[ ,2]) # Number of positive LSD terms 
climate2$neg.pos <- as.numeric(sent[ ,3]) # Negated positive terms
climate2$neg.neg <- as.numeric(sent[ ,4]) # Negated negative terms

 
We now aggregate these counts into a net sentiment score on the document level by subtracting the sum of all negative (and negated positive) terms from the sum of all positive (and negated negative) terms. This raw score is then normalized by the overall number of terms in each document to obtain a more comparable measure.
 

# Calculate sentiment scores
climate2$sent.raw <- (climate2$pos + climate2$neg.neg) - (climate2$neg + climate2$neg.pos) # See formula in Young/Soroka 2011
hist(climate2$sent.raw, main = "Distribution of raw sentiment score")

# Calculate length of each document (sum of all term frequencies in a document-feature matrix without stopwords)
climate2$termlength <- as.numeric(rowSums(dfm(climate2$text, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, verbose = TRUE))) 

# Normalize raw sentiment score to length
climate2$sent <- climate2$sent.raw/climate2$termlength 
hist(climate2$sent, main = "Distribution of normalized sentiment score")

 
Note that we should be cautious about interpreting sentiment scores on a ratio scale where the zero point indicates ‘true’ neutrality. Rather, the score should be assessed against the ‘baseline sentiment’ in the context of interest. To get such a reference value, the following steps calculate the sentiment scores for all full UNGA speeches since 1987 and then extract the mean sentiment and its confidence interval. Against this reference we can more meaningfully assess to what extent climate change is framed as a positive or negative issue in the United Nations General Assembly.
 

sent.gen <- dfm(qungd, dictionary = lsd, tolower = TRUE) # Dictionary counts in whole speeches
ungd$neg <- as.numeric(sent.gen[ ,1]) # Number of negative LSD terms
ungd$pos <- as.numeric(sent.gen[ ,2]) # Number of positive LSD terms 
ungd$neg.pos <- as.numeric(sent.gen[ ,3]) # Negated positive terms
ungd$neg.neg <- as.numeric(sent.gen[ ,4]) # Negated negative terms

ungd$sent.raw <- (ungd$pos + ungd$neg.neg) - (ungd$neg + ungd$neg.pos) # See Young/Soroka 2011

# Calculate length of each document (sum of all term frequencies in a document-feature matrix without stopwords)
ungd$termlength <- as.numeric(rowSums(dfm(qungd, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, verbose = TRUE))) 
# Normalize raw sentiment score to text length
ungd$sent <- ungd$sent.raw/ungd$termlength 

# Calculate mean value and its confidence bounds
average.sentiment <- mean(ungd$sent[ungd$year >= 1987]) # Reference value mean
average.sentiment.sd <- sd(ungd$sent[ungd$year >= 1987]) # Reference value standard deviation
error <- qnorm(0.995)*average.sentiment.sd/sqrt(nrow(ungd[ungd$year >= 1987, ])) # Margin of error for a two-sided 99% confidence interval
average.sentiment.upper <- average.sentiment + error # Confidence interval upper bound
average.sentiment.lower <- average.sentiment - error # Confidence interval lower bound
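
For a first impression (just a quick sketch, no full interpretation yet), you can already compare the mean sentiment of the climate windows against this general baseline:

# Quick comparison: is climate change talk more or less positive than UNGA speech in general?
mean(climate2$sent) # Mean normalized sentiment of the climate windows
c(lower = average.sentiment.lower, mean = average.sentiment, upper = average.sentiment.upper) # Baseline reference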