Introduction

 
This tutorial provides an exemplary implementation of a typical application of dictionary-based text analysis: retrieving the sentiment expressed in documents. The positivity or negativity that a speaker or author expresses in political communication can provide relevant evidence for many political science arguments.
 
So let us have a look at one way of implementing such analyses in R while also addressing some of the validity pitfalls mentioned during the seminar. We exploit our running example of climate change talk in the United Nations General Assembly. Again, these analyses should not be taken as final - rather, you are cordially invited to push them further on your own!

 
 

Prepare R session and data

 
As in the first tutorial (on session 2 - if you haven’t seen it yet, start there!), we initially attach some add-on packages to our R session, specify the working directory, and load a cleaned version of the UNGD corpus.
 

# Packages
library(tidyverse) # data management tools, includes stringr for text manipulation and ggplot for plotting
library(quanteda) # efficient implementations of many text-as-data functions
library(coefplot) # to extract coefficients from regression models

# Working directory
# setwd("Your/Path")

# Load UNGD text data
load("UNGD-corpus.Rdata") 

 
And again we heavily rely on the object classes and functions provided by the quanteda package. As in the first tutorial, the following steps transform the raw texts of the UNGA speeches into a quanteda corpus we call qungd, then extract a 100-term window (50 terms on each side) around each climate change or global warming reference via the kwic function, and finally store the resulting ‘documents’ in a separate corpus object called climatecorp.
 

# A quanteda corpus with year (column 1), ISO country code (7) and country name (9) as document variables
qungd <- corpus(ungd$text, docvars = ungd[ , c(1,7,9)]) 

# Focus on climate change and global warming - 100-term windows (50 terms on each side) around our markers
climate_kw <- kwic(qungd, phrase(c("global warming", "climate change", "global-warming", "climate-change")), valuetype = "fixed", case_insensitive = T, window = 50)

# Clean resulting texts
climate <- data.frame(climate_kw)
climate$text <- paste(" ", climate$pre, " ", climate$post, " ", sep = "") # Combining the text around the markers (without markers themselves)
climate$text <- str_replace_all(climate$text, regex("(global warming)|(climate change)", ignore_case = TRUE), " ") # Remove remaining markers (KWIC overlap cases), case-insensitively as in the kwic call above
climate$pre <- climate$post <- climate$keyword <- climate$from <- climate$to <- NULL

# Add document variables
docids <- docvars(qungd) # From underlying corpus
docids$docname <- docnames(qungd) # Document identifiers in corpus
climate <- merge(climate, docids, by = "docname", all.x = T) # Merge
climate$docname <- NULL

# Aggregate to year and country
climate2 <- aggregate(text ~ year+country+iso3c, paste, collapse = " ", data = climate)

# Turn this into a separate corpus
climatecorp <- corpus(climate2$text, docvars = climate2[, c(1:3)])

 
 

Sentiment analysis

 
Dictionary-based sentiment analyses derive the positivity/negativity conveyed by a text by counting and aggregating the occurrences of positively or negatively rated words from a pre-defined sentiment dictionary.
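
To make the counting logic concrete, here is a minimal toy sketch; the two-category dictionary and the example sentence are invented for illustration only and are not part of the UNGA analysis.

# Toy dictionary and toy text, purely for illustration
toy_dict <- dictionary(list(positive = c("good", "great"), negative = c("bad", "terribl*")))
toy_text <- "The negotiations went great, but the outcome was bad and terribly disappointing."
dfm(toy_text, dictionary = toy_dict, tolower = TRUE) # Should count 1 'positive' and 2 'negative' hits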

To the extent that we can assume that the texts under analysis focus primarily on the political object of interest (climate change in our running example), the resulting sentiment scores can be interpreted as the degree of support or opposition a text expresses towards that object. This assumption holds in some contexts (see e.g. here) but should not be taken for granted, as we will see below. In our application, for example, the choice of 100-term windows is somewhat arbitrary, and one could think of more focused approaches to unitizing (sentences, sentences in which climate change is the grammatical object, and so on; see the sketch below). But for the sake of simplicity, let’s assume that this window roughly captures what a national delegate has to say on or in the context of climate change (you are welcome to apply the code presented here to more refined windows, of course).
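
If you want to experiment with such sentence-level units, a rough sketch could look as follows; corpus_reshape() and corpus_subset() are standard quanteda functions, while the object names qungd_sent, hits, and climatesent are illustrative choices that are not used elsewhere in this tutorial.

# Rough sketch: sentence-level unitizing as an alternative to fixed 100-term windows
qungd_sent <- corpus_reshape(qungd, to = "sentences") # One 'document' per sentence
hits <- grepl("global warming|climate change", texts(qungd_sent), ignore.case = TRUE)
climatesent <- corpus_subset(qungd_sent, hits) # Keep only sentences that mention our markers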
 

The sentiment dictionary

 
The validity of a sentiment analysis stands or falls with the quality and suitability of the dictionary used. Words that convey positivity in some contexts might be irrelevant or even carry negative connotations in other contexts. So you should ensure that the dictionary you use matches the (type of) language in the political setting you want to analyse.
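
One simple (and admittedly partial) way to probe this is sketched below: list which dictionary terms actually drive the counts in your own material and read a few of them in context. The sketch uses the negative category of the LSD dictionary introduced in detail just below; the term picked for closer inspection ('conviction*', rated negative in the LSD although 'conviction' often expresses firm belief in diplomatic speech) is only an illustrative choice.

# Sketch of a simple suitability check: which 'negative' terms fire most often in our material?
neg_hits <- dfm(climatecorp, select = data_dictionary_LSD2015$negative, tolower = TRUE)
topfeatures(neg_hits, 20) # Most frequently matched 'negative' terms
head(kwic(climatecorp, "conviction*", window = 5)) # Read a potentially context-dependent term in context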
 
For our example, we resort to the Lexicoder Sentiment Dictionary (LSD) provided by Lori Young and Stuart Soroka (2011) because this particular dictionary has been extensively validated against how human readers assess political news in the English language (Proksch et al. have recently validated translated versions of this dictionary as well). Conveniently, the LSD dictionary is also shipped directly with the quanteda package. Let’s store this dictionary in a dedicated object and look at its structure and some sample terms.
 

lsd <- data_dictionary_LSD2015 # LSD as shipped with quanteda
str(lsd)
## Formal class 'dictionary2' [package "quanteda"] with 2 slots
##   ..@ .Data       :List of 4
##   .. ..$ :List of 1
##   .. .. ..$ : chr [1:2858] "a lie" "abandon*" "abas*" "abattoir*" ...
##   .. ..$ :List of 1
##   .. .. ..$ : chr [1:1709] "ability*" "abound*" "absolv*" "absorbent*" ...
##   .. ..$ :List of 1
##   .. .. ..$ : chr [1:1721] "best not" "better not" "no damag*" "no no" ...
##   .. ..$ :List of 1
##   .. .. ..$ : chr [1:2860] "not a lie" "not abandon*" "not abas*" "not abattoir*" ...
##   ..@ concatenator: chr " "
sample(lsd$positive, 10) # Random terms rated as positive
##  [1] "picturesqu*"  "kindness*"    "conservation" "suitably"    
##  [5] "educate*"     "assent*"      "tact"         "veritab*"    
##  [9] "sweetly"      "sweeten*"
sample(lsd$negative, 10) # Random terms rated as negative
##  [1] "lag"         "domineer*"   "robs"        "trespass*"   "smasher*"   
##  [6] "tamper*"     "browbeaten*" "conviction*" "illness*"    "indebted*"

 
Note that the asterisk provides a wildcard that - in the quanteda context with valuetype = 'glob', the default - matches zero or more characters, so that 'deplet*' would match both 'depleted' and 'depletion', for example.
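
A tiny sketch with a made-up sentence illustrates this matching behaviour:

# Toy illustration of glob matching, not part of the analysis pipeline
toks <- tokens("Resources were depleted and further depletion looms")
tokens_select(toks, "deplet*", valuetype = "glob") # Keeps both "depleted" and "depletion"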
 

Retrieve sentiment scores (and a reference value)

 
Now we want to count how often the positive and negative terms in the dictionary occur in our documents on climate change talk in the UNGA. To this end we turn the climatecorp corpus established above into a document-feature matrix whose features are the four LSD categories, i.e. the counts of matched dictionary terms aggregated per category. Then we store the respective counts in the climate2 data frame.
 

# Create a matrix with dictionary term counts aggregated to dictionary category
sent <- dfm(climatecorp, dictionary = lsd, tolower = TRUE) 

featnames(sent) # Look at the order of features (the dictionary categories) in the resulting matrix
## [1] "negative"     "positive"     "neg_positive" "neg_negative"
# Extract the counts and store them in the data frame with the climate texts
# Note: document order of data and dfm is retained
climate2$neg <- as.numeric(sent[ ,1]) # Number of negative LSD terms
climate2$pos <- as.numeric(sent[ ,2]) # Number of positive LSD terms 
climate2$neg.pos <- as.numeric(sent[ ,3]) # Negated positive terms
climate2$neg.neg <- as.numeric(sent[ ,4]) # Negated negative terms

 
We now aggregate these counts into a net sentiment score on the document level by subtracting the sum of all negative (and negated positive) terms from the sum of all positive (and negated negative) terms. This raw score is then normalized by the overall number of terms in each document to obtain a more comparable measure.
 

# Calculate sentiment scores
climate2$sent.raw <- (climate2$pos + climate2$neg.neg) - (climate2$neg + climate2$neg.pos) # See formula in Young/Soroka 2011
hist(climate2$sent.raw, main = "Distribution of raw sentiment score")

# Calculate length of each document (sum of all term frequencies in a document-feature matrix without stopwords)
climate2$termlength <- as.numeric(rowSums(dfm(climate2$text, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, verbose = TRUE))) 

# Normalize raw sentiment score to length
climate2$sent <- climate2$sent.raw/climate2$termlength 
hist(climate2$sent, main = "Distribution of normalized sentiment score")

 
Note that we should be cautious about interpreting sentiment scores on a ratio scale where the zero point indicates ‘true’ neutrality. Rather, the score should be assessed against the ‘baseline sentiment’ in the context of interest. To get such a reference value, the following steps calculate the sentiment scores for all full UNGA speeches since 1987 and then extract the mean sentiment and its confidence interval. Against this reference we can more meaningfully assess to what extent climate change is framed as a positive or negative issue in the United Nations General Assembly.
 

sent.gen <- dfm(qungd, dictionary = lsd, tolower = TRUE) # Dictionary counts in whole speeches
ungd$neg <- as.numeric(sent.gen[ ,1]) # Number of negative LSD terms
ungd$pos <- as.numeric(sent.gen[ ,2]) # Number of positive LSD terms 
ungd$neg.pos <- as.numeric(sent.gen[ ,3]) # Negated positive terms
ungd$neg.neg <- as.numeric(sent.gen[ ,4]) # Negated negative terms

ungd$sent.raw <- (ungd$pos + ungd$neg.neg) - (ungd$neg + ungd$neg.pos) # See Young/Soroka 2011

# Calculate length of each document (sum of all term frequencies in a document-feature matrix without stopwords)
ungd$termlength <- as.numeric(rowSums(dfm(qungd, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, verbose = TRUE))) 
# Normalize raw sentiment score to text length
ungd$sent <- ungd$sent.raw/ungd$termlength 

# Calculate mean value and its confidence bounds
average.sentiment <- mean(ungd$sent[ungd$year >= 1987]) # Reference value mean
average.sentiment.sd <- sd(ungd$sent[ungd$year >= 1987]) # Reference value standard deviation
error <- qnorm(0.995)*average.sentiment.sd/sqrt(nrow(ungd[ungd$year >= 1987, ])) # Margin of error for a two-sided 99% confidence interval
average.sentiment.upper <- average.sentiment + error # Confidence interval upper bound
average.sentiment.lower <- average.sentiment - error # Confidence interval lower bound
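
For a first impression (just a quick sketch, no full interpretation yet), you can already compare the mean sentiment of the climate windows against this general baseline:

# Quick comparison: is climate change talk more or less positive than UNGA speech in general?
mean(climate2$sent) # Mean normalized sentiment of the climate windows
c(lower = average.sentiment.lower, mean = average.sentiment, upper = average.sentiment.upper) # Baseline reference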