This and the following tutorials aim to encourage you to get started with text analysis on your own in the free and open-source R environment (ideally combined with RStudio or a comparable IDE).
They provide you with the data and the R code behind the exemplary analyses presented during the seminar.
Of course, each of these analyses requires much more theoretical guidance, careful validation, specification, and data cleaning. But starting from the code and the material provided here, you are cordially invited to tweak them further!
We will make use of a couple of add-on packages that provide the functionality we need later on. Before loading (or: attaching) them with the code snippet below, you will probably need to install them first using the corresponding R function install.packages("package name").
# Packages
library(tidyverse) # data management tools, includes stringr for text manipulation and ggplot for plotting
library(quanteda) # efficient implementations of many text-as-data functions
library(coefplot) # to extract coefficients from regression models
After making sure that you are operating in the correct working directory along the lines below, you are good to go.
# Working directory
# setwd("Your/Path")
For our running example, we resort to the collection of national delegate speeches in the United Nations General Assembly provided by Baturo, Dasandi, and Mikhaylov (2017). You can access the raw data here.
For the tutorial I have already assembled a corresponding R data frame and added ISO3C country codes. With the following steps we load these data (from your working directory!), add a count variable, and take a look at the data structure.
# Load UNGD text data
load("UNGD-corpus.Rdata")
ungd$count <- 1 # A simple count variable for aggregation purposes
str(ungd) # Get an overview of the dataset
## 'data.frame': 7898 obs. of 10 variables:
## $ year : int 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
## $ session : int 25 25 25 25 25 25 25 25 25 25 ...
## $ text : chr "33: May I first convey to our President the congratulations of the Albanian delegation on his election to the P"| __truncated__ "177.\t : It is a fortunate coincidence that precisely at a time when the United Nations is celebrating its firs"| __truncated__ "100.\t It is a pleasure for me to extend to you, Mr. President, the warmest congratulations of the Australia G"| __truncated__ "155.\t May I begin by expressing to Ambassador Hambro, on behalf of the delegation of Austria, our best wishes"| __truncated__ ...
## $ speaker : chr "Mr. NAS" "Mr. DE PABLO PARDO" "Mr. McMAHON" "Mr. KIRCHSCHLAEGER" ...
## $ speaker.type: chr NA NA NA NA ...
## $ org.language: chr "French " "Spanish " NA NA ...
## $ iso3c : chr "ALB" "ARG" "AUS" "AUT" ...
## $ country2 : chr "Albania" "Argentina" "Australia" "Austria" ...
## $ country : chr "Albania" "Argentina" "Australia" "Austria" ...
## $ count : num 1 1 1 1 1 1 1 1 1 1 ...
Before starting the analyses, we manipulate the raw text slightly. As noted in the seminar, such pre-processing steps will affect the output of all text-as-data analyses - so be conscious about what you do in this regard.
Note: These steps already use ‘regular expressions’. Regular expressions are extremely powerful tools for text manipulation and should definitely be in your toolkit. For an introduction to string manipulation in R, I can recommend this wikibook. If you want to get started, test your regular expressions on sample text: this online ‘sandbox’ is useful in this regard.
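To give you a first flavour, here is a minimal (entirely made-up) example: the pattern "[0-9]" matches any single digit, and the stringr functions from the tidyverse apply such patterns to character vectors.
sample.text <- "In 2015, 196 parties adopted the Paris Agreement." # a made-up sample sentence
gsub("[0-9]", "", sample.text) # removes all digits
str_detect(sample.text, "Paris|Kyoto") # TRUE: '|' matches either term
str_extract(sample.text, "[A-Z][a-z]+ Agreement") # extracts "Paris Agreement"
Now to the actual cleaning steps: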
ungd$text2 <- ungd$text # a copy of the original text
ungd$text2 <- gsub("[0-9]", "", ungd$text2, fixed = F) # remove numbers
ungd$text2 <- gsub("[[:punct:]]", "", ungd$text2, fixed = F) # remove punctuation
ungd$text2 <- str_trim(ungd$text2, side = "both") # remove white spaces left and right
ungd$text2 <- paste(" ", ungd$text2, " ", sep = "") # add exactly one white space left and right
ungd$text2 <- tolower(ungd$text2) # set everything to lower case
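To see what these steps actually did, compare a snippet of the raw and the cleaned version of the first speech:
substr(ungd$text[1], 1, 80) # the raw text
substr(ungd$text2[1], 1, 80) # the cleaned copy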
To identify whether a national delegate speaks about climate change at all, we establish a simple ‘dictionary’ that aims to find the terms ‘climate change’ and ‘global warming’ (including their hyphenated variants). We wrap these terms in a simple regular expression using the | operator, which denotes a logical ‘or’ in R syntax.
# A rudimentary dictionary to find the climate change issue
cc.markers <- "( climate( |-)change )|( global( |-)warming )"
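A quick sanity check on two made-up, padded strings shows how this expression behaves:
str_detect(" the climate change threat ", cc.markers) # TRUE: contains ' climate change '
str_detect(" climatic changes ", cc.markers) # FALSE: only the exact terms are matched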
We then count and store how often (and whether) each speech refers to climate change via one of our search terms.
# Count references to climate change along the markers
ungd$cc.count <- str_count(ungd$text2, cc.markers) # numerical: how often does our search expression occur in each speech?
ungd$cc.pres <- ungd$cc.count > 0 # logical: is there at least one reference?
sum(ungd$cc.pres) # Number of speeches with search expression present
## [1] 1781
(sum(ungd$cc.pres)/nrow(ungd))*100 # Share of speeches with search expression present
## [1] 22.55001
With this information, we can now exploit the ggplot package (part of the tidyverse) to visually analyse how strongly climate change figured in UNGA speeches over time.
Note: ggplot provides a system of plotting facilities that allows you to layer several pieces of information from your data on top of each other. Don’t get distracted by the rather large code chunk here - ggplot allows you to manipulate many details of the plot (and I may be a fetishist in this regard). Just try it out on data you are interested in!
ggplot(data = ungd, aes(x = year, y = as.integer(cc.pres)))+
stat_summary(geom = "line", fun.y = "mean", color = "darkblue", size = 2) + # note: mean value of the logical (0|1) variable equals the share
stat_summary(geom = "linerange", fun.data = "mean_cl_boot", color = "darkblue", size = .5) + # confidence intervalls per year
scale_x_continuous(breaks = seq(1970, 2015, 5), minor_breaks = seq(1970, 2017, 1))+
labs(title = "Climate change salience in UN General Assembly speeches over time",
subtitle = "Share of speeches with references to 'climate change' or 'global warming'",
x = "Year",
y = "")+
theme_bw()+
theme(axis.text.x = element_text(angle = 0, vjust = .5),
text=element_text(family = "serif"),
panel.border = element_blank(),
axis.line = element_line(colour = "black"),
plot.title =element_text(size=12, face='bold'))
# ggsave(file = "PL_CC-SalienceOverTime.png", width = 18, height = 12, units = "cm") # if you want to export the plot
Without claiming expertise in climate change politics, this pattern looks interesting. First, we note that climate change did not figure in UNGA speeches before 1987 (go ahead, try to find out which country mentioned it first!). It was then referred to in about 10 percent of UNGA speeches until 2006, when things apparently changed dramatically. Between 2007 and 2010, around three quarters of all UNGA speeches referred to climate change. This emphasis drops somewhat afterwards but rises to similar levels in and after 2015. So: what happened? (Hint: try to add the dates of the different UNFCCC COP meetings to the plot with geom_vline().)
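Picking up that hint, here is a sketch that layers three illustrative COP years (Kyoto 1997, Copenhagen 2009, Paris 2015) onto the plot we just drew:
# Add selected UNFCCC COP meetings as vertical reference lines (illustrative selection of years)
last_plot() +
geom_vline(xintercept = c(1997, 2009, 2015), linetype = "dotted", colour = "grey40")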
Of course, we want to know which countries speak about climate change in the General Assembly. To this end, we aggregate our dictionary counts in a new object countries and express them relative to the number of speeches each country has delivered.
# Sum of speeches and speeches with climate change reference by country
countries <- aggregate(cbind(ungd$count, ungd$cc.pres), by = list(ungd$country2, ungd$iso3c), FUN = sum)
names(countries) <- c("country", "iso3c", "speeches", "cc.pres")
countries$share <- (countries$cc.pres/countries$speeches) # The share of speeches with climate change references
countries <- countries[order(-countries$share), ] # Order data by share (descending)
From this object we extract the most and least frequent users of our search terms…
countries.extr <- countries[c(1:20, 179:198), ] # Most extreme shares (first and last 20 rows in ordered data)
countries.extr$group[1:20] <- "20 most frequent referrers"
countries.extr$group[21:40] <- "20 least frequent referrers"
countries.extr$group <-factor(countries.extr$group, levels = c("20 most frequent referrers", "20 least frequent referrers"))
countries.extr$share <- countries.extr$share * 100 # Express as percentage
countries.extr$totals <- paste("(", countries.extr$speeches, ") ", sep ="") # Label total number of speeches
… to then plot the share of their speeches that refer to climate change.
ggplot(data = countries.extr, aes(y=share, x = reorder(country, share)))+
geom_bar(stat= "identity", colour = "#0380b5", fill = "#0380b5", width = .7)+
geom_text(label = countries.extr$totals, aes(y = min(range(share))), hjust = 1, vjust = .3, family = "serif")+ # Layer total number of speeches to plot
labs(x = "",
y = "",
title = "Which countries speak about climate change?",
subtitle = "Percent of UN General Assembly speeches mentioning climate change (total # of speeches in brackets)")+
scale_y_continuous(breaks = seq(0,100,25), limits = c(0, 100))+
coord_flip()+
facet_wrap(vars(group), scales="free_y", ncol = 1, strip.position="right")+
theme_bw()+
theme(axis.text.x = element_text(size = 14, angle = 0, vjust = .5),
text=element_text(family = "serif"),
axis.line = element_line(colour = "black"),
plot.title =element_text(size=16, face='bold'),
plot.subtitle = element_text(size=14),
axis.text.y = element_text(size = 14, face = "bold", vjust = .3),
strip.text = element_text(face = "bold", size = 14))
# ggsave(file = "PL_CC-SalienceOverCountries.png", width = 30, height = 22, units = "cm") # Export to file
rm(countries.extr) # remove this object
One pattern is immediately apparent: among the top 20 countries speaking most frequently about climate change in the UNGA we find mostly island states - that is, states most strongly threatened by rising sea levels. But Switzerland, Vatican City, and the European Union also refer to climate change relatively frequently (though the latter has spoken only seven times in this data set).
Can you come up with explanations for the 20 countries that refer to climate change least frequently when addressing the General Assembly?
Information derived from text analyses is often not an end in itself, but feeds into broader research designs. For example, we could dig a little deeper into the question of which countries decide to refer to climate change in their international speeches. To showcase this, I have collected a small dataset cmeans on crude economic, geographic, and political indicators for individual countries (averaged over the 1987-2017 period). Here we load these data and merge them with our aggregated countries object using the iso3c country codes.
load("cmeans.Rdata") # Country data 1987-2017 averages, collected from the World Bank, Wikipedia, Polity, and Freedom House (crude!)
countries <- merge(countries, cmeans, by = "iso3c", all.x = T) # merge with the aggregated speech data
countries <- countries[complete.cases(countries), ] # Keep only countries with information on all variables
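A quick check that the merge and the listwise deletion behaved as expected (the model output below implies 191 complete cases):
nrow(countries) # should be 191 countries with complete information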
We can now use these indicators to ‘explain’ the share of speeches with climate change references. To this end, we estimate a (again: crude) linear regression. By scaling each of the variables, we arrive at standardized regression coefficients.
fit <- lm(scale(share)~scale(gdpc) + scale(co2emiss) + scale(fuel.exp) + scale(democracy) + scale(eq.dist) + scale(below5) + scale(island), data = countries)
summary(fit)
##
## Call:
## lm(formula = scale(share) ~ scale(gdpc) + scale(co2emiss) + scale(fuel.exp) +
## scale(democracy) + scale(eq.dist) + scale(below5) + scale(island),
## data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8806 -0.4259 -0.1049 0.2840 3.0818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.385e-16 5.524e-02 0.000 1.0000
## scale(gdpc) -2.871e-01 1.180e-01 -2.433 0.0159 *
## scale(co2emiss) 1.617e-01 1.186e-01 1.363 0.1745
## scale(fuel.exp) -5.144e-02 6.765e-02 -0.760 0.4480
## scale(democracy) 2.913e-01 6.841e-02 4.258 3.29e-05 ***
## scale(eq.dist) -8.005e-02 6.751e-02 -1.186 0.2372
## scale(below5) 3.072e-01 5.898e-02 5.208 5.10e-07 ***
## scale(island) 3.152e-01 6.460e-02 4.879 2.31e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7634 on 183 degrees of freedom
## Multiple R-squared: 0.4386, Adjusted R-squared: 0.4172
## F-statistic: 20.43 on 7 and 183 DF, p-value: < 2.2e-16
Let’s analyse the results of this estimation visually. We store the regression coefficients and their confidence intervals in a separate reg.data object, add human-readable variable names, and then plot them in ascending order of estimated effect size.
reg.data <- coefplot(fit, plot = FALSE) # Get coefficient data
reg.data <- reg.data[reg.data$Coefficient != "(Intercept)", ] # Uninformative in a standardized model
reg.data$name <- c("GDP per capita", "CO2 emissions per capita", "Fossil fuel exports", "Democracy", "Distance to equator", "Area below 5m (%)", "Island state") # Human readable names
reg.data <- reg.data[order(reg.data$Value), ] # Sort in ascending effect size
reg.data$name <- factor(reg.data$name, levels = reg.data$name)
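If you would rather avoid the coefplot dependency, a roughly equivalent data frame can be assembled with base R’s confint(). Note that coefplot’s default ‘inner’ and ‘outer’ bands are one and two standard errors; the sketch below sets explicit 95% and 99% levels to match the axis label of the plot that follows.
# A dependency-free alternative (a sketch)
ci95 <- confint(fit, level = .95)
ci99 <- confint(fit, level = .99)
reg.data.alt <- data.frame(Coefficient = rownames(ci95), Value = coef(fit),
LowInner = ci95[, 1], HighInner = ci95[, 2],
LowOuter = ci99[, 1], HighOuter = ci99[, 2])
reg.data.alt <- reg.data.alt[reg.data.alt$Coefficient != "(Intercept)", ] # drop the intercept as above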
ggplot(data = reg.data, aes(x=name, y = Value))+
geom_hline(yintercept = 0, linetype = "dashed", colour = "red", size = 1)+
geom_linerange(aes(ymin = LowOuter, ymax = HighOuter), size = .8, colour = "#0380b5")+
geom_pointrange(aes(ymin = LowInner, ymax = HighInner), size = 1.5, colour = "#0380b5")+
labs(title = "Which countries speak about climate change?",
subtitle = "Explaining the share of UNGA speeches with climate change references by some crude national-level variables",
y = "Standardized regression coefficient\nwith 95 and 99% confidence intervals\n",
x = "",
caption = "Linear Model, n = 191 countries, Adj. R2: .42.")+
coord_flip()+
theme_bw()+
theme(axis.text.x = element_text(size = 14, angle = 0, vjust = .5),
axis.title = element_text(size = 14),
axis.text.y = element_text(size = 14, face = "bold", vjust = .3),
text=element_text(family = "serif"),
axis.line = element_line(colour = "black"),
plot.title =element_text(size=16, face='bold'),
plot.subtitle = element_text(size=14),
plot.caption = element_text(size = 14))
# ggsave(file = "PL_CC-SalienceOverCountries_Model.png", width = 30, height = 18, units = "cm") # Export to file
An interesting starting point: this model already captures some 40 percent of the cross-sectional variation, and despite covering ‘only’ 191 country-level observations, some indicators exhibit surprisingly robust statistical effects.
As expected, being an island state increases the propensity to refer to climate change (by about .3 standard deviations). Non-island states with a high share of low-elevation land also tend to emphasise the issue more strongly in their international speeches. Affectedness seems to matter.
Being a democracy (i.e. having a Polity IV score above +6) is also associated with more climate change emphasis in the UNGA (maybe a civil society effect …). Surprisingly (at least if you think this is driven by cost considerations), CO2 emissions are positively associated with the relative frequency of climate change talk, though this effect is not fully robust. Being a fossil fuel exporter or residing further away from the equator, in contrast, is associated with speaking about climate change less frequently - but only in tendency; these effects are not robust by standard statistical conventions.
What seems to matter, however, is the level of economic development: all else equal, richer countries de-emphasise climate change in the UNGA (maybe another cost-based argument …).
Note that this is - of course - not a really good analysis: we have not specified a clear theory to guide us, the linear model specification is sub-optimal for these data, the quality of our independent variables is somewhat questionable, and we have averaged over a very long period. But you should see that the variables derived from a very simple text analysis already provide some leverage on substantively interesting questions. So, if you have better ideas, go ahead!
To round off our inductive endeavour, let’s have a quick look at what the country delegates speak about when they refer to climate change in the UNGA. To this end, we leverage the quanteda package, which offers a set of highly efficient functions for quantitative text analysis (mostly of the ‘bag-of-words’ type).
We first turn our vector of speech texts into a quanteda corpus object (a class of objects specific to the package) and store the year and the ISO3C country code as document-level variables in this object (columns 1 and 7 in our ungd data frame).
qungd <- corpus(ungd$text2, docvars = ungd[ , c(1,7)])
# summary(qungd)
Based on this representation of the speech texts, we can now extract a ‘window’ of text around the climate change references using quanteda’s kwic (keyword-in-context) function. Here we somewhat arbitrarily use 100 terms (50 left and right) around each reference, but other (probably: more guided) approaches to ‘unitizing’ are possible (think of sentences, paragraphs, etc.).
climate_kw <- kwic(qungd, phrase(c("global warming", "climate change", "global-warming", "climate-change")), valuetype = "fixed", case_insensitive = T, window = 50)
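As an aside on ‘unitizing’: quanteda’s corpus_reshape() can split a corpus into sentence units instead of fixed word windows. A sketch (note that sentence splitting needs punctuation, which we removed from text2 above, so it would have to start from the original text):
# qungd.sent <- corpus_reshape(corpus(ungd$text, docvars = ungd[ , c(1,7)]), to = "sentences")
We stick with the word windows here.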
We then store these ‘documents’ in a separate data frame, clean it up a little bit…
climate <- data.frame(climate_kw)
climate$text <- paste(" ", climate$pre, " ", climate$post, " ", sep = "") # Combining the text around the markers (without markers themselves)
climate$text <- str_replace_all(climate$text, "(global( |-)warming)|(climate( |-)change)", " ") # Remove remaining markers (KWIC overlap cases)
climate$pre <- climate$post <- climate$keyword <- climate$from <- climate$to <- NULL # Drop variables no longer needed
docids <- docvars(qungd) # Extract document variables from underlying quanteda corpus
docids$docname <- row.names(docids) # document identifiers in corpus
climate <- merge(climate, docids, by = "docname", all.x = T) # Merge
climate$docname <- NULL
… to then aggregate the texts to the country-year level …
climate <- aggregate(text ~ year+iso3c, paste, collapse = " ", data = climate)
… add some identifier variables for comparative purposes…
climate$period <- ifelse(climate$year <= 2006, "Until 2006", "After 2006") # Marking the break observed in the timeline above
climate <- merge(climate, cmeans[, c("island", "iso3c")], by = "iso3c") # Marker for island states from the 'cmeans' data above
climate$island[is.na(climate$island)] <- FALSE # declare countries with no information on the island variable as non-island states
climate$island <- ifelse(climate$island, "Island states", "Other countries") # Nicer labels
… and store this as a quanteda corpus as well.
climatecorp <- corpus(climate$text, docvars = climate[, c(1:2, 4:5)])
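As a quick check, you can cross-tabulate the two comparison variables in the aggregated data:
table(climate$period, climate$island) # how the country-year documents distribute over the groups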
We now turn this into a document-feature matrix (the literal ‘bag of words’ as shown in the seminar). Note that some additional text pre-processing happens (intentionally) in this step: we remove very common English words bearing little substantial content (‘stopwords’), certain irregular ‘features’, and ‘stem’ all words to common roots.
quanteda_options(language_stemmer = "english")
climatedfm <- dfm(climatecorp, remove = stopwords("english"), remove_symbols = TRUE, stem = TRUE, verbose = TRUE)
climatedfm <- dfm_select(climatedfm, pattern = "â", selection = "remove", verbose = T)
climatedfm <- dfm_select(climatedfm, pattern = "s", selection = "remove", verbose = T)
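Before plotting, quanteda’s topfeatures() gives you a quick numerical view of the most frequent stems:
topfeatures(climatedfm, 20) # the 20 most frequent features and their counts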
So, what are they talking about when referring to climate change? Let’s look at the 200 most frequent words.
# png("PL_Ttop200terms.png", width = 1200, height = 1200, units = "px") # if you want to export
textplot_wordcloud(climatedfm, max_words = 200, adjust = 0, min_size = 0.1, max_size = 7.5, rotation = 0.1,
color = rev(RColorBrewer::brewer.pal(10, "RdBu")))
# dev.off() # if you want to export
Wordclouds are not a particularly good analytical tool (there are smarter methods based on term co-occurrence and clustering waiting for you!). But for a start, we can gain an impression of what states speak about in the immediate context of climate change references. [Your interpretation here].
Term frequencies become more informative if we compare them across (theoretically) meaningful categories. To showcase this, and picking up on some of the earlier insights, we distinguish island from non-island states as well as the periods before and after 2006.
With quanteda, the approach is quite straightforward: we group our existing document-feature matrix along the respective document variables and then plot a comparative wordcloud.
climatedfm2 <- dfm_group(climatedfm, groups = "period") # Collapse documents in matrix to one before and after 2006
# png("PL_termsBeforeAfter2006.png", width = 1200, height = 1200, units = "px")
textplot_wordcloud(climatedfm2, max_words = 200, adjust = 0, min_size = 0.25, max_size = 4.9, rotation = 0.1, comparison = T,
color = c("#9e3173", "#0380b5"), labelsize = 4)
# dev.off()
This gives you the 200 terms that were most frequent in one period but not in the other. One might interpret that the Kyoto Protocol (in force since 2005) and the debates on follow-up agreements are reflected here. Not only because of the literal references, but also because before 2006 terms referring to climate change threats were dominant (e.g. ‘sealevel’, ‘rise’, ‘problem’, ‘depletion’), while after 2006 mitigation measures seem to have figured more prominently (‘response’, ‘impact’, ‘mitigation’, ‘goal’, ‘agreement’). But who am I to judge … a clear theory would help here, too, but the potential of comparing contextual term frequencies should be clear. If not, look at what island states say compared to others:
climatedfm2 <- dfm_group(climatedfm, groups = "island") # Collapse documents in matrix to one of island and non-island states
# png("PL_TermsIsland.png", width = 1200, height = 1200, units = "px")
textplot_wordcloud(climatedfm2, max_words = 200, adjust = 0, min_size = 0.4, max_size = 4.9, rotation = 0.1, comparison = T,
color = c("#9e3173", "#0380b5"), labelsize = 4)
# dev.off()
Add your interpretation here or head over to the tutorial for session 3 in which we apply text sentiment analyses to these text data.