Introduction

 
In the preceding tutorial you have seen that text documents can be scaled along the relative frequencies of the words they contain. However, the Wordfish algorithm left us with one and only one dimension which, in addition, can only be interpreted ex post. In many research settings, though, you will be interested in a very specific conflict dimension and correspondingly targeted analyses. This tutorial therefore looks at a scaling approach that is also based on relative term frequencies but offers more a priori control.
 
 

Prepare R session and data

 
By now, you know the drill.
 

# Packages
library(tidyverse) # data management tools, includes stringr for text manipulation and ggplot for plotting
library(quanteda) # efficient implementations of many text-as-data functions
library(coefplot) # to extract coefficients from regression models
library(knitr) # For neat visual output
library(utf8) #  For encoding transformations

# Working directory
# setwd("Your/Path")

# Load UNGD text data
load("UNGD-corpus.Rdata") 

 
And once more, we extract the text around climate change references with the quanteda package, following the steps introduced in the earlier tutorials.  

# A quanteda corpus with year (column 1), ISO country code (7) and country name (9) as document variables
qungd <- corpus(ungd$text, docvars = ungd[ , c(1,7,9)]) 

# Focus on climate change and global warming - 100 term windows around our markers
climate_kw <- kwic(qungd, phrase(c("global warming", "climate change", "global-warming", "climate-change")), valuetype = "fixed", case_insensitive = T, window = 50)

# Clean resulting texts
climate <- data.frame(climate_kw)
climate$text <- paste(" ", climate$pre, " ", climate$post, " ", sep = "") # Combining the text around the markers (without markers themselves)
climate$text <- str_replace_all(climate$text, "(global warming)|(climate change)", " ") # Remove remaining markers (KWIC overlap cases)
climate$pre <- climate$post <- climate$keyword <- climate$from <- climate$to <- NULL

# Add document variables
docids <- docvars(qungd) # From underlying corpus
docids$docname <- row.names(docids) # Document identifiers in corpus
climate <- merge(climate, docids, by = "docname", all.x = T) # Merge
climate$docname <- NULL

# Aggregate to year and country
climate2 <- aggregate(text ~ year+country+iso3c, paste, collapse = " ", data = climate)

# Turn this into a separate corpus
climatecorp <- corpus(climate2$text, docvars = climate2[, c(1:3)])

 
 

Wordscores: Supervised scaling

 

Wordscores - introduced by Laver, Benoit, and Garry (2003) - was also primarily used to study party positions, but the basic idea is very flexible: Wordscores initially requires some documents (reference texts) for which the positions on the latent dimension are by and large known. In party politics, for example, you could supply manifestos from the extreme right, some centrist parties, and the extreme left, coded with -1, 0, and +1 respectively, to span the left-right dimension.
The Wordscores algorithm then ‘fits’ the observed term frequencies to the known reference scores. Expressed simply, you can imagine this as a regression in which the relative frequencies of individual terms serve as independent variables that predict the reference score of a document. The ‘coefficients’ of this estimation are then used as word weights (the actual word scores) to scale other documents with so far unknown positions on the latent dimension. In other words, the ‘virgin’ documents are placed within the range of the reference texts based on their relative word frequencies.
 
As such, Wordscores is actually an example of the supervised machine learning approaches we talked about in the seminar. If you want to get into the details of the two estimation steps, Lowe (2008) provides a good overview.
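
To make these two estimation steps more concrete, here is a deliberately tiny sketch with two made-up reference documents and two terms. It only illustrates the logic described by Laver, Benoit, and Garry (2003) with invented frequencies - it is not quanteda’s implementation, which we will use below.

# Toy illustration of the two Wordscores steps (made-up frequencies, not real data)

# Term counts in two reference documents with known positions
ref_counts <- matrix(c(10, 2,     # "alarmism"
                        1, 12),   # "sustainable"
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("alarmism", "sustainable"),
                                     c("sceptic_ref", "activist_ref")))
ref_scores <- c(sceptic_ref = -1, activist_ref = 1) # A priori reference scores

# Step 1: score each word by the reference scores, weighted by how strongly the word points to each reference text
F_wr <- sweep(ref_counts, 2, colSums(ref_counts), "/") # Relative term frequencies per reference text
P_wr <- F_wr / rowSums(F_wr)                           # Probability of each reference text given the word
word_scores <- P_wr %*% ref_scores                     # The actual word scores

# Step 2: score a 'virgin' document as the frequency-weighted mean of its word scores
virgin_counts <- c(alarmism = 3, sustainable = 1)
F_wv <- virgin_counts / sum(virgin_counts)
sum(F_wv * word_scores[names(virgin_counts), ])        # Lands between -1 and +1, here closer to the sceptic pole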
 

Reference texts for climate change activism/scepticism

 
Wordscores can thus be particularly useful if you have already ‘manually’ coded parts of the documents in your corpus along the theoretically relevant positions. ‘Amplifying’ your interpretation to a larger set of documents is then at your fingertips.
 
Admittedly, though, I didn’t have the time to meaningfully code delegate speeches for this tutorial. To nevertheless showcase the procedure, I had the idea to scale climate change talk in the UNGA on a dimension between climate change sceptics (or deniers) on the one hand, and activists fighting against climate change on the other.
 
Thus I turned to two websites where it is plausible to assume extreme positions on this dimension. On the one hand, this is the Heartland Institute - a US-based ‘think tank’, financed amongst others by Exxon and the Koch group, that lobbies hard against climate change mitigation and questions it in principle. On the other hand, it is The Ecologist, a magazine lobbying for sustainability and action on global warming in the ‘post-industrial age’.
 
From these websites I have scraped all climate-change related news (if you want to know how to collect online information into R, this book gets you started for sure) to collect reference texts for climate-sceptic and climate-activist language.
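
Just to indicate what such a scraping step can look like in principle, here is a minimal sketch with the rvest package - the URL and the CSS selector are purely hypothetical placeholders and not those of the sites actually used for the corpora below.

library(rvest) # For reading and parsing web pages

# Hypothetical example: URL and CSS selector are placeholders only
page  <- read_html("https://www.example.org/news/climate")
texts <- page %>% html_elements("div.article-body") %>% html_text2()
head(texts)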
 

Climate sceptics

 
In the following steps we load the data of climate-change related news from the Heartland Institute (n = 1,322 documents). Before using these texts, I clean out some idiosyncrasies stemming from the US focus of the outlet and the names of its formats and authors. We do not want these terms to influence our scaling later. Again, pre-processing has to be done consciously and should be well documented. Finally, we assign a reference score of -1 to these texts.
 

load("HeartlandCorpus.Rdata") # Heartland Institute - Climate change related news
heartland$text <- paste(heartland$title, heartland$lead, heartland$body, sep = " ")

# Clean out US focus
heartland$text <- gsub("State|Congress|Senate|Washington|President|EPA|federal|United States", " ", heartland$text, fixed = F) 
heartland$text <- gsub("Gov.", " ", heartland$text, fixed = T)
heartland$text <- gsub("Sen.", " ", heartland$text, fixed = T)
states <- read.csv2(file = "US-states.csv") # List of US states
states <- paste(states[,1], collapse="|")
heartland$text <- gsub(states, " ", heartland$text, fixed = F)

# Clean out publication idiosyncrasies (authors, publication titles)
heartland$text <- gsub("Heartland Institute", " ", heartland$text, fixed = T)
heartland$text <- gsub("THE HEARTLAND INSTITUTE", " ", heartland$text, fixed = T)
heartland$text <- gsub("Heartland", " ", heartland$text, fixed = T)
heartland$text <- gsub("Environment & Climate News", " ", heartland$text, fixed = T)
heartland$text <- gsub("ENVIRONMENT & CLIMATE NEWS", " ", heartland$text, fixed = T)
heartland$text <- gsub("Environment and Climate News", " ", heartland$text, fixed = T)
heartland$text <- gsub("Environment Climate News", " ", heartland$text, fixed = T)
heartland$text <- gsub("Climate Change Weekly #", " ", heartland$text, fixed = T)
heartland$text <- gsub("PRESS RELEASE: ", " ", heartland$text, fixed = T)
heartland$text <- gsub("Sterling Burnett", " ", heartland$text, fixed = T)
heartland$text <- gsub("James Taylor", " ", heartland$text, fixed = T)
heartland$text <- gsub("Fred Singer", " ", heartland$text, fixed = T)
heartland$text <- gsub("Nikki Comerford", " ", heartland$text, fixed = T)
heartland$text <- gsub("Bonner Cohen", " ", heartland$text, fixed = T)
heartland$text <- gsub("Bonner R. Cohen", " ", heartland$text, fixed = T)
heartland$text <- gsub("senior fellow", " ", heartland$text, fixed = T)
heartland$text <- gsub("Ph.D.", "", heartland$text, fixed = T)
heartland$text <- gsub(" et(\\.){0,1} al(\\.){0,1} ", " ", heartland$text, fixed = T)

# Collapse to monthly obs
heartland$month <- str_replace(heartland$date, " [0-9]{1,2}, ", " ")
heartland$year <- str_extract(heartland$month, "[0-9]{4}")
sceptictext <- aggregate(text~month, data = heartland, paste, collapse = " ")

# Annotate
sceptictext$group <- "Sceptics"
sceptictext$score <- -1 # Reference score 
sceptictext$month <- NULL

 

Climate activists

 
Here we similarly load the reference texts for climate activism from The Ecologist (n = 1,852 documents) and clean out some idiosyncrasies.
 

load("./ClimateRef2/EcologistCorpus.Rdata")

# Clean idiosyncrasies
ecologist$text <- gsub("The Ecologist", " ", ecologist$text, ignore.case = T) # Publication specifics
ecologist$text <- gsub("Ecologist", " ", ecologist$text, ignore.case = T) # Publication specifics

ecologist$text <- gsub("UK|Great Britain|London|Heathrow Airport", "", ecologist$text, fixed = F) # UK focus

# Collapse to month
ecologist$month <- str_extract(ecologist$article.links, "[0-9]{4}/[a-z]{3}")
activisttext <- aggregate(text~month, data = ecologist, paste, collapse = " ") # Pool all texts from one month

# Annotate

activisttext$group <- "Activists"
activisttext$score <- 1

activisttext$month <- NULL

 

Combine and clean

 
Here we combine both corpora and apply some additional text cleaning steps.
Note that all these cleaning steps point to a typical pitfall of scaling approaches: the domain or style of language has to match as closely as possible if we do not want to just ‘measure’ stylistic differences (British and American spellings provide a telling example here).
 

rawtext <- rbind(activisttext, sceptictext)

rawtext$text <- as_utf8(rawtext$text, normalize = T)
rawtext$text <- utf8_normalize(rawtext$text, map_quote = T)
rawtext$text <- gsub("[a-z]*?@[a-z\\.]*?", " ", rawtext$text, fixed = F) # E-mails
rawtext$text <- gsub("[a-z]*?\\.(org|com)", " ", rawtext$text, fixed = F) # URLs
rawtext$text <- gsub(" u(\\.){0,1}k(\\.){0,1} ", " ", rawtext$text, ignore.case = T)
rawtext$text <- gsub(" u(\\.){0,1}s(\\.){0,1} ", " ", rawtext$text, ignore.case = T)
rawtext$text <- gsub("([a-z]*?)(\\.)([a-z]*?)", "\\1\\2 \\3 ", rawtext$text, fixed = F) # Paste errors

rawtext$text <- gsub("percent", "per cent", rawtext$text, fixed = T) # Harmonize different spellings
rawtext$text <- gsub("sceptic", "skeptic", rawtext$text, fixed = T) # Harmonize different spellings
rawtext$text <- gsub("centre", "center", rawtext$text, fixed = T) # Harmonize different spellings
rawtext$text <- gsub("organiz", "organis", rawtext$text, fixed = T) # Harmonize different spellings
rawtext$text <- gsub("programme", "program", rawtext$text, fixed = T) # Harmonize different spellings
rawtext$text <- gsub("labour", "labor", rawtext$text, fixed = T) # Harmonize different spellings
rawtext$text <- gsub("(\')s", "", rawtext$text, fixed = F) # Possesive s

rawtext$text <- gsub("Donald Trump|Barack Obama|Scott Pruitt", " ", rawtext$text, fixed = F) # Heartland US focus
rawtext$text <- gsub("Trump|Obama|Pruitt", " ", rawtext$text, fixed = F)

 
Then we use quanteda to turn our raw reference texts into a corpus object and then into a document-frequency matrix.
 

# Corpus ####
corp <- corpus(rawtext$text, docvars = rawtext[, c(2:3)])

# DFM ####
quanteda_options(language_stemmer = "english")
mat <- dfm(corp, remove = stopwords("english"), tolower = TRUE, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, remove_url = T, stem = F, verbose = TRUE)

 
Some further cleaning …  

# Remove single-character features
features <- as.data.frame(mat@Dimnames$features, stringsAsFactors = FALSE) # Extract all features in the dfm
features$length <- nchar(features[,1]) # Store number of characters per feature
features <- features[features$length <= 1, ] # Keep only single-character features
mat <- dfm_select(mat, pattern = features[,1], selection = "remove", valuetype= "fixed", verbose = T) # Remove them from the dfm
rm(features) # No longer needed

# Remove terms with numbers in them
mat <- dfm_select(mat, pattern = c(".*[0-9].*"), selection = "remove", valuetype= "regex", verbose = T)
mat <- dfm_select(mat, pattern = c("et"), selection = "remove", valuetype= "fixed", verbose = T) # More scientific refernces in Heartland texts
mat <- dfm_select(mat, pattern = c("al"), selection = "remove", valuetype= "fixed", verbose = T)

 
And now we can have a first glimpse at the most frequent terms in these reference documents - using the tag clouds introduced in Tutorial 1.  

# Frequent words in reference texts
textplot_wordcloud(mat, min_size = 1)

# Divisive words
mat.group <- dfm_group(mat, groups = "group")# Group dfm by activist/sceptic type (stored in docvar 'group')
textplot_wordcloud(mat.group, min_size = 2, max_size = 4, max_words = 300, comparison = T, color = c("#0380b5", "#9e3173"))

 
Apparently, climate-change sceptics talk much more about science - pointing to a well-known strategy of building a counter-epistemic authority (see, e.g., the NIPCC, co-financed by the Heartland Institute). Activists, in turn, speak much more about forests, communities, and action, for example. The Wordscores model should pick out similar language differences.  
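
If you prefer a tabular view to complement the word clouds, quanteda’s textstat_frequency() lists the most frequent terms per group. A small sketch (note that in quanteda version 3 and later this function has moved to the separate quanteda.textstats package):

# Most frequent terms per reference group - a tabular complement to the word clouds
freq <- textstat_frequency(mat, n = 20, groups = docvars(mat, "group"))
head(freq[freq$group == "Sceptics", ], 10)
head(freq[freq$group == "Activists", ], 10)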
 

Train and analyse the Wordscores model

 
We use quanteda’s textmodel_wordscores function to train the model, telling it to find the reference scores to be predicted in the document variable ‘score’ (here extracted from the underlying corpus corp) and to fit them to the document-frequency matrix mat we have created above. We store the results of this first estimation step in the object ws.  

ws <- textmodel_wordscores(mat, y = docvars(corp, "score"), smooth = 1)

 
From this object we can now extract the weights of individual terms - again an invaluable information source for validation efforts. Never believe your algorithms blindly.
 

ws.terms <- as.data.frame(ws$wordscores, stringsAsFactors = F) # extract weights
names(ws.terms) <- "wordscores"
ws.terms$words <- row.names(ws.terms) # extract corresponding words
ws.terms <- ws.terms[order(-ws.terms$wordscores), ] # order by weight
row.names(ws.terms) <- seq(1,nrow(ws.terms),1) # Running counter as row names

 
It is probably worth sifting through this data frame via View(ws.terms) to get a feeling for the estimation results. To make this more manageable for now, the following steps extract the terms with the most extreme word weights plus some in the middle (50 each, building on the sort order of the ws.terms object). Then we plot them with the help of ggplot.
 

ws.ex <- rbind(ws.terms[1:50, ], ws.terms[38233:38282, ], ws.terms[(nrow(ws.terms)-49):nrow(ws.terms), ])
ws.ex$run <- rep(seq(1, 50, 1), 3) # 1:50 sequence to place the terms neatly on the plot 

ggplot(data = ws.ex, aes(x=wordscores, y = -1*run))+
  geom_text(aes(label=words, colour = wordscores - mean(ws.terms$wordscores)), size = 4)+
  annotate("text", y = 0.5, x = 0.2537138, label = "<- Skeptics  |  Activists ->", size = 4, hjust = .5)+
  scale_colour_gradient(low = "red", high = "forestgreen",
                        guide = "colourbar", aesthetics = "colour") +
  labs(title = "Example of a Wordscores model\nIdentify language that separates climate change skeptics and activists",
       caption = "Reference texts to train the WS algorithm are climate-change related news from two outlets:\n1) 'Heartland Institute' (Skeptics, reference score = -1, n = 1,322)\n2) 'The Ecologist' (Activists, reference score = 1, n= 1,851)\n\nPlot shows the terms that are closest to the minimum, mean, and maximum of the estimated score distribution\n(61,675 terms scored in total)",
       y="", 
       x= "Wordscores")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 0, vjust = .5),
        text=element_text(family = "serif"),
        axis.line = element_line(colour = "black"),
        plot.title = element_text(size=14, face='bold'),
        plot.caption =  element_text(size=12, hjust= 0),
        axis.text.y = element_blank(),
        strip.text = element_text(face = "bold", size = 10),
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.line.x = element_line(colour = "black"),
        axis.line.y = element_blank(),
        axis.ticks.y=element_blank())

ggsave("PL_WordscoresTerms.png", height = 20, width = 24, units = "cm")

 
Again we see several literal pointers to ‘science’ in the terms that load high on the climate-sceptic side. But term-level pointers to possibly constraining mitigation tools (‘regulations’, ‘restriction’, ‘tax’) also figure among the 50 words estimated to be most climate-sceptic. And scepticism itself, expressed in terms such as ‘assertions’, ‘claims’ and especially ‘alarmism’, ‘alarmist’, and ‘alarmists’, pushes a text strongly towards the climate-sceptic side of things. This appears rather plausible.
On the activist side, terms like ‘deforestation’, ‘sustainable’, ‘biodiversity’, ‘ecological’ and many other terms related to environmental protection load most strongly. This is also plausible, though the distribution of terms suggests that climate-change sceptics use a somewhat more idiosyncratic language.
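
If you want to check the estimated weight of any individual term mentioned here (or any other term you are curious about), a quick lookup in the ws.terms data frame does the trick:

# Look up the weights of individual terms for validation
ws.terms[ws.terms$words %in% c("alarmism", "deforestation", "sustainable"), ]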
 
So is this a good model to scale what national delegates in the UNGA say about climate change?  
 

Scale climate change talk in the UNGA

 
To see this, we first have to prepare the ‘virgin’ texts which we had stored in the climatecorp object above. We turn it into a DFM and again pool all speeches into one document by country (earlier disclaimers apply…). Note that the pre-processing of the virgin texts has to match the pre-processing of the texts on which the Wordscores model was trained. If we do not set everything to lower case here, for example, the computer does not know that ‘Alarmist’ and ‘alarmist’ are the same word. Once more: pre-processing is important!
 

scoredfm <- dfm(climatecorp, remove = stopwords("english"), tolower = TRUE, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, remove_url = T, stem = F, verbose = TRUE) # stem = F to match the pre-processing of the reference texts

scoredfm <- dfm_group(scoredfm, groups = "country")

 
Finally, we can predict the UNGA document positions from the Wordscores model we have estimated. We do so here and calculate 95% confidence intervals from the standard errors that quanteda’s predict function supplies. Then we add the country identifiers (from the docvars in the DFM) to these data.
 

# Predict document positions and confidence intervals and store them in new data frame
ws.predict <- as.data.frame(predict(ws, se.fit = TRUE, newdata = scoredfm))
names(ws.predict) <- paste("ws.", names(ws.predict), sep ="")
ws.predict$ws.error <- qnorm(0.975)*ws.predict$ws.se.fit # 95% confidence error (two-sided, hence the 0.975 quantile)
ws.predict$ws.lo <- ws.predict$ws.fit - ws.predict$ws.error # lower c.i. bound
ws.predict$ws.hi <- ws.predict$ws.fit + ws.predict$ws.error # upper c.i. bound


docvars <- scoredfm@docvars # Extract document vars
ws.predict <- cbind(ws.predict, docvars) # Combine with Wordscores predictions
ws.predict <- ws.predict[order(ws.predict$ws.fit), ] # Sort data by Wordscores position

 
As in the earlier examples, we use ggplot to look at the 25 most extreme countries on both ends of the latent scale we have tried to capture.
 

ws.ex <- rbind(ws.predict[1:25, ], ws.predict[170:194, ])

ws.ex$group <- NA
ws.ex$group[1:25] <- "Closest to Climate Sceptic Language"
ws.ex$group[26:50] <- "Closest to Climate Activist Language"

ggplot(data = ws.ex, aes(x = reorder(country, ws.fit), y = ws.fit))+
  geom_pointrange(aes(ymin = ws.lo, ymax = ws.hi, colour = ws.fit - mean(ws.predict$ws.fit)))+
  scale_colour_gradient(low = "red", high = "forestgreen",
                        guide = "colourbar", aesthetics = "colour", name = "Distance from average climate change sentiment: ")+
  labs(title = "Wordscores positions of climate change related speeches in UN General Assembly",
       subtitle = "Wordscores estimates of 100-term windows around 'climate change' / 'global warming' references pooled by country",
       x = "",
       y = "")+
  coord_flip()+
  facet_wrap(vars(group), scales="free_y", ncol = 1, strip.position="right")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 0, vjust = .5),
        text=element_text(family = "serif"),
        axis.line = element_line(colour = "black"),
        plot.title =element_text(size=12, face='bold'),
        axis.text.y = element_text(size = 10, face = "bold"),
        strip.text = element_text(face = "bold", size = 10),
        legend.position = "none")

ggsave(file = "PL_WordscoresAcrossSelectedCountries.png", width = 24, height = 18, units = "cm")

 
A first observation is that the pattern is much less clear than in the earlier sentiment and especially the Wordfish analyses. For example, this approach does not single out the island states as neatly - we find them on both ends of the retrieved spectrum. Of course, we may be measuring a different dimension now or … the model simply doesn’t work well here …
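
To check the island-state observation yourself, you can look up individual countries in ws.predict - the names below are just examples, adjust them to the country labels used in your data.

# Positions of selected island states (example names - adjust to your country labels)
ws.predict[ws.predict$country %in% c("Maldives", "Tuvalu", "Fiji", "Malta"),
           c("country", "ws.fit", "ws.lo", "ws.hi")]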
 
A second observation is that the scale on the x-axis is very limited. Recall that our reference texts spanned a dimension ranging from -1 (climate sceptic) to +1 (climate activist). Wordscores places national delegates in the UNGA only in a range between about .25 and .32, in contrast. In other words, national delegates seem to be closer to the language of climate activists on average, and the terms that differentiate between ‘The Ecologist’ and ‘The Heartland Institute’ do not seem to figure prominently in the UNGA speeches. Can you guess why?
 
 
Note: Our research design might have violated the fundamental assumption that both reference and virgin texts come from the same language domain. The language used in the public communications of lobby groups may be fundamentally different from the lexicon national delegates tend to use in the UNGA!
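
If you mainly want to see the compression of the scale more clearly, you can also re-predict with one of the rescaling options that quanteda’s predict method offers for Wordscores models - for example the Martin/Vanberg (‘mv’) transformation, which stretches the virgin scores back onto the reference metric. Treat this as a diagnostic only; it does not solve the domain-mismatch problem just described.

# Re-predict with Martin/Vanberg rescaling to stretch the scores back onto the -1/+1 metric
# (alternatively, rescaling = "lbg" implements the original LBG transformation)
ws.mv <- predict(ws, newdata = scoredfm, rescaling = "mv")
range(ws.mv)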
 
A third observation is that we face rather large confidence intervals. On the one hand, some states spoke very rarely about climate change, as we learned earlier. With less term-level information, however, scaling estimates will also be less precise. On the other hand, countries may have changed their language (and positions) over time while we have pooled all their climate change talk…  
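
A quick way to check the first point is to relate the amount of climate change text available per country to the uncertainty of its estimate - a rough sketch that uses the token counts of the grouped DFM as a proxy for the available term-level information.

# Relate available text per country to the standard error of its Wordscores estimate
tokencount <- data.frame(country = docvars(scoredfm, "country"),
                         ntokens = ntoken(scoredfm))
check <- merge(ws.predict, tokencount, by = "country")
cor(check$ntokens, check$ws.se.fit) # Expect a negative correlation: more text, smaller standard error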
 
To get another perspective on the validity of the retrieved dimension we apply the crude regression-based approach we have employed earlier.

 

load("cmeans.Rdata") # Country data 1987-2017 averages, collected from World Bank, Wikipedia, Polity, and Freedom House (crude!)
countries <- merge(ws.predict, cmeans, by = "iso3c", all.x = T) # Merge country-level data with the Wordscores predictions
countries <- countries[complete.cases(countries), ] # Keep only countries with information on all variables

# Regression model
fit <- lm(scale(ws.fit)~scale(gdpc) + scale(co2emiss) + scale(fuel.exp) + scale(democracy) + scale(eq.dist) + scale(below5) + scale(island), data = countries)
summary(fit) 
## 
## Call:
## lm(formula = scale(ws.fit) ~ scale(gdpc) + scale(co2emiss) + 
##     scale(fuel.exp) + scale(democracy) + scale(eq.dist) + scale(below5) + 
##     scale(island), data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.70725 -0.61906 -0.03245  0.57385  3.02092 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      -1.201e-16  7.162e-02   0.000   1.0000  
## scale(gdpc)       3.070e-01  1.530e-01   2.007   0.0462 *
## scale(co2emiss)  -3.574e-01  1.538e-01  -2.324   0.0212 *
## scale(fuel.exp)   1.575e-02  8.770e-02   0.180   0.8577  
## scale(democracy)  1.345e-01  8.869e-02   1.517   0.1310  
## scale(eq.dist)   -1.045e-01  8.752e-02  -1.194   0.2338  
## scale(below5)     2.954e-02  7.646e-02   0.386   0.6997  
## scale(island)    -1.500e-01  8.375e-02  -1.791   0.0750 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9898 on 183 degrees of freedom
## Multiple R-squared:  0.05647,    Adjusted R-squared:  0.02038 
## F-statistic: 1.565 on 7 and 183 DF,  p-value: 0.1485
# Coefficient data
reg.data <- coefplot(fit, plot = FALSE) # Get coefficient data
reg.data <- reg.data[reg.data$Coefficient != "(Intercept)", ] # Uninformative in a standardized model
reg.data$name <- c("GDP per capita", "CO2 emissions per capita", "Fossil fuel exports", "Democracy", "Distance to equator", "Area below 5m (%)", "Island state") # Human readable names
reg.data <- reg.data[order(reg.data$Value), ] # Sort in ascending effect size
reg.data$name <- factor(reg.data$name, levels = reg.data$name)

# And plot
ggplot(data = reg.data, aes(x=name, y = Value))+
  geom_hline(yintercept = 0, linetype = "dashed", colour = "red", size = 1)+
  geom_linerange(aes(ymin = LowOuter, ymax = HighOuter), size = .8, colour = "#0380b5")+
  geom_pointrange(aes(ymin = LowInner, ymax = HighInner), size = 1.5, colour = "#0380b5")+
  labs(title = "How do states position themselves on climate change?",
       subtitle = "Explaining the estimated Wordscores position (skeptics/activists) around climate change references by some crude national-level variables",
       y = "Standardized regression coefficient\nwith 95 and 99% confidence intervals\n",
       x = "",
       caption = "Linear Model, n = 191 countries, Adj. R2: .02.")+
  coord_flip()+
  theme_bw()+
  theme(axis.text.x = element_text(size = 14, angle = 0, vjust = .5),
        axis.title = element_text(size = 14),
        axis.text.y = element_text(size = 14, face = "bold", vjust = .3),
        text=element_text(family = "serif"),
        axis.line = element_line(colour = "black"),
        plot.title =element_text(size=16, face='bold'),
        plot.subtitle = element_text(size=14),
        plot.caption = element_text(size = 14))

ggsave(file = "PL_WordscoresOverCountries_Model.png", width = 30, height = 18, units = "cm") # Export to file

 
This exercise initially suggests that the language of richer countries tends towards the climate-activist end, while countries with higher CO2 emissions tend to use language more similar to the climate sceptics. This is not entirely implausible. Yet note that this model is not only very crude but also fits the data very badly. We ‘explain’ only 2 percent of the variation on the limited range of positions estimated by our Wordscores model.
 
In sum, this particular application of the Wordscores algorithm did not really result in a good measure of the climate change positions expressed in speeches to the United Nations General Assembly - most likely because our reference texts do not really match the language used in this forum.
But, hey, you now have the tools to do this much better on your own!
 
 
 
Thanks for bearing with me!
If you have any feedback or questions on these tutorials, contact me via christian-rauh.eu.