Using Text as Data with R

We live in a digital capitalist world awash in unimaginable amounts of data. A considerable portion of that data takes the form of written language: websites, books, blogs, social media posts, tweets, search results, browser histories, survey responses; the list goes on and on. In recent years social science researchers, well behind enthusiastic adopters in other disciplines (like computer science), have taken serious interest in using text as data in their research. Creative use of text-based data for socialist data science is a main area of interest in my own research. The first DS4CS course on socialist data science will therefore be all about finding, preparing, and using text as data to advance the cause of socialism.

As I learn more about text-based data science, this course will be updated over time with new modules. The first series of units deals with scraping, preparing, and analyzing a corpus of public domain Marxist texts from the Marxists Internet Archive. The MIA Corpus currently contains over 7,000 documents by 42 authors, so delving into this data set is a substantial project in itself. The initial units of the course are dedicated to exploring this fascinating body of texts. Beyond the study of literature, which certainly has its merits, future modules will home in on more specific real-world applications of text as data. I am sure that text-based data has many practical uses for socialists, but the field is wide open at the moment.

Unit 1: Tidy Text Scraping, Cleaning and Processing with Karl Marx and R

Learn about tidy methods of scraping, cleaning, and processing text data using the tidytext package. In the first unit, find out how to gather and prepare a corpus of machine-readable text using all three volumes of Karl Marx’s Capital. Tidy text scraping methods are demonstrated for three of the most common forms text data comes in: HTML text from web pages, PDF files, and MS Word documents. Before concluding, some very basic uses of tidy text data are shown, such as computing summary statistics and visualizing with ggplot2.
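To give a flavor of the workflow, here is a minimal sketch of the scrape-and-tokenize pattern using rvest and tidytext. The inline HTML snippet is a placeholder standing in for a real chapter page, not course data; in the unit itself the source would be a live page fetched by URL.

```r
# Sketch: parse HTML, pull paragraph text, and tidy it into one-word-per-row form.
# The html_source string is a hypothetical stand-in for a scraped chapter page.
library(rvest)
library(dplyr)
library(tidytext)

html_source <- "<html><body>
  <p>The wealth of those societies appears as an immense accumulation of commodities.</p>
  <p>A commodity is, in the first place, an object outside us.</p>
</body></html>"

paragraphs <- read_html(html_source) |>
  html_elements("p") |>   # paragraph nodes hold the body text
  html_text2()            # extract readable text from each node

tidy_words <- tibble(paragraph = seq_along(paragraphs), text = paragraphs) |>
  unnest_tokens(word, text) |>         # one lowercase token per row
  anti_join(stop_words, by = "word")   # drop common English stop words

count(tidy_words, word, sort = TRUE)   # a basic frequency summary
```

The same tidy data frame feeds directly into ggplot2 for visualization, which is the payoff of keeping text in one-token-per-row form.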

Unit 2: Peering into Marx’s Critique of Political Economy with Text Analysis using Quanteda

Learn how to use the powerful quanteda ecosystem of packages to conduct text analysis on all three volumes of Karl Marx’s Capital. Topics covered in this unit include processing, cleaning, and preparing text data for use with quanteda, and analyzing text with frequency-based methods, term frequency-inverse document frequency (tf-idf), statistical keyword measures, word scaling models, and finding the local context for keywords. The unit also covers many of the visualization options offered by the quanteda.textplots extension.
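The core quanteda pipeline covered in this unit can be sketched in a few lines. The two short quoted sentences below are placeholders, not the actual Capital corpus; with real data you would build the corpus from the prepared text data frame.

```r
# Sketch: corpus -> tokens -> document-feature matrix -> weighting and context.
# texts_vector is a hypothetical stand-in for the three volumes of Capital.
library(quanteda)
library(quanteda.textstats)

texts_vector <- c(
  vol1 = "The wealth of societies appears as an immense accumulation of commodities.",
  vol2 = "The circuit of money capital is the most one-sided form of the circuit."
)

corp <- corpus(texts_vector)

toks <- tokens(corp, remove_punct = TRUE) |>
  tokens_tolower() |>
  tokens_remove(stopwords("en"))   # strip English stop words

dfmat <- dfm(toks)                 # document-feature matrix
tfidf <- dfm_tfidf(dfmat)          # tf-idf weighting

textstat_frequency(dfmat, n = 10)        # most frequent features
kwic(toks, pattern = "capital", window = 3)  # keywords in context
```

From the same tokens and dfm objects, the quanteda.textplots extension provides ready-made visualizations such as word clouds and keyness plots.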

Unit 3: Structural Topic Modeling on the (mostly) Collected Works of Marx and Engels with R