TEXT MINING
FOR
HISTORY & LITERATURE

Date	Subject	Readings	Assignments
8/30	Overview
9/2	Labor Day
9/4 - 9/6	Character Encodings + Philosophy	Technical Reading: Bits to Characters. Discussion: Stephen Ramsay, Toward an Algorithmic Criticism; Stephen Marche, Literature is not Data: Against Digital Humanities	Programming 1: Character encodings
9/9 - 9/13	Tokenization and Counting	Technical: David Zentgraf, What Every Programmer...To Work With Text; RegexOne: Learn Regular Expressions; Christopher Potts, Sentiment Symposium Tutorial: Tokenizing. Discussion: Ted Underwood, Where to Start with Text Mining; Thomas Dimson, Emojineering Part 1: Machine Learning for Emoji Trends	Programming 2: Tokenization, Counting
9/16 - 9/20	Sentiment analysis	Technical: Matt Jockers, "syuzhet", with follow-up work; UVM Computational Story Lab, The Shapes of Stories. Discussion: Summarize and comment on these two approaches to Vonnegut's theory.	Programming 3: Evaluate two sentiment lexicons; manually create a dictionary-based lexicon for an emotion.
9/23 - 9/27	Classification	Technical: Victor Powell, Conditional Probability Explained Visually; Eliezer S. Yudkowsky, An Intuitive Explanation of Bayes' Theorem. Talk: Francisco Iacobelli, Text Classification Using Naive Bayes. Discussion: Long and So, Literary Pattern Recognition (PDF), also discussed in this audio interview.	Programming 4: What distinguishes History, Tragedy, and Comedy in Shakespeare's plays?
9/30 - 10/4	Measurements of uncertainty; Similarity and Divergence	Technical: The five most common similarity measures implemented in Python (with cute animals). Discussion: Barron et al., Individuals, Institutions, and Innovation in the Debates of the French Revolution	Programming 5: Identifying change and variation over time by comparing documents in a sequence.
10/7 - 10/11	Similarity, Clustering, and Authorship	Technical: Testing Burrows's Delta by David L. Hoover. Focus on the introduction. Discussion: How Patrick Juola identified J.K. Rowling's pseudonym (also this more technical version).	Shared data project: Construct a collection of The Federalist Papers. Programming 6: We will apply similarity functions to the Federalist Papers, and see how they imply different clusterings.
10/14	Break
10/16 - 10/18	Similarity, Clustering, and Authorship	Technical: No reading. Discussion: No discussion!	Complete data project on Federalist Papers. Programming 6 cont'd: Examine differences in keyword use, add clustering and tf-idf.
10/21 - 10/25	Corpus-building, clustering	Discussion: Allison et al., Stanford Lit Lab Pamphlet 1. Don't worry about Docuscope; focus on how similarity is being defined and used. How do our notions of genre and overlap relate to mathematical definitions of similarity and clustering?	Programming 7: IDF weighting. Build a collection from Project Gutenberg texts. Extending similarity to clustering. Distinguish Horror from non-Horror.
10/28 - 11/1	Corpus-building, clustering continued	Discussion: Choose TWO (2) of the following articles about constructing a collection: Venice ‘time machine’ project suspended amid data row, Nature News; Mandell, Gendering Digital Literary History: What Counts for Digital Humanities; Gebru et al., Datasheets for Datasets	Programming 7 cont'd: IDF weighting. Build a collection from Project Gutenberg texts. Extending similarity to clustering. Distinguish Horror from non-Horror.
11/4 - 11/8	Topic modeling	Technical: Matthew Jockers, The LDA Buffet; Boyd-Graber, Hu, and Mimno, chapter 1 (read for intuition, don't worry too much about distributions). Discussion: Boyd-Graber, Hu, and Mimno, chapters 4 and 6	Programming 8: Training, analyzing, and evaluating topic models.
11/11 - 11/15	Word embeddings	Technical: For Wed: Sebastian Ruder, On word embeddings, Part I. There's lots of material on this site; read more as interested. Discussion: Ryan Heuser, Word vectors in the 18th century, episodes 1 and 2. UPDATE: this short document has more of the high-level insights from this work.	Mini-project is due Monday. Programming 9: Word embeddings, keywords in context, distance functions.
11/18 - 11/22	Tools for Hypothesis Testing	Technical: Dunning's g-test: blog post, Wikipedia article. The Bootstrap: video. [Not required, but of interest] Antoniak and Mimno, Evaluating the Stability of Embedding-based Word Similarities, 2017. Discussion: Fisher's Tea-Tasting experiment in Design of Experiments, chapter II. You do not need to read chapter I, but it's worth it if you do. (See also this summary of the experiment on Wikipedia.)	How do we make the connection between text analytics and persuasive arguments? What methods can help us convince ourselves that we're not reporting random values, and how can we explain to others what we've done?
11/25	Multiple hypotheses	Technical: Respectable scientific publication on multiple hypotheses.	Do we need to think about significance differently when we're running many experiments than when we're running just one?
11/27 - 11/29	Thanksgiving, no class
