TEXT MINING FOR HISTORY & LITERATURE

Date Subject Readings Assignments
8/30 Overview
9/2 Labor Day
9/4 - 9/6 Character Encodings + Philosophy Technical Reading: Bits to Characters

Discussion:

Programming 1: Character encodings
9/9 - 9/13 Tokenization and Counting Technical:

Discussion:
Programming 2: Tokenization, Counting
9/16 - 9/20 Sentiment analysis Technical:
Discussion: Summarize and comment on these two approaches to Vonnegut's theory.
Programming 3: Evaluate two sentiment lexicons; manually create a dictionary-based lexicon for an emotion.
9/23 - 9/27 Classification Technical:
Discussion:
Programming 4: What distinguishes History, Tragedy, and Comedy in Shakespeare's plays?
9/30 - 10/4 Measurements of uncertainty; Similarity and Divergence Technical:
Discussion:
Programming 5: Identifying change and variation over time by comparing documents in a sequence.
10/7 - 10/11 Similarity, Clustering, and Authorship Technical:
Discussion:

Shared data project: Construct a collection of The Federalist Papers.

Programming 6: Apply similarity functions to the Federalist Papers and see how they imply different clusterings.

10/14 Break
10/16 - 10/18 Similarity, Clustering, and Authorship Technical: No reading
Discussion: No Discussion!

Complete data project on Federalist Papers

Programming 6 Cont'd: Examine differences in keyword use; add clustering and tf-idf.

10/21 - 10/25 Corpus-building, clustering Discussion:
  • Allison et al., Stanford Lit Lab Pamphlet 1. Don't worry about Docuscope; focus on how similarity is being defined and used. How do our notions of genre and overlap relate to mathematical definitions of similarity and clustering?
Programming 7: IDF weighting. Build a collection from Project Gutenberg texts. Extending similarity to clustering. Distinguish Horror from non-Horror.
10/28 - 11/1 Corpus-building, clustering continued Discussion:
Choose TWO (2) of the following articles about constructing a collection:
Programming 7 Cont'd: IDF weighting. Build a collection from Project Gutenberg texts. Extending similarity to clustering. Distinguish Horror from non-Horror.
11/4 - 11/8 Topic modeling Technical:
Discussion:
Programming 8: Training, analyzing, and evaluating topic models.
11/11 - 11/15 Word embeddings Technical:
Discussion:
Mini-project is due Monday
Programming 9: Word embeddings, keywords in context, distance functions.
11/18 - 11/22 Tools for Hypothesis Testing Technical:
Discussion:
How do we make the connection between text analytics and persuasive arguments? What methods can help us convince ourselves that we're not reporting random values, and how can we explain to others what we've done?
11/25 Multiple hypotheses Technical:
Do we need to think about significance differently when we're running many experiments than when we're running just one?
11/27 - 11/29 Thanksgiving, no class