TEXT MINING
FOR
HISTORY & LITERATURE

Date Subject Readings Assignments
8/23 - 8/25 Overview Discussion:
8/28 - 9/1 From bits to files Technical:

Discussion:
Week 1: Python 3 warm up. UTF-8 vs. Unicode. Tokenization.
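As a warm-up for the Week 1 topics, here is a minimal sketch of the difference between Unicode text and its UTF-8 encoding, followed by a very simple tokenizer. The sample sentence and the `tokenize` helper are illustrative assumptions, not part of the course materials; real tokenizers make many more decisions (hyphens, apostrophes, numbers).

```python
import re

text = "Café society, naïve readers."

# A Python 3 str is a sequence of Unicode code points.
# UTF-8 is one way of encoding those code points as bytes.
encoded = text.encode("utf-8")    # bytes
decoded = encoded.decode("utf-8")  # back to str

# Non-ASCII characters such as "é" and "ï" take more than one byte in UTF-8,
# so the byte string is longer than the character string.
print(len(text), len(encoded))

def tokenize(s):
    """Lowercase the text and split it into runs of word characters.

    This is one simple, assumed tokenization rule among many possible ones.
    """
    return re.findall(r"\w+", s.lower())

tokens = tokenize(text)
print(tokens)
```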
9/4 Labor Day
9/6 - 9/8 Counting words, sentiment analysis Technical:
Discussion: summarize and comment on these two approaches to Vonnegut's theory.
Week 2: Evaluate two sentiment lexicons; manually create a dictionary-based lexicon for an emotion.
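A dictionary-based lexicon can be as simple as a set of words, with a text scored by the share of its tokens that appear in the set. The sketch below assumes a tiny hand-made "fear" lexicon; the word list, the sample sentences, and the `emotion_score` function are hypothetical examples rather than the lexicons evaluated in the assignment.

```python
import re
from collections import Counter

# Hypothetical hand-built lexicon for one emotion (fear); illustrative only.
FEAR_LEXICON = {"afraid", "dread", "fear", "terror", "tremble", "horror"}

def emotion_score(text, lexicon):
    """Return the fraction of tokens in `text` that appear in `lexicon`."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in lexicon)
    return hits / len(tokens)

calm = "The morning was bright and the garden was quiet."
tense = "She felt dread and terror; fear made her tremble."

print(emotion_score(calm, FEAR_LEXICON))
print(emotion_score(tense, FEAR_LEXICON))
```

Normalizing by text length keeps long and short passages comparable, but it does nothing about negation ("no fear") or irony, which is one reason lexicons need evaluating.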
9/11 - 9/15 Classification Technical:
Discussion:
What distinguishes History, Tragedy, and Comedy in Shakespeare's plays?
9/18 - 9/22 Similarity Technical:
Discussion:
Definitions of similarity and 18th-century novels. Can we detect genres?
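One common mathematical definition of similarity is the cosine of the angle between two word-count vectors. The sketch below is one possible implementation under that assumption; the helper names `word_counts` and `cosine_similarity` and the sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count lowercase word tokens in a text."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two Counter word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc1 = word_counts("The castle was dark and the night was cold.")
doc2 = word_counts("The night was dark and the castle was silent.")
doc3 = word_counts("Marriage settlements require careful negotiation.")

print(cosine_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc3))
```

Because cosine similarity ignores vector length, a long novel and a short one can still score as similar if they use words in the same proportions.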
9/25 - 9/29 Corpus-building, clustering Discussion:
  • Allison et al., Stanford Lit Lab Pamphlet 1. Don't worry about Docuscope, focus on how similarity is being defined and used. How do our notions of genre and overlap relate to mathematical definitions of similarity and clustering?
Build a collection from Project Gutenberg texts. Extending similarity to agglomerative clustering.
10/2 - 10/6 More clustering: Agglomerative and k-means Technical:
Discussion:
The idea of clustering is to find the distinctions present in collections rather than try to impose them. How do we know whether this is working? What are the factors we can control, and do they produce different results?
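Two of the factors we can control in agglomerative clustering are the number of clusters and the linkage rule used to measure the distance between clusters. The following is a small sketch, assuming documents have already been reduced to numeric points and that Euclidean distance is appropriate; the `agglomerate` function and the toy points are illustrative, not a course-provided implementation.

```python
import math

def agglomerate(points, n_clusters, linkage=min):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    linkage=min gives single linkage (closest pair of members);
    linkage=max gives complete linkage (farthest pair of members).
    Returns a list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        return linkage(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        best = None  # (distance, index of first cluster, index of second)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Toy data: two tight pairs and one distant point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 0)]

print(agglomerate(points, 3))
print(agglomerate(points, 2, linkage=max))
```

Rerunning with a different `n_clusters` or `linkage` is a quick way to see how much of a result depends on these choices rather than on the collection itself.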
10/9 Break, no class
10/11 - 10/13 Documents as mixtures: LSA/PCA Technical:
No Reading Response for this week, but look back at the PCA plots in Stanford Pamphlet 1.
Classification and clustering both make the assumption that a document can be represented as a single archetype (a class, a cluster) plus some "noise". This week we'll start looking at models of documents as combinations of multiple archetypes.
10/16 - 10/20 Topic models Technical:
Discussion:
Like LSA, a topic model represents each document as a combination of factors. The difference is that these "topics" are often easily identifiable as themes.
10/23 - 10/27 Topic models Technical:
Discussion (guest moderator Schofield):
What makes topic models useful, and what can go wrong?
10/30 - 11/3 Word embeddings Technical:
Discussion:
Word embeddings, keywords in context, KL divergence.
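KL divergence measures how poorly one probability distribution approximates another, which makes it useful for comparing word distributions from two collections. The sketch below assumes add-one smoothing so that every vocabulary word has nonzero probability; the function names and the toy token lists are hypothetical.

```python
import math
from collections import Counter

def smoothed_distribution(tokens, vocabulary, alpha=1.0):
    """Word probabilities over `vocabulary` with additive (add-alpha) smoothing."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / total for word in vocabulary}

def kl_divergence(p, q):
    """KL(P || Q) in bits. Assumes p and q share a vocabulary and q[w] > 0."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

tokens_a = "the whale the sea the ship".split()
tokens_b = "the love the heart the letter".split()
vocabulary = sorted(set(tokens_a) | set(tokens_b))

p = smoothed_distribution(tokens_a, vocabulary)
q = smoothed_distribution(tokens_b, vocabulary)

print(kl_divergence(p, q))
print(kl_divergence(p, p))
```

Note that KL divergence is not symmetric in general: KL(P || Q) and KL(Q || P) answer different questions.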
11/6 - 11/10 Making arguments Technical AND discussion:
How do we make the connection between text analytics and persuasive arguments? What do statistical methods do, and what do we need to add?
11/13 - 11/17 Tools for Hypothesis Testing Technical:
Discussion:
How do we make the connection between text analytics and persuasive arguments? What methods can help us convince ourselves that we're not reporting random values, and how can we explain to others what we've done?
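One method for checking that a difference is not just a random value is a permutation test: shuffle the group labels many times and see how often a difference at least as large as the observed one appears by chance. This is a minimal sketch under stated assumptions; the `permutation_test` function, its parameters, and the example numbers are invented for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an estimated p-value: the share of label shufflings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (count + 1) / (n_permutations + 1)

# Hypothetical per-document counts of some word in two groups of texts.
group_a = [10, 11, 12, 13, 14, 15]
group_b = [1, 2, 3, 4, 5, 6]

print(permutation_test(group_a, group_b))
```

Explaining the procedure to others is fairly direct: "if the labels did not matter, we would see a gap this large in roughly this fraction of random relabelings."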
11/20 Multiple hypotheses Technical:
Do we need to think about significance differently when we're running many experiments than when we're running just one?
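One simple, conservative answer is the Bonferroni correction: when running m tests, compare each p-value to alpha / m rather than alpha. The sketch below assumes that approach; the `bonferroni` function and the p-values are made-up examples, and other corrections exist.

```python
def bonferroni(p_values, alpha=0.05):
    """Return, for each p-value, whether it passes the Bonferroni threshold."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values from four separate experiments.
p_values = [0.001, 0.02, 0.04, 0.5]

uncorrected = [p <= 0.05 for p in p_values]
corrected = bonferroni(p_values)

print(uncorrected)
print(corrected)
```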
11/22 - 11/24 Thanksgiving, no class
11/27 - 12/1 Summary and Review Discussion:
How do we balance quantitative analysis with interpretation and cultural knowledge? What are the strengths and limitations of each style?