Goals: The course will be structured around student research projects. Students will work alone or in teams of up to three people. This work should motivate, describe, and evaluate a novel contribution to our understanding of topic modeling. Areas for work may include new statistical models, inference algorithms, evaluation techniques, design/interface improvements, or corpus-specific case studies. The final result will be a document and supporting materials that look not unlike a conference submission.
Motivation: Topic models, and other unsupervised matrix factorization methods, have emerged as powerful techniques in machine learning and data science because they sit at an ideal point between simplicity and complexity. Data analysts like them because they provide sophisticated insight without attempting real natural language understanding. From an algorithmic standpoint, they are an excellent test case for new inference methods: the topic model objective is hard in the sense that it is multi-modal and non-convex, but it is also reasonably easy to find good approximate solutions.
Contact: Periodically meeting with me to discuss research ideas and course content is an important part of the course, both to pursue further angles and to clarify material. Send me two or three times when you could meet and I will schedule an appointment. We'll settle on a drop-in office hours slot in class. I will try to reply to email within 24 hours.
Weekly Reviews: Each student will submit a short (< 500 words) review of the reading assignments by midnight on the Friday before class through CMS. These will be shared with the presenter for the paper to help organize our discussion. Reviews should consist of three paragraphs:
- A short summary of the main idea of the paper. What was the problem, what was done, and how was it evaluated? What are the results?
- Major points. What were the 1–3 main results for you? We are very often interested in specific parts of papers. What will you remember about this one? Feel free to be critical, but offer specific suggestions or improvements.
- Minor points. What else did you notice? This section could include things you thought were cool, or surprising, or that you would like to try.
Programming assignments: Starting with Week 2, we will have programming / mathematical assignments before, during, and after class. These will come in three forms:
- Warm-ups: These will be due before class. Examples include deriving equations and preparing data representations.
- In-class: These problems will be approached in small groups. Do not work alone; I want to hear conversation.
- Post-class: We will follow up on results from class in the week after each session.
Background reading: Dave Blei's article in Communications of the ACM is a good outline for this course.
Week 1, Class organization. Discuss, implement, and document Monroe et al., Fightin' Words, using the Stats vs. Data Science Stack Exchange data.
There will be no programming assignment for this week.
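For concreteness, here is a rough sketch of the core Fightin' Words calculation: the weighted log-odds ratio with an informative Dirichlet prior, z-scored by its approximate standard deviation. It assumes two dictionaries mapping observed words to positive counts, one per group; the variable names and the prior scaling are illustrative choices, not the paper's reference implementation.

    import numpy as np

    def fightin_words(counts_a, counts_b, prior_scale=0.01):
        """Rank words by z-scored log-odds of appearing in group A vs. group B."""
        vocab = sorted(set(counts_a) | set(counts_b))
        y_a = np.array([counts_a.get(w, 0) for w in vocab], dtype=float)
        y_b = np.array([counts_b.get(w, 0) for w in vocab], dtype=float)

        # Informative Dirichlet prior proportional to overall corpus frequency.
        alpha = prior_scale * (y_a + y_b)
        alpha0 = alpha.sum()
        n_a, n_b = y_a.sum(), y_b.sum()

        # Smoothed log-odds of each word within each group.
        log_odds_a = np.log(y_a + alpha) - np.log(n_a + alpha0 - y_a - alpha)
        log_odds_b = np.log(y_b + alpha) - np.log(n_b + alpha0 - y_b - alpha)
        delta = log_odds_a - log_odds_b

        # Approximate variance of the difference, then a z-score.
        sigma2 = 1.0 / (y_a + alpha) + 1.0 / (y_b + alpha)
        z = delta / np.sqrt(sigma2)
        return sorted(zip(vocab, z), key=lambda pair: -pair[1])

Words with large positive z-scores are distinctive of the first group; large negative z-scores mark the second.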
Week 2, Unsupervised semantic models with matrix factorization. Readings: (Review both; you have two weeks.)
- Landauer et al., An Introduction to Latent Semantic Analysis.
- Brown et al., Class-Based n-gram Models of Natural Language.
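As a preview of what Landauer et al. describe, LSA is essentially a truncated SVD of a (weighted) term-document count matrix. A minimal sketch using scikit-learn, with toy documents standing in for a real corpus:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "documents about pets and furniture"]

    X = CountVectorizer().fit_transform(docs)   # sparse documents-by-terms count matrix
    svd = TruncatedSVD(n_components=2)          # keep a 2-dimensional latent space
    doc_vectors = svd.fit_transform(X)          # dense low-rank document representations
    term_vectors = svd.components_.T            # low-rank term representations

Classic LSA typically applies a term weighting (for example tf-idf or log-entropy) before the SVD; raw counts are used here only to keep the example short.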
Week 3, Probabilistic admixtures, the Dirichlet Multinomial. Readings: (Select one for your review.)
- Hofmann, Probabilistic Latent Semantic Analysis.
- Wallach, Mimno, McCallum, Rethinking LDA: Why Priors Matter.
- [suggested] McCallum, Multi-label Text Classification with a Mixture Model Trained by EM.
Week 4, Gibbs sampling, the topic model zoo, or, "Spot the Dirichlet Multinomials!" Warm-up:
- Read this recap of The Dirichlet-multinomial distribution, and do the exercise at the end (derive the predictive distribution of a Dirichlet-multinomial).
Readings:
- Rosen-Zvi et al., The Author-Topic Model for Authors and Documents.
- Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model.
- Wallach, Topic Modeling: Beyond Bag-of-Words.
- Griffiths et al., Integrating Topics and Syntax.
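To make the "spot the Dirichlet multinomials" game concrete, here is a bare-bones sketch of one sweep of collapsed Gibbs sampling for vanilla LDA. The array names and shapes are illustrative, and a real sampler also needs initialization, many sweeps, and hyperparameter handling.

    import numpy as np

    def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
        """One pass over the corpus, resampling each token's topic assignment.
        docs[d] is a list of word ids; z[d][i] is the current topic of token i."""
        K, V = n_kw.shape
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                k_old = z[d][i]
                # Remove this token's current assignment from all count tables.
                n_dk[d, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
                # Conditional posterior: document-topic term times topic-word term.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k_new = rng.choice(K, p=p / p.sum())
                # Add the token back under its newly sampled topic.
                n_dk[d, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1
                z[d][i] = k_new

The two factors in the conditional are the document-topic and topic-word Dirichlet-multinomial predictive distributions from the warm-up, which is the pattern to look for in this week's papers.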
Week 5, Word embeddings. word2vec and GloVe.
- Bojanowski et al., Enriching Word Vectors with Subword Information.
- Levy and Goldberg, Neural Word Embedding as Implicit Matrix Factorization.
- Levy et al, Improving Distributional Similarity with Lessons Learned from Word Embeddings.
- Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality.
- Pennington et al., GloVe: Global Vectors for Word Representation.
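One thread connecting these readings, Levy and Goldberg in particular, is that skip-gram with negative sampling implicitly factorizes a shifted PMI matrix. A toy sketch of the explicit route, positive PMI followed by SVD; cooc is assumed to be a dense word-by-word co-occurrence count matrix in which every word appears at least once, and the shift and eigenvalue weighting are simplifying choices:

    import numpy as np

    def ppmi_svd(cooc, dim=50, shift=1.0):
        """Word vectors from a positive (shifted) PMI matrix via SVD."""
        total = cooc.sum()
        p_w = cooc.sum(axis=1, keepdims=True) / total   # P(word)
        p_c = cooc.sum(axis=0, keepdims=True) / total   # P(context)
        with np.errstate(divide="ignore"):
            pmi = np.log(cooc / total) - np.log(p_w * p_c) - np.log(shift)
        ppmi = np.maximum(pmi, 0.0)   # clip negatives (and the -inf at zero counts)
        U, S, _ = np.linalg.svd(ppmi)
        return U[:, :dim] * np.sqrt(S[:dim])

Levy et al.'s "lessons learned" paper is largely about how much such seemingly minor choices matter.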
Week 7, Stochastic inference. Spectral and matrix-based inference.
- Hoffman et al., Online Learning for Latent Dirichlet Allocation.
- [suggested] Hoffman et al., Stochastic Variational Inference.
- Background: Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees
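In rough terms, the heart of Hoffman et al.'s online LDA is a stochastic natural-gradient step: for each minibatch t, compute a noisy estimate \hat{\lambda}_t of the topic parameters as if that minibatch, rescaled to the size of the corpus, were the entire collection, then blend it into the running estimate with a decaying step size

    \rho_t = (\tau_0 + t)^{-\kappa}, \qquad \lambda \leftarrow (1 - \rho_t)\,\lambda + \rho_t\,\hat{\lambda}_t,

where \kappa \in (0.5, 1] keeps the step sizes within the usual Robbins-Monro conditions.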
Week 8, Topic Model Evaluation. Readings: (Select two for your review.) IN CLASS: fill out this topic model evaluation form.
- Chang et al., Reading Tea Leaves: How Humans Interpret Topic Models.
- Wallach et al., Evaluation Methods for Topic Models.
- AlSumait et al., Topic Significance Ranking of LDA Generative Models.
- Newman et al., Automatic Evaluation of Topic Coherence.
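As a preview of the automated side of evaluation, here is a rough sketch of a co-occurrence-based coherence score in the spirit of Newman et al.: the average pointwise mutual information between a topic's top words. Using whole documents as the co-occurrence unit and the small smoothing constant are simplifying choices, not the paper's exact settings.

    import itertools, math

    def topic_coherence(top_words, docs, eps=1e-12):
        """Average PMI over pairs of top words; docs is a list of token sets."""
        n_docs = len(docs)
        def df(*words):
            return sum(1 for doc in docs if all(w in doc for w in words))
        scores = []
        for w1, w2 in itertools.combinations(top_words, 2):
            p1, p2 = df(w1) / n_docs, df(w2) / n_docs
            p12 = df(w1, w2) / n_docs
            scores.append(math.log((p12 + eps) / (p1 * p2 + eps)))
        return sum(scores) / len(scores)

Scores like this correlate with, but do not replace, the human judgments studied in the other readings.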
Week 9, Embeddings part II: non-Euclidean, bilingual, evaluation.
- Nickel et al., Poincaré Embeddings for Learning Hierarchical Representations
- Schnabel et al., Evaluation methods for unsupervised word embeddings
- Upadhyay et al., Cross-lingual Models of Word Embeddings: An Empirical Comparison (a good survey of both bilingual embeddings and evaluations)
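The geometric idea behind the Poincaré embeddings paper is the hyperbolic distance on the open unit ball, which grows rapidly near the boundary and so gives even low-dimensional embeddings room for tree-like hierarchies. The distance function itself is short (the Riemannian optimization that fits the embeddings is the harder part and is omitted here):

    import numpy as np

    def poincare_distance(u, v):
        """Hyperbolic distance between two points strictly inside the unit ball."""
        uu, vv = np.dot(u, u), np.dot(v, v)        # squared norms, assumed < 1
        duv = np.dot(u - v, u - v)
        return np.arccosh(1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv)))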
Week 10, Contextualized embeddings and BERT.
- Wieting et al., CHARAGRAM: Embedding Words and Sentences via Character n-grams
- Peters et al., Deep contextualized word representations
- Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, with code
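For experimenting with contextualized representations, one convenient route (not the papers' own code, and postdating these readings) is the Hugging Face transformers library. A minimal example of pulling per-token vectors out of a pretrained BERT:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Topic models and BERT are not the same thing.",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    token_vectors = outputs.last_hidden_state   # shape (1, num_wordpieces, 768)

Unlike the static vectors from Weeks 5 and 9, the same wordpiece gets a different vector in every sentence it appears in.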
Week 11, Networks.
- Lancichinetti et al., A high-reproducibility and high-accuracy method for automated topic classification.
- Ball et al., An efficient and principled method for detecting communities in networks.
Week 12, Project presentations.
Grading: 50% of the grade will be assigned for the independent research project. The remaining 50% will be based on class presentations and participation. Students will take turns leading in-class discussion of papers or presenting reports on implementations of simple algorithms. Students will also be expected to share ideas about their ongoing projects and provide constructive feedback to other members of the class, so participation will be essential. The beginning of the course will focus more on reading; the end will focus on your research projects.