Course page on Piazza.
Goals: The course will be structured around student research projects. Students will work alone or in teams of up to three people. This work should motivate, describe, and evaluate a novel contribution to our understanding of topic modeling. Areas for work may include new statistical models, inference algorithms, evaluation techniques, design/interface improvements, or corpus-specific case studies. The final result will be a document and supporting materials that look not unlike a conference submission.
Motivation: Topic models, along with other unsupervised matrix factorization methods, have emerged as powerful techniques in machine learning and data science because they sit at an ideal point between simplicity and complexity. Data analysts like them because they provide sophisticated insight without attempting real natural language understanding. From an algorithmic standpoint, they are an excellent test case for new inference methods because the topic model objective is hard in the sense that it is multi-modal and non-convex, but it is also reasonably easy to find good approximate solutions.
Contact: Periodically meeting with me to discuss research ideas and course content is an important part of the course, both to pursue further angles and to clarify material. Send me two or three times when you could meet to schedule an appointment. We'll pick a time for drop-in office hours in class. I will try to reply to email within 24 hours.
Weekly Reviews: Each student will submit a short (< 500 words) review of the reading assignments by midnight on the Friday before class through CMS. These will be shared with the presenter for the paper to help organize our discussion. Reviews should consist of three paragraphs:
- A short summary of the main idea of the paper. What was the problem, what was done, and how was it evaluated? What were the results?
- Major points. What were the 1–3 main results for you? We are very often interested in specific parts of papers. What will you remember about this one? Feel free to be critical, but offer specific suggestions for improvement.
- Minor points. What else did you notice? This section could include things you thought were cool, or surprising, or that you would like to try.
Programming assignments: Starting with Week 2, we will have programming and mathematical assignments before, during, and after class. These will include:
- Warm-ups: These will be due before class. Examples include deriving equations and preparing data representations.
- In-class: These problems will be approached in small groups. Do not work alone; I want to hear conversation.
- Post-class: We will follow up on results from class in the week after each session.
Background reading: Dave Blei's article in Communications of the ACM is a good outline for this course.
Week 1, Mon Aug 29. Class organization. Exploring a topic model.
There will be no programming assignment for this week.
Week 2, Mon Sep 12. [Sep 5 is Labor Day, no classes]. Unsupervised semantic models, pre-LDA. Readings: (Review both, you have two weeks.)
- Landauer et al., An Introduction to Latent Semantic Analysis. (Laure)
- Brown et al., Class-Based n-gram Models of Natural Language.
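As background for the Landauer et al. reading: LSA is, at its core, a truncated SVD of a term-document matrix. A minimal numpy sketch (the function name, matrix shapes, and dimension k are illustrative, not from the paper):

```python
import numpy as np

def lsa(term_doc, k):
    """Latent Semantic Analysis sketch: truncated SVD of a term-document
    matrix. Returns k-dimensional term and document representations."""
    # Economy-size SVD: term_doc = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)
    term_vecs = U[:, :k] * S[:k]          # one k-dim row per term
    doc_vecs = (S[:k, None] * Vt[:k]).T   # one k-dim row per document
    return term_vecs, doc_vecs
```

In practice the term-document matrix is usually tf-idf weighted before the SVD, but the decomposition itself is the whole method.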
Week 3, Mon Sep 19. Probabilistic admixtures, the Dirichlet Multinomial. Readings: (Select one for your review.)
- Hofmann, Probabilistic Latent Semantic Analysis.
- Madsen et al., Modeling Word Burstiness Using the Dirichlet Distribution.
- [suggested] McCallum, Multi-label Text Classification with a Mixture Model Trained by EM.
Week 4, Mon Sep 26. Gibbs sampling and the topic model zoo, or, "Spot the Dirichlet multinomials!" Warm-up:
- Read this recap of the Dirichlet-multinomial distribution, and do the exercise at the end (derive the predictive distribution of a Dirichlet-multinomial).
Readings:
- Rosen-Zvi et al., The Author-Topic Model for Authors and Documents.
- Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model.
- Wallach, Topic Modeling: Beyond Bag-of-Words.
- Griffiths et al., Integrating Topics and Syntax.
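For reference after you've attempted the warm-up, the target result is the standard posterior-predictive derivation (written here in my own notation, with counts n_1, ..., n_K from n draws and prior Dir(alpha_1, ..., alpha_K)):

```latex
% The Dirichlet prior is conjugate to the multinomial, so the posterior is
%   p(\theta \mid \mathbf{n}, \alpha) = \mathrm{Dir}(\alpha_1 + n_1, \dots, \alpha_K + n_K).
% The predictive probability of the next draw is the posterior mean of \theta_k:
\begin{align}
p(x_{n+1} = k \mid \mathbf{n}, \alpha)
  &= \int \theta_k \, p(\theta \mid \mathbf{n}, \alpha) \, d\theta
   = \mathbb{E}\left[\theta_k \mid \mathbf{n}, \alpha\right] \\
  &= \frac{n_k + \alpha_k}{\, n + \sum_{j=1}^{K} \alpha_j \,}.
\end{align}
```

This ratio of smoothed counts is exactly the quantity that reappears in the collapsed Gibbs sampling updates for the models above.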
Week 5, Mon Oct 3. Evaluation. Reading: (Select two for your review.) IN CLASS: fill out this topic model evaluation form.
- Chang et al., Reading Tea Leaves: How Humans Interpret Topic Models.
- Wallach et al., Evaluation Methods for Topic Models.
- AlSumait et al., Topic Significance Ranking of LDA Generative Models.
- Newman et al., Automatic Evaluation of Topic Coherence.
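To make the evaluation discussion concrete, here is a sketch of a document co-occurrence coherence score in the spirit of these readings (the UMass-style log-ratio form; the function name and toy corpus format are my own, and real implementations add count thresholds and smoothing choices):

```python
from itertools import combinations
import math

def umass_coherence(top_words, docs):
    """Coherence of a topic's top words: sum over ordered pairs of
    log((D(wi, wj) + 1) / D(wi)), where D counts documents containing
    the given words and wi is ranked above wj. Higher is more coherent."""
    doc_sets = [set(d) for d in docs]
    def df(*words):
        # Number of documents containing all of `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))
    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        wi, wj = top_words[i], top_words[j]
        score += math.log((df(wi, wj) + 1) / df(wi))
    return score
```

Note the asymmetry: each pair is conditioned on the higher-ranked word, so word order within the topic matters.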
Week 6, Mon Oct 17. [No class Oct 10 for break] EM and Variational inference. Reading: Kevin Murphy, Chapters 11, 21, 27. (No review required.)
Week 7, Mon Oct 24. Stochastic inference. Spectral and matrix-based inference.
- Hoffman et al., Online Learning for Latent Dirichlet Allocation.
- [suggested] Hoffman et al., Stochastic Variational Inference.
- Background: Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees.
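The core move in Hoffman et al.'s online LDA is a stochastic (Robbins-Monro) update that blends the global topic-word variational parameters with an estimate computed from a minibatch. A minimal sketch of just that step (the tau0 and kappa values are illustrative, and the per-minibatch E-step that would produce lambda_hat is omitted):

```python
import numpy as np

tau0, kappa = 1.0, 0.7  # learning-rate parameters; kappa in (0.5, 1] for convergence

def online_step(lam, lambda_hat, t):
    """Blend old global parameters lam with minibatch estimate lambda_hat."""
    rho = (tau0 + t) ** (-kappa)  # step size, decaying with iteration t
    return (1.0 - rho) * lam + rho * lambda_hat
```

Because rho shrinks toward zero, early minibatches move the parameters a lot and later ones make only small corrections, which is what lets a single pass over a large corpus converge.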
Week 8, Mon Oct 31. Word embeddings. (I messed up the dates, read if you are able.)
- Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality.
- Pennington et al., GloVe: Global Vectors for Word Representation.
Week 9, Mon Nov 7. More word embeddings. Python code
- Levy and Goldberg, Neural Word Embedding as Implicit Matrix Factorization.
- Levy et al., Improving Distributional Similarity with Lessons Learned from Word Embeddings.
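Levy and Goldberg's result is that skip-gram with negative sampling implicitly factorizes a shifted PMI matrix. The explicit counterpart, which we can actually run in a few lines, is PPMI weighting followed by truncated SVD; a sketch (the function name and the dense numpy formulation are mine, and real corpora would need sparse matrices):

```python
import numpy as np

def ppmi_svd_embeddings(counts, dim):
    """Explicit matrix-factorization embeddings: positive PMI of a
    word-context co-occurrence matrix, then truncated SVD.
    counts: (V, V) array of word-context co-occurrence counts."""
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total   # word marginals
    pc = counts.sum(axis=0, keepdims=True) / total   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (pw * pc))
        # PPMI: keep only finite, positive associations.
        ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
    U, S, _ = np.linalg.svd(ppmi)
    return U[:, :dim] * S[:dim]   # one dim-dimensional vector per word
```

Levy et al.'s follow-up paper argues that much of the gap between such count-based methods and neural embeddings comes down to hyperparameter choices like this weighting.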
Week 10, Mon Nov 14. Hyperparameters and non-parametric models.
- Wallach, Mimno, McCallum, Rethinking LDA: Why Priors Matter.
- Gershman and Blei, A tutorial on Bayesian nonparametric models.
Week 11, Mon Nov 21. Networks, neural and otherwise.
- Lancichinetti et al., A high-reproducibility and high-accuracy method for automated topic classification.
- Wieting et al., CHARAGRAM: Embedding Words and Sentences via Character n-grams.
Week 12, Mon Nov 28. Project presentations.
Grading: 50% of the grade will be assigned for the independent research project. The remaining 50% will be based on class presentations and participation. Students will take turns leading in-class discussion of papers or presenting reports on implementations of simple algorithms. Students will also be expected to share ideas about their ongoing projects and provide constructive feedback to other members of the class, so participation will be essential. The beginning of the course will focus more on reading; the end will focus on your research projects.