INFO 6150 / CS 6788

David Mimno, T Th, Spring 2021

The following is a syllabus from 2018; this year is likely to be similar.

Goals: The course will be structured around student research projects. Students will work alone or in teams of up to three people. This work should motivate, describe, and evaluate a novel contribution to our understanding of topic modeling. Areas for work may include new statistical models, inference algorithms, evaluation techniques, design/interface improvements, or corpus-specific case studies. The final result will be a document and supporting materials that look not unlike a conference submission.

Motivation: Topic models, and other unsupervised matrix factorization methods, have emerged as a powerful technique in machine learning and data science because they sit at an ideal point between simplicity and complexity. Data analysts like them because they provide sophisticated insight without attempting real natural language understanding. From an algorithmic standpoint, they are an excellent test case for new inference methods because the topic model objective is hard in the sense that it is multi-modal and non-convex, but it is also reasonably easy to find good approximate solutions.

Contact: Meeting with me periodically to discuss research ideas and course content is an important part of the course, both to pursue further angles and to clarify material. To schedule an appointment, send me two or three times when you could meet. We will choose a time for drop-in office hours in class. I will try to reply to email within 24 hours.

Weekly Reviews: Each student will submit a short (< 500 words) review of the reading assignments by midnight on the Friday before class through CMS. These will be shared with the presenter for the paper to help organize our discussion. Reviews should consist of three paragraphs:

  1. A short summary of the main idea of the paper. What was the problem, what was done, and how was it evaluated? What are the results?
  2. Major points. What were the 1–3 main results for you? We are very often interested in specific parts of papers. What will you remember about this one? Feel free to be critical, but offer specific suggestions or improvements.
  3. Minor points. What else did you notice? This section could include things you thought were cool, or surprising, or that you would like to try.

Programming assignments: Starting with Week 2, we will have programming / mathematical assignments before, during, and after class:

  • Warm-ups: These will be due before class. Examples include deriving equations and preparing data representations.
  • In-class: These problems will be approached in small groups. Do not work alone; I want to hear conversation.
  • Post-class: We will follow up on results from class in the week after each session.

Background reading: Dave Blei's article in Communications of the ACM is a good outline for this course.

Week 1, Class organization. Discuss, implement, and document Monroe et al., Fightin' Words. Stats vs. Data Science Stack Exchange data.
There will be no programming assignment for this week.
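
As a starting point for the Week 1 discussion, here is a minimal sketch of the z-scored log-odds ratio with a Dirichlet prior, the central statistic in Monroe et al. It rests on several assumptions that are not from the syllabus: the function name, the toy word counts, and a uniform prior (prior_strength per word) where the paper's informative variant would use background-corpus counts.

```python
import math
from collections import Counter


def fightin_words_z_scores(counts_a, counts_b, prior_strength=0.01):
    """Z-scored log-odds ratio of word use in corpus A vs. corpus B.

    Assumption: a uniform Dirichlet prior (prior_strength per word),
    not the informative background-corpus prior from the paper.
    """
    vocab = set(counts_a) | set(counts_b)
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    prior_total = prior_strength * len(vocab)

    z_scores = {}
    for word in vocab:
        # Smoothed counts: observed count plus prior pseudo-count.
        y_a = counts_a.get(word, 0) + prior_strength
        y_b = counts_b.get(word, 0) + prior_strength
        # Log-odds of the word within each corpus.
        log_odds_a = math.log(y_a / (n_a + prior_total - y_a))
        log_odds_b = math.log(y_b / (n_b + prior_total - y_b))
        delta = log_odds_a - log_odds_b
        # Approximate variance of the log-odds difference.
        variance = 1.0 / y_a + 1.0 / y_b
        z_scores[word] = delta / math.sqrt(variance)
    return z_scores


# Toy counts, invented for illustration only.
stats = Counter("the model samples topics topics topics".split())
datasci = Counter("the data science answers answers answers".split())

z = fightin_words_z_scores(stats, datasci)
for word, score in sorted(z.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:10s} {score:+.3f}")
```

Positive scores mark words that are more characteristic of the first corpus, negative scores the second. Words used equally in both corpora, like "the" in the toy data, score near zero.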

Week 2, Unsupervised semantic models with matrix factorization. Readings: (Review both; you have two weeks.)

Week 3, Probabilistic admixtures, the Dirichlet Multinomial. Readings: (Select one for your review.)

Week 4, Gibbs sampling, The topic model zoo, or, "Spot the Dirichlet Multinomials!" Warm-up:

Readings: (We will cover each of these briefly; select one for your review.)

Week 5, Word embeddings. word2vec and GloVe.

We will also refer to:

Week 7, Stochastic inference. Spectral and matrix-based inference.

Week 8, Topic Model Evaluation. Reading: (Select two for your review.) IN CLASS: fill out this topic model evaluation form.

Week 9, Embeddings part II: non-Euclidean, bilingual, evaluation.

Week 10, Contextualized embeddings and BERT.

Week 11, Networks.

Week 12, Project presentations.

Grading: The independent research project accounts for 50% of the grade. The remaining 50% will be based on class presentations and participation. Students will take turns leading in-class discussion of papers or presenting reports on implementations of simple algorithms. Students will also be expected to share ideas about their ongoing projects and provide constructive feedback to other members of the class, so participation will be essential. The beginning of the class will focus more on reading; the end of the class will focus on your research projects.