Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure. David Mimno, David M Blei, Barbara E Engelhardt. Proceedings of the National Academy of Sciences. 112(26):E3341–50, 2015. PNAS arXiv preprint

Admixture models represent the alleles in a genome as a combination of latent ancestral populations. We apply posterior predictive checks to evaluate the quality of fit with respect to functions of interest to population biologists.

How Social Media Non-use Influences the Likelihood of Reversion: Self Control, Being Surveilled, Feeling Freedom, and Socially Connecting. Eric P. S. Baumer, Shion Guha, Emily Quan, David Mimno, and Geri Gay. Social Media and Society, 2015.

Robust Spectral Inference for Joint Stochastic Matrix Factorization. Moontae Lee, David Bindel, and David Mimno. NIPS, 2015, Montreal, QC, Canada.

Evaluation methods for unsupervised word embeddings. Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. EMNLP, 2015, Lisbon, Portugal. PDF Data

Word embeddings appear to represent meaning with geometry. But evaluating this property has been difficult. Getting humans to rate the semantic similarity of words is error-prone and time-consuming. We use a simpler method to collect human judgments based on pairwise comparisons and odd-one-out detection tasks.

Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. Jordan Boyd-Graber, David Mimno, and David Newman. In Handbook of Mixed Membership Models and Their Applications, CRC/Chapman Hall, 2014. PDF

A description of common pre-processing steps and model checking diagnostics.

Low-dimensional Embeddings for Interpretable Anchor-based Topic Inference. Moontae Lee and David Mimno. EMNLP, 2014, Doha, Qatar. (Selected for oral presentation). PDF

In this work we trade an approximate solution to an exact problem for an exact solution to an approximation. We use a proven method in data visualization, the t-SNE projection, to compress a high-dimensional word co-occurrence space to a visualizable two- or three-dimensional space, and then find an exact convex hull. The corners of this convex hull become the anchor words for topics. We find better topics with more salient anchor words, while also improving the interpretability of the algorithm.
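The last step described above, finding an exact convex hull in a two-dimensional space and treating its corners as anchor words, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the 2D coordinates have already been produced by some projection (the t-SNE step is not shown), the function names `convex_hull` and `anchor_words` are hypothetical, and the hull is computed with the standard monotone chain method.

```python
def _cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the corner points of the 2D convex hull (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Pop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def anchor_words(embedding):
    """Words whose 2D points are corners of the hull (hypothetical helper).

    `embedding` maps each word to an (x, y) tuple from an assumed
    low-dimensional projection of the word co-occurrence space.
    """
    corners = set(convex_hull(list(embedding.values())))
    return {word for word, point in embedding.items() if point in corners}
```

Words that land strictly inside the hull, or on an edge between two corners, are not selected; only the corners become candidate anchors.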

Significant Themes in 19th-Century Literature. Matthew L. Jockers, David Mimno. Poetics, Dec 2013. Preprint

Models of literature are usually used for exploratory data analysis, but they can also be used to evaluate specific conjectures. We use permutation tests, bootstrap tests, and posterior predictive checks to test some hypotheses about associations between gender, anonymity, and literary themes.

Random Projections for Anchor-based Topic Inference. David Mimno. NIPS workshop on Randomized Methods, 2013. PDF

Random projections allow us to scale anchor-based topic finding algorithms to large vocabularies. Projections with structured sparsity, for example holding the number of non-zeros in each row of the random projection fixed, produce better results.
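A projection with a fixed number of non-zeros in each row can be sketched as below. This is an illustrative sketch under stated assumptions rather than the construction used in the paper: the non-zero entries are assumed to be +1 or -1 with equal probability, and the names `sparse_projection` and `project` are hypothetical.

```python
import random


def sparse_projection(n_rows, n_cols, nonzeros_per_row, seed=0):
    """Build an n_rows x n_cols matrix with exactly `nonzeros_per_row`
    non-zero entries in every row (values of +1/-1 are an assumption)."""
    rng = random.Random(seed)
    matrix = []
    for _ in range(n_rows):
        row = [0.0] * n_cols
        # Choose distinct column positions for this row's non-zeros.
        for j in rng.sample(range(n_cols), nonzeros_per_row):
            row[j] = rng.choice((-1.0, 1.0))
        matrix.append(row)
    return matrix


def project(vector, matrix):
    """Map a length-n_rows vector to n_cols dimensions (vector times matrix)."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(n_cols)]
```

Here each row would correspond to one original dimension (for example, one vocabulary word), so every original dimension contributes to the same number of projected dimensions.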

A Practical Algorithm for Topic Modeling with Provable Guarantees. Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu. ICML, 2013, Atlanta, GA. (Selected for long-form presentation). PDF

Spectral algorithms for LDA have been useful for proving bounds on learning, but it hasn't been clear that they are useful in practice. This paper presents an algorithm that both maintains theoretical guarantees and also provides extremely fast inference. We compared this new algorithm directly to standard MCMC methods on a number of metrics for synthetic and real data.

Scalable Inference of Overlapping Communities. Prem Gopalan, David Mimno, Sean Gerrish, Michael J. Freedman, David Blei. NIPS, 2012, Lake Tahoe, NV. (Selected for spotlight presentation)

Sparse stochastic inference for latent Dirichlet allocation. David Mimno, Matthew Hoffman and David Blei. ICML, 2012, Edinburgh, Scotland. (Selected for long-form presentation). PDF

Gibbs sampling can be fast if data is sparse, but doesn't scale because it requires us to keep a state variable for every data point. Online stochastic inference can be fast and uses constant memory, but doesn't scale because it can't leverage sparsity. We present a method that uses Gibbs sampling in the local step of a stochastic variational algorithm. The resulting method can process a 33 billion word corpus of 1.2 million books with thousands of topics on a single CPU.

Computational Historiography: Data Mining in a Century of Classics Journals. David Mimno. ACM J. of Computing in Cultural Heritage. 5, 1, Article 3 (April 2012), 19 pages. PDF

Topic Models for Taxonomies. Anton Bakalov, Andrew McCallum, Hanna Wallach, and David Mimno. Joint Conference on Digital Libraries (JCDL) 2012, Washington, DC. PDF

Database of NIH grants using machine-learned categories and graphical clustering. Edmund M Talley, David Newman, David Mimno, Bruce W Herr II, Hanna M Wallach, Gully A P C Burns, A G Miriam Leenders and Andrew McCallum. Nature Methods, Volume 8(7), June 2011, pp. 443–444. HTML

What does the NIH fund, and how are scientific disciplines divided between institutes? In this paper we created a visualization of 100,000 accepted proposals and the 200,000 journal publications associated with those grants.

Reconstructing Pompeian Households. David Mimno. UAI, 2011, Barcelona, Spain. (selected for oral presentation) PDF

Houses in Pompeii have several architecturally distinct types of rooms, but it's not always clear what the function of these rooms was, or even if there was a consistent pattern of use across different houses. This work uses statistical models to predict the artifacts found in different rooms.

Bayesian Checking for Topic Models. David Mimno, David Blei. EMNLP, 2011, Edinburgh, Scotland. (selected for oral presentation) PDF

This paper measures the degree to which data fits the assumptions of topic models. Topic models represent documents as mixtures of simple, static multinomial distributions over words. We know that real documents exhibit "burstiness": when a word occurs in a document, it tends to occur many times. In this paper, we use a method from Bayesian model checking, posterior predictive checks, to measure the difference between the burstiness we observed and the expectation of the model. We then use this method to search for clusterings of documents, based on time or other observed groupings, that best explain the observed burstiness.
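The general shape of a posterior predictive check can be sketched as follows. This is a simplified illustration, not the paper's procedure: it assumes a single multinomial word distribution per posterior sample (no topic mixture), uses the count of the most repeated word as a crude stand-in for a burstiness discrepancy, and the function names are hypothetical.

```python
import random
from collections import Counter


def max_word_count(document):
    """Discrepancy function: count of the most repeated word in a document
    (an assumed, crude proxy for burstiness)."""
    return max(Counter(document).values())


def posterior_predictive_p_value(observed_doc, vocabulary, posterior_samples,
                                 discrepancy, seed=0):
    """Fraction of replicated documents whose discrepancy is at least as
    large as the observed one.

    `posterior_samples` is a list of word-probability vectors, assumed to be
    draws from the model's posterior; one replicated document of the same
    length as the observed one is simulated per draw.
    """
    rng = random.Random(seed)
    observed_value = discrepancy(observed_doc)
    exceed = 0
    for word_probs in posterior_samples:
        replicated = rng.choices(vocabulary, weights=word_probs,
                                 k=len(observed_doc))
        if discrepancy(replicated) >= observed_value:
            exceed += 1
    return exceed / len(posterior_samples)
```

A value near zero indicates that the model almost never reproduces the amount of repetition seen in the observed document, which is the kind of misfit the check is meant to expose.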

Optimizing Semantic Coherence in Topic Models. David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, Andrew McCallum. EMNLP, 2011, Edinburgh, Scotland. (selected for oral presentation) PDF

We introduce a metric for detecting semantic errors in topic models and develop a completely unsupervised model that specifically tries to improve this metric. Topic models provide a useful method for organizing large document collections into a small number of meaningful word clusters. In practice, however, many topics contain obvious semantic errors that may not reduce predictive power, but significantly weaken user confidence. We find that measuring the probability that lower-ranked words in a topic co-occur in documents with higher-ranked words beats all current methods for detecting a large class of low-quality topics.
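A co-occurrence score of the kind described above can be sketched as follows. This is a minimal sketch under assumptions: for every pair of top words it takes the log of the smoothed fraction of documents containing the higher-ranked word that also contain the lower-ranked word, and sums over pairs. The add-one smoothing constant and the function name `topic_coherence` are assumptions made for illustration.

```python
import math


def topic_coherence(top_words, documents, smoothing=1.0):
    """Sum of log smoothed co-document frequencies over word pairs.

    `top_words` is ordered from highest to lowest rank in the topic;
    `documents` is a list of token lists. Scores closer to zero mean the
    lower-ranked words tend to appear in documents that contain the
    higher-ranked words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            higher, lower = top_words[l], top_words[m]
            denom = doc_freq(higher)
            if denom == 0:
                continue  # the higher-ranked word never occurs; skip the pair
            score += math.log((doc_freq(higher, lower) + smoothing) / denom)
    return score
```

Because the score only needs document-level co-occurrence counts from the training corpus, it requires no labels or external reference collection.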

Measuring Confidence in Temporal Topic Models with Posterior Predictive Checks. David Mimno, David Blei. NIPS Workshop on Computational Social Science and the Wisdom of Crowds, 2010, Whistler, BC.

Rethinking LDA: Why Priors Matter. Hanna Wallach, David Mimno and Andrew McCallum. NIPS, 2009, Vancouver, BC. PDF Supplementary Material

Empirically, we have found that optimizing Dirichlet hyperparameters for document-topic distributions in topic models makes a huge difference: topics are not dominated by very common words and topics are more stable as the number of topics increases. In this paper we explore the effects of Dirichlet priors on topic models. The best structure seems to be an asymmetric prior over document-topic distributions and a symmetric prior over topic-word distributions, currently implemented in the MALLET toolkit.

Reconstructing Pompeian Households. David Mimno. Applications of Topic Models Workshop, NIPS 2009, Whistler, BC. House data Artifact data PDF (selected for oral presentation)

Pompeii provides a unique view into daily life in a Roman city, but the evidence is noisy and incomplete. This work applies statistical data mining methods originating in text analysis to a database of artifacts found in 30 houses in Pompeii.

Polylingual Topic Models. David Mimno, Hanna Wallach, Jason Naradowsky, David Smith and Andrew McCallum. EMNLP, 2009, Singapore. PDF

Standard statistical topic models do not handle multiple languages well, but many important corpora -- particularly outside scientific publications -- contain a mix of many languages. We show that with simple modifications, topic models can leverage not only direct translations but also comparable collections like Wikipedia articles. We demonstrate the system on European parliament proceedings in 12 languages and comparable Wikipedia articles in 14 languages. Code is available in the cc.mallet.topics.PolylingualTopicModel class in the MALLET toolkit.

Evaluation Methods for Topic Models. Hanna Wallach, Iain Murray, Ruslan Salakhutdinov and David Mimno. ICML, 2009, Montreal, Quebec. PDF

Held-out likelihood experiments provide an important complement to task-specific evaluations in topic models. We evaluate several methods for calculating held-out likelihoods. Several previously used methods, especially the harmonic mean method, show poor accuracy and high variance compared to a "Chib-style" method and a particle filter-inspired method.

Efficient Methods for Topic Model Inference on Streaming Document Collections. Limin Yao, David Mimno and Andrew McCallum. KDD, 2009, Paris, France. PDF slides on fast sampling

Statistical topic modeling has become popular in text processing, but remains computationally intensive. It is often impossible to run standard inference methods on collections because of limited space (e.g., large IR corpora) and time (e.g., streaming corpora). In this paper we evaluate a number of methods for lightweight online topic inference, based on models trained from computationally expensive offline processes. In addition, we present SparseLDA, a new data structure and algorithm for Gibbs sampling in multinomial mixture models (such as LDA) that offers substantial improvements in speed and memory usage. A parallelized version of this algorithm is implemented in MALLET.
Error: in section 3.4, the statement "The constant s only changes when we update the hyperparameters α" is incorrect, as the number of words in the old topic and the new topic change by one. In fact, s must be updated before and after sampling a topic for each token, but this update takes a constant number of operations, regardless of the number of topics. This problem was only in the paper — the MALLET implementation has always been correct.
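The constant-time update described in this correction can be sketched as follows. The sketch assumes that s is the smoothing-only part of the sampling mass, s = Σ_t α_t β / (βV + n_t), where n_t is the number of tokens assigned to topic t, V is the vocabulary size, and β is a symmetric topic-word hyperparameter. This form and the function names are assumptions made for illustration, not code taken from MALLET.

```python
def smoothing_mass(alpha, beta, vocab_size, topic_totals):
    """Compute s from scratch: one term per topic, O(number of topics)."""
    return sum(a * beta / (beta * vocab_size + n)
               for a, n in zip(alpha, topic_totals))


def update_smoothing_mass(s, alpha_t, beta, vocab_size, old_total, new_total):
    """Adjust s after one topic's token total changes, in O(1).

    Only the term for the affected topic is swapped out, so the cost does
    not depend on the number of topics.
    """
    s -= alpha_t * beta / (beta * vocab_size + old_total)
    s += alpha_t * beta / (beta * vocab_size + new_total)
    return s
```

In a sampling loop, this update would be applied once when the token is removed from its old topic (that topic's total drops by one) and once when it is added to the newly sampled topic (that topic's total rises by one).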

Polylingual Topic Models. David Mimno, Hanna Wallach, Limin Yao, Jason Naradowsky and Andrew McCallum. Snowbird Learning Workshop, 2009, Clearwater, FL.

Classics in the Million Book Library. Gregory Crane, Alison Babeu, David Bamman, Thomas Breuel, Lisa Cerrato, Daniel Deckers, Anke Lüdeling, David Mimno, Rashmi Singhal, David A. Smith, Amir Zeldes. Digital Humanities Quarterly 3(1), Winter 2009. HTML

In October 2008, Google announced a settlement that will provide access to seven million scanned books while the number of books freely available under an open license from the Internet Archive exceeded one million. The collections and services that classicists have created over the past generation place them in a strategic position to exploit the potential of these collections. This paper concludes with research topics relevant to all humanists on converting page images to text, one language to another, and raw text into machine actionable data.

Gibbs Sampling for Logistic Normal Topic Models with Graph-Based Priors. David Mimno, Hanna Wallach and Andrew McCallum. NIPS Workshop on Analyzing Graphs, 2008, Whistler, BC. (one of five out of 22 papers selected for oral presentation) PDF

Dirichlet distributions are a mathematically tractable prior distribution for mixing proportions in Bayesian mixture models, but their convenience comes at the cost of flexibility and expressiveness. Previous work has suggested alternative priors such as logistic normal distributions, extending topic mixture models with covariance matrices and dynamic linear models, but this work has been limited to variational approximations. This paper presents a method for simple, robust Gibbs sampling in logistic normal topic models using an auxiliary variable scheme. Using this method, we extend previous models over linear chains to Gaussian Markov random field priors with arbitrarily structured graphs.

Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression. David Mimno and Andrew McCallum. UAI, 2008 (selected for plenary presentation) PDF data

Text documents are usually accompanied by metadata, such as the authors, the publication venue, the date, and any references. Work in topic modeling that has taken such information into account, such as Author-Topic, Citation-Topic, and Topic-over-Time models, has generally focused on constructing specific models that are suited only for one particular type of metadata. This paper presents a simple, unified model for learning topics from documents given arbitrary non-textual features, which can be discrete, categorical, or continuous.

Modeling Career Path Trajectories. David Mimno and Andrew McCallum. University of Massachusetts, Amherst Technical Report #2007-69, 2007. PDF

Descriptions of previous work experience in resumes are a valuable source of information about the structure of the job market and the economy. There is, however, a high degree of variability in these documents. Job titles are a particular problem, as they are often either overly sparse or overly general: 85% of job titles in our corpus occur only once, while the most common titles, such as "Consultant", are so broad as to be virtually meaningless. We use a hierarchical hidden state model to discover clusters of words that correspond to distinct skills, clusters of skills that correspond to jobs, and transition patterns between jobs.

Community-based Link Prediction with Text. David Mimno, Hanna Wallach, and Andrew McCallum. Statistical Network Modeling Workshop, NIPS, 2007, Whistler, BC.

Expertise Modeling for Matching Papers with Reviewers. David Mimno and Andrew McCallum. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2007, San Jose, CA. PDF Data

Science depends on peer review, but matching papers with reviewers is a challenging and time-consuming task. We compare several automatic methods for measuring the similarity between a submitted abstract and papers previously written by reviewers. These include a novel topic model that automatically divides an author's papers into topically coherent "personas".

Probabilistic Representations for Integrating Unreliable Data Sources. David Mimno, Andrew McCallum and Gerome Miklau. IIWeb workshop at AAAI 2007, Vancouver, BC, Canada. PDF

Mixtures of Hierarchical Topics with Pachinko Allocation. David Mimno, Wei Li and Andrew McCallum. International Conference on Machine Learning (ICML) 2007, Corvallis, OR. PDF

The four-level pachinko allocation model (PAM) (Li & McCallum, 2006) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents hierarchical PAM, an enhancement that explicitly represents a topic hierarchy. This model can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy. Experimental results show improvements in likelihood of held-out documents, as well as mutual information between automatically-discovered topics and human-generated categories such as journals.

Mining a digital library for influential authors. David Mimno and Andrew McCallum. Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada. PDF

Most digital libraries let you search for documents, but we often want to search for people as well. We extract and disambiguate author names from online research papers, weight papers using PageRank on the citation graph, and expand queries using a topic model. We evaluate the system by comparing people returned for the query "information retrieval" to recipients of major awards in IR.
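The paper-weighting step mentioned above, PageRank on a citation graph, can be sketched with plain power iteration. This is a generic sketch, not the system's code: the damping factor of 0.85, the fixed iteration count, and the function name `pagerank` are assumptions, and papers that cite nothing are handled by spreading their rank evenly over all papers.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    `citations` maps each paper to the list of papers it cites. Returns a
    dict of ranks that sum to 1.
    """
    nodes = set(citations) | {c for cited in citations.values() for c in cited}
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Rank held by papers with no outgoing citations is shared by all.
        dangling = sum(rank[p] for p in nodes if not citations.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in nodes}
        for paper, cited in citations.items():
            for c in cited:
                # Each paper splits its rank evenly among the papers it cites.
                new_rank[c] += damping * rank[paper] / len(cited)
        rank = new_rank
    return rank
```

Author-level scores could then be formed by aggregating the ranks of each author's papers, although how the system combines them is not specified here.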

Organizing the OCA: Learning faceted subjects from a library of digital books. David Mimno and Andrew McCallum. Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada. PDF

The Open Content Alliance is one of several large-scale digitization projects currently producing huge numbers of digital books. Statistical topic models are a natural choice for organizing and describing such large text corpora, but scalability becomes a problem when we are dealing with multi-billion word corpora. This paper presents a new method for topic modeling, DCM-LDA. In this model, we train an independent topic model for every book, using pages as "documents". We then gather the topics discovered, cluster them, and then fit a Dirichlet prior for each topic cluster. Finally, we retrain the individual book topic models using these new shared topics.

Beyond Digital Incunabula: Modeling the Next Generation of Digital Libraries. Gregory Crane, David Bamman, Lisa Cerrato, Alison Jones, David Mimno, Adrian Packel, D. Sculley, and Gabriel Weaver. European Conference on Digital Libraries (ECDL) 2006, Alicante, Spain. PDF

Several groups are currently embarking on large scale digitization projects, but are they producing anything more than lots of raw text? This paper argues that such an investment in digitization will be more valuable if accompanied by a parallel investment in highly structured resources such as dictionaries. Several examples, including some I worked on while at Perseus, illustrate this effect.

Bibliometric Impact Measures Leveraging Topic Analysis. Gideon Mann, David Mimno and Andrew McCallum. Joint Conference on Digital Libraries (JCDL) 2006, Chapel Hill, NC. PDF Powerpoint

When evaluating the impact of research papers, it's important to compare similar papers: a massively influential paper in Mathematics may be as well cited as a middling paper in Molecular Biology. We present a system that combines automatic citation analysis on spidered research papers with a new automatic topic model that is aware of multi-word terms. This system is capable of finding fine-grained sub-fields while scaling to the exponential increase in open-access publishing. We evaluate papers from the Rexa digital library using both traditional bibliometric statistics (substituting topics for journals) as well as several new metrics.

Hierarchical Catalog Records: Implementing a FRBR Catalog. David Mimno, Alison Jones and Gregory Crane. D-Lib Magazine, October 2005. HTML

Finding a Catalog: Generating Analytical Catalog Records from Well-structured Digital Texts. David Mimno, Alison Jones and Gregory Crane. Joint Conference on Digital Libraries (JCDL) 2005, Denver, CO. PDF.

Services for a Customizable Authority Linking Environment. Mark Patton and David Mimno. Demonstration at Joint Conference on Digital Libraries (JCDL) 2004, Tucson, AZ.

Towards a Cultural Heritage Digital Library. Gregory Crane, Clifford E. Wulfman, Lisa M. Cerrato, Anne Mahoney, Thomas L. Milbank, David Mimno, Jeffrey A. Rydberg-Cox, David A. Smith, and Christopher York. Joint Conference on Digital Libraries (JCDL) 2003, Houston, TX.