Computational Humanities. ed. Jessica Marie Johnson, Lauren Tilton, David Mimno. Debates in Digital Humanities, 2024.

Humanities and Human-Centered Machine Learning. Laure Thompson and David Mimno. In Human-centered Machine Learning, 2023. preprint

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity. Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito. arXiv

Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement. Rosamond Elizabeth Thalken, Edward Stiglitz, David Mimno, Matthew Wilkens. EMNLP, 2023.

Data Similarity is Not Enough to Explain Language Model Performance. Gregory Yauney, Emily Reif, David Mimno. EMNLP, 2023.

Hyperpolyglot LLMs: cross-lingual interpretability in token embeddings. Andrea W Wang, David Mimno. EMNLP, 2023.

The Chatbot and the Canon: Poetry Memorization in LLMs. Lyra D'Souza, David Mimno. CHR 2023.

T5 meets Tybalt: author attribution in Early Modern English drama using large language models. Rebecca Hicke, David Mimno. CHR 2023.

Contextualized Topic Coherence Metrics. Hamed Rahimi, Jacob Louis Hoover, David Mimno, Hubert Naacke, Camelia Constantin, Bernd Amann. arXiv

Honest students from untrusted teachers: Learning an interpretable question-answering pipeline from a pretrained language model. Jacob Eisenstein, Daniel Andor, Bernd Bohnet, Michael Collins, David Mimno. arXiv

Breaking BERT: Evaluating and Optimizing Sparsified Attention. Siddhartha Brahma, Polina Zablotskaia, David Mimno. Sparsity in Neural Networks, 2021 arXiv

Comparing text representations: A theory-driven approach. Gregory Yauney, David Mimno. EMNLP 2021. PDF

Tecnologica cosa: Modeling Storyteller Personalities in Boccaccio's Decameron. A Feder Cooper, Maria Antoniak, Christopher De Sa, Marilyn Migiel, David Mimno. LaTech-CLfL 2021. arXiv

On-the-Fly Rectification for Robust Large-Vocabulary Topic Inference. Moontae Lee, Sungjun Cho, Kun Dong, David Mimno, David Bindel. ICML, 2021 PDF

Separating the wheat from the chaff: A topic and keyword-based procedure for identifying research-relevant text. Alicia Eads, Alexandra Schofield, Fauna Mahootian, David Mimno, and Rens Wilderom. Poetics, 2021. PDF

Like Two Pis in a Pod: Author Similarity Across Time in the Ancient Greek Corpus. Grant Storey and David Mimno. Journal of Cultural Analytics. July 2020. HTML

Prior-aware Composition Inference for Spectral Topic Models. Moontae Lee, David Bindel, and David Mimno. AIStats 2020.

Combatting The Challenges of Local Privacy for Distributional Semantics with Compression. Alexandra Schofield, Gregory Yauney, and David Mimno. PriML Workshop, NeurIPS, 2019. PDF

Narrative Paths and Negotiation of Power in Birth Stories. Maria Antoniak, David Mimno, and Karen Levy. CSCW, 2019, Austin, TX. PDF

Online birth stories — people's narrative descriptions of giving birth — provide a valuable resource for the study of narrative. Each is unique, yet they all share a similar plot structure and cast of characters. Analyzing the language of "character" interactions provides a view into power and agency.

Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents. Jack Hessel, Lillian Lee, and David Mimno. EMNLP, 2019, Hong Kong, China. PDF

Datasets that pair images with text captions are relatively rare. It's much more common to find documents that have text and images but no explicit link between them. We show that, using only a large set of such multimodal documents, we can create a system that predicts connections between specific sentences and images.

Practical Correlated Topic Modeling and Analysis via the Rectified Anchor Word Algorithm. Moontae Lee, Sungjun Cho, David Bindel, and David Mimno. EMNLP, 2019, Hong Kong, China. PDF

Boosted negative sampling by quadratically constrained entropy maximization. Taygun Kekeç, David Mimno, and David M.J. Tax. Pattern Recognition Letters Volume 125, 1 July 2019, Pages 310-317. HTML PDF

Why does negative sampling work? We define some properties of optimal negative sampling distributions, and show that existing heuristic distributions are approximations to optimal distributions.
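The best-known heuristic is word2vec's noise distribution: unigram counts raised to the 3/4 power, which flattens the frequency distribution. A minimal sketch of that heuristic (not the optimal distributions derived in the paper):

```python
from collections import Counter

def noise_distribution(tokens, power=0.75):
    """word2vec-style heuristic: unigram counts raised to the 3/4 power,
    then normalized into a sampling distribution."""
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

# Flattening effect: raw counts 81:16:1 become sampling weights 27:8:1,
# so rare words are drawn as negatives more often than raw frequency suggests.
corpus = ["the"] * 81 + ["cat"] * 16 + ["sat"]
dist = noise_distribution(corpus)
```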

Computational Cut-Ups: The Influence of Dada. Laure Thompson and David Mimno. The Journal of Modern Periodical Studies vol. 8, no. 2, 2018, pp. 179–195. JSTOR preprint

Can we produce a Dada reading of Dada itself? We used a pre-trained convolutional neural network to generate a computational "definition" of Dada visual art from a collection of modernist journals. Looking at the "success" and "failure" of the resulting image classifier provides insights into the Dada aesthetic, but also into the visual distinctions that are afforded by CNNs.

Authorless Topic Models: Biasing Models Away from Known Structure. Laure Thompson and David Mimno. COLING, 2018. [Winner: Best NLP Engineering Experiment] PDF Code

Unsupervised clustering methods find the strongest patterns in collections. But what if the clearest pattern is something you already know? For example, topic models trained on novels don't find themes, they find book series. We show that you can predict which words correlate with known metadata, and that simple, easily auditable transformations to input text can change model output in useful, predictable ways.

Quantifying the Visual Concreteness of Words and Topics in Multimodal Datasets. Jack Hessel, David Mimno, and Lillian Lee. NAACL 2018. PDF

We expect images and text to convey related but distinct information, but how much connection is there, and how much does it vary? Certain words map more directly to image representations than others: it's a lot easier to predict the appearance of a picture marked "squirrel" than "beautiful".

Evaluating the Stability of Embedding-based Word Similarities. Maria Antoniak and David Mimno. TACL (6) 2018, 107--119 PDF

Word embeddings are unstable, especially for smaller collections. Lists of most-similar words can vary considerably between random initializations, even for methods that appear deterministic. For more reliable results, train embeddings on at least 25 bootstrap-sampled corpora and average the resulting word similarities.
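The bootstrap procedure can be sketched as follows. A trivial co-occurrence-count "trainer" stands in for a real embedding algorithm here; that stand-in is an assumption for illustration only, and a real replication would train word2vec or GloVe on each resampled corpus:

```python
import math
import random
from collections import Counter, defaultdict

def count_vectors(sentences, window=2):
    """Stand-in 'embedding' trainer: sparse co-occurrence count vectors.
    (A real pipeline would train word2vec or GloVe here.)"""
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            for c in sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]:
                vecs[w][c] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def bootstrap_similarity(sentences, w1, w2, n_boot=25, seed=0):
    """Average w1/w2 similarity over vectors built from bootstrap-resampled corpora."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_boot):
        sample = [rng.choice(sentences) for _ in sentences]  # resample with replacement
        vecs = count_vectors(sample)
        if w1 in vecs and w2 in vecs:
            sims.append(cosine(vecs[w1], vecs[w2]))
    return sum(sims) / len(sims) if sims else 0.0
```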

Quantifying the Effects of Text Duplication on Semantic Models. Alexandra Schofield, Laure Thompson, and David Mimno. EMNLP 2017 PDF

The Strange Geometry of Skip-Gram with Negative Sampling. David Mimno and Laure Thompson. EMNLP 2017 PDF (best paper honorable mention)

Most people assume that word embedding vectors are determined by semantics. In fact, in the skip-gram with negative sampling (SGNS) algorithm popularized by word2vec, the negative sampling objective dominates, resulting in vectors that lie within a narrow cone.

Applications of Topic Models. Jordan Boyd-Graber, Yuening Hu, and David Mimno. Foundations and Trends in Information Retrieval, Now Publishers. 2017.

Comparing Grounded Theory and Topic Modeling: Extreme Divergence or Unlikely Convergence? Eric P. S. Baumer, David Mimno, Shion Guha, Emily Quan, Geri K. Gay. JASIST 68(6) 2017. PDF

We draw parallels between two methods for building high-level concepts that are grounded in documents. Both provide a useful perspective on the other.

Pulling Out the Stops: Rethinking Stopword Removal for Topic Models. Alexandra Schofield, Måns Magnusson, and David Mimno. EACL 2017 PDF

Removing the dozen or so most frequent words in a corpus seems to have a big effect on topic models. Beyond that, however, it's hard to tell the difference between models that had stopwords removed before and after training.

The Tell-Tale Hat: Surfacing the Uncertainty in Folklore Classification. Peter M. Broadwell, David Mimno and Timothy R. Tangherlini. Cultural Analytics, February 2017. HTML

Cats and Captions vs. Creators and the Clock: Comparing Multimodal Content to Context in Predicting Relative Popularity. Jack Hessel, Lillian Lee, and David Mimno. WWW 2017

If you want to predict the popularity of online content, you need to consider social effects and the time of posting. In some cases a matter of seconds can be significant.

Beyond Exchangeability: The Chinese Voting Process. Moontae Lee, Seok Hyun Jin, and David Mimno. NIPS 2016

Helpfulness votes in Amazon and StackExchange forums are sensitive to social pressure. Accounting for forum-specific biases helps identify over- and under-rated content, and quantifies forum culture.

Comparing Apples to Apple: The Effects of Stemmers on Topic Models. Alexandra Schofield and David Mimno. Transactions of the Association for Computational Linguistics 4 (2016): 287-300. PDF

Using a stemmer is unlikely to improve topic model results. If you're worried about displaying small variations of the same word over and over, use the stemmer after training to group words together for display.
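A sketch of that post-hoc grouping, using a deliberately naive suffix-stripper in place of a real stemmer such as Porter (the suffix list is an illustrative assumption):

```python
from collections import defaultdict

def naive_stem(word):
    """Toy suffix-stripper for illustration; use a real stemmer in practice."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def group_topic_words(top_words):
    """Group a topic's top words by stem AFTER training, so that small
    variants of the same word collapse into one display entry."""
    groups = defaultdict(list)
    for w in top_words:  # assumed sorted by topic probability
        groups[naive_stem(w)].append(w)
    return list(groups.values())

print(group_topic_words(["model", "models", "modeling", "topic", "topics", "word"]))
# → [['model', 'models', 'modeling'], ['topic', 'topics'], ['word']]
```

The model itself still sees unstemmed tokens, so its estimates are unaffected; only the display changes.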

Machine Learning and Grounded Theory Method: Convergence, Divergence, and Combination. Michael Muller, Shion Guha, Eric P. S. Baumer, David Mimno, and N. Sadat Shami. GROUP 2016. PDF

Missing Photos, Suffering Withdrawal, or Finding Freedom? How Experiences of Social Media Non-Use Influence the Likelihood of Reversion. Eric P. S. Baumer, Shion Guha, Emily Quan, David Mimno, Geri K. Gay. Social Media and Society. November 2015. HTML

Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure. David Mimno, David M Blei, Barbara E Engelhardt. Proceedings of the National Academy of Sciences. 112(26):E3341–50, 2015. PNAS arXiv preprint

Admixture models represent the alleles in a genome as a combination of latent ancestral populations. We apply posterior predictive checks to evaluate the quality of fit with respect to functions of interest to population biologists.

How Social Media Non-use Influences the Likelihood of Reversion: Self Control, Being Surveilled, Feeling Freedom, and Socially Connecting. Eric P. S. Baumer, Shion Guha, Emily Quan, David Mimno, and Geri Gay. Social Media and Society, 2015.

Robust Spectral Inference for Joint Stochastic Matrix Factorization. Moontae Lee, David Bindel, and David Mimno. NIPS, 2015, Montreal, QC, Canada.

Evaluation methods for unsupervised word embeddings. Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. EMNLP, 2015, Lisbon, Portugal. PDF Data

Word embeddings appear to represent meaning with geometry. But evaluating this property has been difficult. Getting humans to rate the semantic similarity of words is error-prone and time consuming. We use a simpler method to collect human judgments based on pairwise comparisons and odd-one-out detection tasks.
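A minimal sketch of the odd-one-out task, with toy two-dimensional vectors as an assumption: the intruder is the word least similar, on average, to the others.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def odd_one_out(words, vectors):
    """Pick the word least similar, on average, to the rest (the 'intruder')."""
    def avg_sim(w):
        return sum(cosine(vectors[w], vectors[o]) for o in words if o != w) / (len(words) - 1)
    return min(words, key=avg_sim)

# Toy vectors: two 'animal' words plus an intruder.
vecs = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "carburetor": [0.0, 1.0]}
print(odd_one_out(["cat", "dog", "carburetor"], vecs))  # → carburetor
```

If the model's intruder matches the human-inserted one, the embedding agrees with human judgment on that item.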

Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements Jordan Boyd-Graber, David Mimno, and David Newman. In Handbook of Mixed Membership Models and Their Applications, CRC/Chapman Hall, 2014. PDF

A description of common pre-processing steps and model checking diagnostics.

Low-dimensional Embeddings for Interpretable Anchor-based Topic Inference Moontae Lee and David Mimno. EMNLP, 2014, Doha, Qatar. (Selected for oral presentation). PDF

In this work we trade an approximate solution to an exact problem for an exact solution to an approximation. We use a proven method in data visualization, the t-SNE projection, to compress a high-dimensional word co-occurrence space to a visualizable two- or three-dimensional space, and then find an exact convex hull. The corners of this convex hull become the anchor words for topics. We find better topics with more salient anchor words, while also improving the interpretability of the algorithm.
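The hull step can be sketched in pure Python with Andrew's monotone chain algorithm. The two-dimensional coordinates below are hypothetical stand-ins for an actual t-SNE projection of the word co-occurrence space:

```python
def convex_hull(points):
    """Andrew's monotone chain: exact 2-D convex hull in O(n log n)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# Hypothetical 2-D t-SNE coordinates for five words; the hull corners
# (every point except the interior one) become candidate anchor words.
coords = {"apple": (0.0, 0.0), "bank": (4.0, 0.0), "cell": (4.0, 3.0),
          "data": (0.0, 3.0), "each": (2.0, 1.5)}
hull = set(convex_hull(list(coords.values())))
anchors = sorted(w for w, p in coords.items() if p in hull)
```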

Significant Themes in 19th-Century Literature Matthew L. Jockers, David Mimno. Poetics, Dec 2013. Preprint

Models of literature are usually used for exploratory data analysis, but they can also be used to evaluate specific conjectures. We use permutation tests, bootstrap tests, and posterior predictive checks to test some hypotheses about associations between gender, anonymity, and literary themes.

Random Projections for Anchor-based Topic Inference David Mimno. NIPS workshop on Randomized Methods, 2013. PDF

Random projections allow us to scale anchor-based topic-finding algorithms to large vocabularies. Projections with structured sparsity, for example with the number of non-zeros in each row of the projection matrix held fixed, produce better results.
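A sketch of such a structured-sparse projection, assuming a k × V matrix with ±1 entries and a fixed number of non-zeros per row; the entry values and lack of scaling are illustrative choices, not the paper's exact construction:

```python
import random

def sparse_projection(n_rows, n_cols, nnz_per_row, seed=0):
    """Random +/-1 projection matrix with a FIXED number of non-zeros
    in each row (structured sparsity)."""
    rng = random.Random(seed)
    matrix = []
    for _ in range(n_rows):
        row = [0.0] * n_cols
        for j in rng.sample(range(n_cols), nnz_per_row):  # exactly nnz_per_row positions
            row[j] = rng.choice((-1.0, 1.0))
        matrix.append(row)
    return matrix

def project(vec, matrix):
    """Map a vocabulary-sized count vector down to len(matrix) dimensions."""
    return [sum(r[j] * vec[j] for j in range(len(vec))) for r in matrix]
```

Fixing the non-zeros per row guarantees every low-dimensional coordinate draws on the same number of vocabulary entries, unlike i.i.d. sparse projections where row density varies.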

A Practical Algorithm for Topic Modeling with Provable Guarantees Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu. ICML, 2013, Atlanta, GA. (Selected for long-form presentation). PDF

Spectral algorithms for LDA have been useful for proving bounds on learning, but it hasn't been clear that they are practical. This paper presents an algorithm that both maintains theoretical guarantees and provides extremely fast inference. We compared this new algorithm directly to standard MCMC methods on a number of metrics for synthetic and real data.

Scalable Inference of Overlapping Communities Prem Gopalan, David Mimno, Sean Gerrish, Michael J. Freedman, David Blei. NIPS, 2012, Lake Tahoe, NV. (Selected for spotlight presentation)

Sparse stochastic inference for latent Dirichlet allocation David Mimno, Matthew Hoffman and David Blei. ICML, 2012, Edinburgh, Scotland. (Selected for long-form presentation). PDF

Gibbs sampling can be fast if data is sparse, but doesn't scale because it requires us to keep a state variable for every data point. Online stochastic inference can be fast and uses constant memory, but doesn't scale because it can't leverage sparsity. We present a method that uses Gibbs sampling in the local step of a stochastic variational algorithm. The resulting method can process a 33 billion word corpus of 1.2 million books with thousands of topics on a single CPU.

Computational Historiography: Data Mining in a Century of Classics Journals David Mimno. ACM J. of Computing in Cultural Heritage. 5, 1, Article 3 (April 2012), 19 pages. PDF

Topic Models for Taxonomies Anton Bakalov, Andrew McCallum, Hanna Wallach, and David Mimno. Joint Conference on Digital Libraries (JCDL) 2012, Washington, DC. PDF

Database of NIH grants using machine-learned categories and graphical clustering Edmund M Talley, David Newman, David Mimno, Bruce W Herr II, Hanna M Wallach, Gully A P C Burns, A G Miriam Leenders and Andrew McCallum, Nature Methods, Volume 8(7), June 2011, pp. 443--444. HTML PDF

What does the NIH fund, and how are scientific disciplines divided between institutes? In this paper we created a visualization of 100,000 accepted proposals and the 200,000 journal publications associated with those grants.

Reconstructing Pompeian Households David Mimno. UAI, 2011, Barcelona, Spain. (selected for oral presentation) PDF

Houses in Pompeii have several architecturally distinct types of rooms, but it's not always clear what the function of these rooms was, or even if there was a consistent pattern of use across different houses. This work uses statistical models to predict the artifacts found in different rooms.

Bayesian Checking for Topic Models David Mimno, David Blei. EMNLP, 2011, Edinburgh, Scotland. (selected for oral presentation) PDF

This paper measures the degree to which data fits the assumptions of topic models. Topic models represent documents as mixtures of simple, static multinomial distributions over words. We know that real documents exhibit "burstiness": when a word occurs in a document, it tends to occur many times. In this paper, we use a method from Bayesian model checking, posterior predictive checks, to measure the difference between the burstiness we observed and the expectation of the model. We then use this method to search for clusterings of documents, based on time or other observed groupings, that best explain the observed burstiness.
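A toy version of the check, using the fraction of a word's occurrences that are repeats as the discrepancy function and independent token draws as the model's assumption; the paper's discrepancy functions and models are richer than this sketch:

```python
import random

def repeat_fraction(counts):
    """Discrepancy: fraction of the word's occurrences that are repeats.
    A bursty word racks up many repeats within few documents."""
    total = sum(counts)
    repeats = sum(c - 1 for c in counts if c > 1)
    return repeats / total if total else 0.0

def predictive_check(counts, doc_lens, n_rep=500, seed=0):
    """Compare the observed discrepancy to replicates drawn under the model's
    assumption that each token is the word independently with probability p."""
    rng = random.Random(seed)
    p = sum(counts) / sum(doc_lens)
    observed = replicated = repeat_fraction(counts)
    observed = repeat_fraction(counts)
    reps = []
    for _ in range(n_rep):
        rep = [sum(rng.random() < p for _ in range(n)) for n in doc_lens]
        reps.append(repeat_fraction(rep))
    # p-value-like score: how often a replicate is at least as bursty as the data.
    return observed, sum(r >= observed for r in reps) / n_rep

# A bursty word: 12 occurrences crammed into two of six 100-token documents.
obs, score = predictive_check([6, 6, 0, 0, 0, 0], [100] * 6)
```

A small score means the observed burstiness is more extreme than the model expects, i.e. the multinomial assumption is a poor fit for this word.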

Optimizing Semantic Coherence in Topic Models David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, Andrew McCallum. EMNLP, 2011, Edinburgh, Scotland. (selected for oral presentation) PDF

We introduce a metric for detecting semantic errors in topic models and develop a completely unsupervised model that specifically tries to improve this metric. Topic models provide a useful method for organizing large document collections into a small number of meaningful word clusters. In practice, however, many topics contain obvious semantic errors that may not reduce predictive power, but significantly weaken user confidence. We find that measuring the probability that lower-ranked words in a topic co-occur in documents with higher-ranked words beats all current methods for detecting a large class of low-quality topics.
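The metric is simple to state: for each pair of top words, score how often the documents containing the lower-ranked word also contain the higher-ranked word. A sketch, assuming every top word appears in at least one document:

```python
import math

def topic_coherence(top_words, documents):
    """Coherence score: sum over word pairs of log((D(wi, wj) + 1) / D(wj)),
    where D counts containing documents and wj ranks above wi in the topic."""
    doc_sets = [set(d) for d in documents]
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            d_j = sum(wj in d for d in doc_sets)       # docs with the higher-ranked word
            d_ij = sum(wi in d and wj in d for d in doc_sets)  # docs with both
            score += math.log((d_ij + 1) / d_j)
    return score

docs = [["topic", "model", "word"], ["topic", "model"], ["cat", "dog"]]
good = topic_coherence(["topic", "model"], docs)  # top words that co-occur
bad = topic_coherence(["topic", "cat"], docs)     # top words that never do
```

Higher (less negative) scores indicate more coherent topics; the +1 smoothing keeps never-co-occurring pairs from driving the score to negative infinity.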

Measuring Confidence in Temporal Topic Models with Posterior Predictive Checks David Mimno, David Blei. NIPS Workshop on Computational Social Science and the Wisdom of Crowds, 2010, Whistler, BC.

Rethinking LDA: Why Priors Matter Hanna Wallach, David Mimno and Andrew McCallum. NIPS, 2009, Vancouver, BC. PDF Supplementary Material

Empirically, we have found that optimizing Dirichlet hyperparameters for document-topic distributions in topic models makes a huge difference: topics are not dominated by very common words and topics are more stable as the number of topics increases. In this paper we explore the effects of Dirichlet priors on topic models. The best structure seems to be an asymmetric prior over document-topic distributions and a symmetric prior over topic-word distributions, currently implemented in the MALLET toolkit.

Reconstructing Pompeian Households David Mimno. Applications of Topic Models Workshop, NIPS 2009, Whistler, BC. House data Artifact data PDF (selected for oral presentation)

Pompeii provides a unique view into daily life in a Roman city, but the evidence is noisy and incomplete. This work applies statistical data mining methods originating in text analysis to a database of artifacts found in 30 houses in Pompeii.

Polylingual Topic Models David Mimno, Hanna Wallach, Jason Naradowsky, David Smith and Andrew McCallum. EMNLP, 2009, Singapore. PDF

Standard statistical topic models do not handle multiple languages well, but many important corpora -- particularly outside scientific publications -- contain a mix of many languages. We show that with simple modifications, topic models can leverage not only direct translations but also comparable collections like Wikipedia articles. We demonstrate the system on European parliament proceedings in 12 languages and comparable Wikipedia articles in 14 languages. Code is available in the cc.mallet.topics.PolylingualTopicModel class in the MALLET toolkit.

Evaluation Methods for Topic Models Hanna Wallach, Iain Murray, Ruslan Salakhutdinov and David Mimno. ICML, 2009, Montreal, Quebec. PDF

Held-out likelihood experiments provide an important complement to task-specific evaluations in topic models. We evaluate several methods for calculating held-out likelihoods. Several previously used methods, especially the harmonic mean method, show poor accuracy and high variance compared to a "Chib-style" method and a particle filter-inspired method.

Efficient Methods for Topic Model Inference on Streaming Document Collections Limin Yao, David Mimno and Andrew McCallum. KDD, 2009, Paris, France. PDF slides on fast sampling

Statistical topic modeling has become popular in text processing, but remains computationally intensive. It is often impossible to run standard inference methods on collections because of limited space (e.g., large IR corpora) and time (e.g., streaming corpora). In this paper we evaluate a number of methods for lightweight online topic inference, based on models trained from computationally expensive offline processes. In addition, we present SparseLDA, a new data structure and algorithm for Gibbs sampling in multinomial mixture models (such as LDA) that offers substantial improvements in speed and memory usage. A parallelized version of this algorithm is implemented in MALLET.
Error: in section 3.4, the statement "The constant s only changes when we update the hyperparameters α" is incorrect, as the number of words in the old topic and the new topic change by one. In fact, s must be updated before and after sampling a topic for each token, but this update takes a constant number of operations, regardless of the number of topics. This problem was only in the paper — the MALLET implementation has always been correct.

Polylingual Topic Models David Mimno, Hanna Wallach, Limin Yao, Jason Naradowsky and Andrew McCallum. Snowbird Learning Workshop, 2009, Clearwater, FL.

Classics in the Million Book Library Gregory Crane, Alison Babeu, David Bamman, Thomas Breuel, Lisa Cerrato, Daniel Deckers, Anke Lüdeling, David Mimno, Rashmi Singhal, David A. Smith, Amir Zeldes. Digital Humanities Quarterly 3(1), Winter 2009. HTML

In October 2008, Google announced a settlement that will provide access to seven million scanned books, while the number of books freely available under an open license from the Internet Archive exceeded one million. The collections and services that classicists have created over the past generation place them in a strategic position to exploit the potential of these collections. This paper concludes with research topics relevant to all humanists on converting page images to text, one language to another, and raw text into machine-actionable data.

Gibbs Sampling for Logistic Normal Topic Models with Graph-Based Priors David Mimno, Hanna Wallach and Andrew McCallum. NIPS Workshop on Analyzing Graphs, 2008, Whistler, BC. (one of five out of 22 papers selected for oral presentation) PDF

Dirichlet distributions are a mathematically tractable prior distribution for mixing proportions in Bayesian mixture models, but their convenience comes at the cost of flexibility and expressiveness. Previous work has suggested alternative priors such as logistic normal distributions, extending topic mixture models with covariance matrices and dynamic linear models, but this work has been limited to variational approximations. This paper presents a method for simple, robust Gibbs sampling in logistic normal topic models using an auxiliary variable scheme. Using this method, we extend previous models over linear chains to Gaussian Markov random field priors with arbitrarily structured graphs.

Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression David Mimno and Andrew McCallum. UAI, 2008 (selected for plenary presentation) PDF data

Text documents are usually accompanied by metadata, such as the authors, the publication venue, the date, and any references. Work in topic modeling that has taken such information into account, such as Author-Topic, Citation-Topic, and Topic-over-Time models, has generally focused on constructing specific models that are suited only for one particular type of metadata. This paper presents a simple, unified model for learning topics from documents given arbitrary non-textual features, which can be discrete, categorical, or continuous.
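The core of the model is a log-linear prior: each document's Dirichlet parameters over topics are a function of its features, alpha_dk = exp(x_d · lambda_k). A sketch with hypothetical features and hand-set weights; in the real model the lambda values are learned:

```python
import math

def dmr_priors(features, lam):
    """Dirichlet-multinomial regression priors: each document's Dirichlet
    parameter for topic k is exp(x_d . lambda_k), log-linear in its features."""
    priors = []
    for x in features:
        priors.append([math.exp(sum(xf * lf for xf, lf in zip(x, lk))) for lk in lam])
    return priors

# Two hypothetical features (intercept, is-recent) and two topics.
lam = [[0.0, 1.0],    # topic 0: boosted in recent documents
       [0.0, -1.0]]   # topic 1: damped in recent documents
alphas = dmr_priors([[1.0, 1.0],   # a recent document
                     [1.0, 0.0]],  # an older document
                    lam)
```

Because the features enter only through the prior, any mix of binary, categorical (one-hot), or continuous metadata plugs into the same model.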

Modeling Career Path Trajectories David Mimno and Andrew McCallum. University of Massachusetts, Amherst Technical Report #2007-69, 2007. PDF

Descriptions of previous work experience in resumes are a valuable source of information about the structure of the job market and the economy. There is, however, a high degree of variability in these documents. Job titles are a particular problem, as they are often either overly sparse or overly general: 85% of job titles in our corpus occur only once, while the most common titles, such as "Consultant", are so broad as to be virtually meaningless. We use a hierarchical hidden state model to discover clusters of words that correspond to distinct skills, clusters of skills that correspond to jobs, and transition patterns between jobs.

Community-based Link Prediction with Text David Mimno, Hanna Wallach, and Andrew McCallum. Statistical Network Modeling Workshop, NIPS, 2007, Whistler, BC.

Expertise Modeling for Matching Papers with Reviewers David Mimno and Andrew McCallum. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2007, San Jose, CA. PDF Data

Science depends on peer review, but matching papers with reviewers is a challenging and time consuming task. We compare several automatic methods for measuring the similarity between a submitted abstract and papers previously written by reviewers. These include a novel topic model that automatically divides an author's papers into topically coherent "personas".

Probabilistic Representations for Integrating Unreliable Data Sources David Mimno, Andrew McCallum and Gerome Miklau. IIWeb workshop at AAAI 2007, Vancouver, BC, Canada. PDF

Mixtures of Hierarchical Topics with Pachinko Allocation. David Mimno, Wei Li and Andrew McCallum. International Conference on Machine Learning (ICML) 2007, Corvallis, OR. PDF

The four-level pachinko allocation model (PAM) (Li & McCallum, 2006) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents hierarchical PAM — an enhancement that explicitly represents a topic hierarchy. This model can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy. Experimental results show improvements in likelihood of held-out documents, as well as mutual information between automatically-discovered topics and human-generated categories such as journals.

Mining a digital library for influential authors. David Mimno and Andrew McCallum. Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada. PDF

Most digital libraries let you search for documents, but we often want to search for people as well. We extract and disambiguate author names from online research papers, weight papers using PageRank on the citation graph, and expand queries using a topic model. We evaluate the system by comparing people returned for the query "information retrieval" to recipients of major awards in IR.
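The paper-weighting step can be sketched with textbook power-iteration PageRank; the damping factor and the toy citation graph here are illustrative assumptions:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over a citation graph {paper: [cited papers]}."""
    nodes = set(links) | {v for vs in links.values() for v in vs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        # Redistribute rank from papers that cite nothing (dangling nodes).
        dangling = sum(rank[n] for n in nodes if not links.get(n))
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        rank = new
    return rank

# Toy citation graph: two papers both cite one classic.
ranks = pagerank({"a": ["classic"], "b": ["classic"], "classic": []})
```

An author's score then aggregates the ranks of their disambiguated papers, so one heavily cited classic outweighs many uncited papers.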

Organizing the OCA: Learning faceted subjects from a library of digital books. David Mimno and Andrew McCallum. Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada. PDF

The Open Content Alliance is one of several large-scale digitization projects currently producing huge numbers of digital books. Statistical topic models are a natural choice for organizing and describing such large text corpora, but scalability becomes a problem when we are dealing with multi-billion word corpora. This paper presents a new method for topic modeling, DCM-LDA. In this model, we train an independent topic model for every book, using pages as "documents". We then gather the topics discovered, cluster them, and then fit a Dirichlet prior for each topic cluster. Finally, we retrain the individual book topic models using these new shared topics.

Beyond Digital Incunabula: Modeling the Next Generation of Digital Libraries. Gregory Crane, David Bamman, Lisa Cerrato, Alison Jones, David Mimno, Adrian Packel, D. Sculley, and Gabriel Weaver. European Conference on Digital Libraries (ECDL) 2006, Alicante, Spain. PDF

Several groups are currently embarking on large scale digitization projects, but are they producing anything more than lots of raw text? This paper argues that such an investment in digitization will be more valuable if accompanied by a parallel investment in highly structured resources such as dictionaries. Several examples, including some I worked on while at Perseus, illustrate this effect.

Bibliometric Impact Measures Leveraging Topic Analysis. Gideon Mann, David Mimno and Andrew McCallum. Joint Conference on Digital Libraries (JCDL) 2006, Chapel Hill, NC. PDF Powerpoint

When evaluating the impact of research papers, it's important to compare similar papers: a massively influential paper in Mathematics may be as well cited as a middling paper in Molecular Biology. We present a system that combines automatic citation analysis on spidered research papers with a new automatic topic model that is aware of multi-word terms. This system is capable of finding fine-grained sub-fields while scaling to the exponential increase in open-access publishing. We evaluate papers from the Rexa digital library using both traditional bibliometric statistics (substituting topics for journals) as well as several new metrics.

Hierarchical Catalog Records: Implementing a FRBR Catalog. David Mimno, Alison Jones and Gregory Crane. DLib, October 2005. HTML

Finding a Catalog: Generating Analytical Catalog Records from Well-structured Digital Texts. David Mimno, Alison Jones and Gregory Crane. Joint Conference on Digital Libraries (JCDL) 2005, Denver, CO. PDF.

Services for a Customizable Authority Linking Environment. Mark Patton and David Mimno. demonstration at Joint Conference on Digital Libraries (JCDL) 2004, Tucson, AZ.

Towards a Cultural Heritage Digital Library. Gregory Crane, Clifford E. Wulfman, Lisa M. Cerrato, Anne Mahoney, Thomas L. Milbank, David Mimno, Jeffrey A. Rydberg-Cox, David A. Smith, and Christopher York. Joint Conference on Digital Libraries (JCDL) 2003, Houston, TX.