Computational Humanities. ed. Jessica Marie Johnson, Lauren Tilton, David Mimno. Debates in Digital Humanities, 2024.
Humanities and Human-Centered Machine Learning. Laure Thompson and David Mimno. In Human-centered Machine Learning, 2023. preprint
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity. Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito. arXiv
Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement. Rosamond Elizabeth Thalken, Edward Stiglitz, David Mimno, Matthew Wilkens. EMNLP, 2023.
Data Similarity is Not Enough to Explain Language Model Performance. Gregory Yauney, Emily Reif, David Mimno. EMNLP, 2023.
Hyperpolyglot LLMs: cross-lingual interpretability in token embeddings. Andrea W Wang, David Mimno. EMNLP, 2023.
The Chatbot and the Canon: Poetry Memorization in LLMs. Lyra D'Souza, David Mimno. CHR 2023.
T5 meets Tybalt: author attribution in Early Modern English drama using large language models. Rebecca Hicke, David Mimno. CHR 2023.
Contextualized Topic Coherence Metrics. Hamed Rahimi, Jacob Louis Hoover, David Mimno, Hubert Naacke, Camelia Constantin, Bernd Amann. arXiv
Honest students from untrusted teachers: Learning an interpretable question-answering pipeline from a pretrained language model. Jacob Eisenstein, Daniel Andor, Bernd Bohnet, Michael Collins, David Mimno. arXiv
Breaking BERT: Evaluating and Optimizing Sparsified Attention. Siddhartha Brahma, Polina Zablotskaia, David Mimno. Sparsity in Neural Networks, 2021 arXiv
Comparing text representations: A theory-driven approach. Gregory Yauney, David Mimno. EMNLP 2021. PDF
Tecnologica cosa: Modeling Storyteller Personalities in Boccaccio's Decameron. A Feder Cooper, Maria Antoniak, Christopher De Sa, Marilyn Migiel, David Mimno. LaTech-CLfL 2021. arXiv
On-the-Fly Rectification for Robust Large-Vocabulary Topic Inference. Moontae Lee, Sungjun Cho, Kun Dong, David Mimno, David Bindel. ICML, 2021
PDF
Separating the wheat from the chaff: A topic and keyword-based procedure for identifying research-relevant text. Alicia Eads, Alexandra Schofield, Fauna Mahootian, David Mimno, and Rens Wilderom. Poetics, 2021. PDF
Like Two Pis in a Pod: Author Similarity Across Time in the Ancient Greek Corpus. Grant Storey and David Mimno. Journal of Cultural Analytics. July 2020. HTML
Prior-aware Composition Inference for Spectral Topic Models. Moontae Lee, David Bindel, and David Mimno. AIStats 2020.
Combatting The Challenges of Local Privacy for Distributional Semantics with Compression. Alexandra Schofield, Gregory Yauney, and David Mimno. PriML Workshop, NeurIPS, 2019. PDF
Narrative Paths and Negotiation of Power in Birth Stories.
Maria Antoniak, David Mimno, and Karen Levy. CSCW, 2019, Austin, TX. PDF
Online birth stories — people's narrative descriptions of giving birth — provide a valuable resource for the study of narrative. Each is unique, yet they all share a similar plot structure and cast of characters. Analyzing the language of "character" interactions provides a view into power and agency.
Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents.
Jack Hessel, Lillian Lee, and David Mimno.
EMNLP, 2019, Hong Kong, China. PDF
Datasets that have text captions for images are relatively rare.
It's much more common to find documents that have text, and images, but
no specific link between them. We show that using only a large set of
such multi-modal documents, we can create a system that can predict
connections between specific sentences and images.
Practical Correlated Topic Modeling and Analysis via the Rectified Anchor Word Algorithm.
Moontae Lee, Sungjun Cho, David Bindel, and David Mimno.
EMNLP, 2019, Hong Kong, China. PDF
Boosted negative sampling by quadratically constrained entropy maximization.
Taygun Kekeç, David Mimno, and David M.J. Tax.
Pattern Recognition Letters
Volume 125, 1 July 2019, Pages 310-317.
HTML PDF
Why does negative sampling work? We define some properties of optimal negative sampling distributions, and show that existing heuristic distributions are approximations to optimal distributions.
Computational Cut-Ups: The Influence of Dada.
Laure Thompson and David Mimno.
The Journal of Modern Periodical Studies
vol. 8, no. 2, 2018, pp. 179–195.
JSTOR
preprint
Can we produce a Dada reading of Dada itself? We used a pre-trained convolutional neural network to generate a computational "definition" of Dada visual art from a collection of modernist journals. Looking at the "success" and "failure" of the resulting image classifier provides insights into the Dada aesthetic, but also into the visual distinctions that are afforded by CNNs.
Authorless Topic Models: Biasing Models Away from Known Structure.
Laure Thompson and David Mimno.
COLING, 2018.
[Winner: Best NLP Engineering Experiment]
PDF Code
Unsupervised clustering methods find the strongest patterns in collections. But what if the clearest pattern is something you already know? For example, topic models trained on novels don't find themes, they find book series. We show that you can predict which words correlate with known metadata, and that simple, easily auditable transformations to input text can change model output in useful, predictable ways.
Quantifying the Visual Concreteness of Words and Topics in Multimodal Datasets. Jack Hessel, David Mimno, and Lillian Lee. NAACL 2018.
PDF
We expect images and text to convey related but distinct information, but how much connection is there, and how much does it vary? Certain words map more directly to image representations than others: it's a lot easier to predict the appearance of a picture marked "squirrel" than "beautiful".
Evaluating the Stability of Embedding-based Word Similarities.
Maria Antoniak and David Mimno.
TACL (6) 2018, 107--119 PDF
Word embeddings are unstable, especially for smaller collections.
Lists of most-similar words can vary considerably between random initializaions, even for methods that appear deterministic.
For more reliable results, train embeddings on at least 25 bootstrap-sampled corpora and calculate average word similarities.
Quantifying the Effects of Text Duplication on Semantic Models.
Alexandra Schofield, Laure Thompson, and David Mimno.
EMNLP 2017 PDF
The Strange Geometry of Skip-Gram with Negative Sampling.
David Mimno and Laure Thompson.
EMNLP 2017 PDF (best paper honorable mention)
Most people assume that word embedding vectors are determined by semantics.
In fact, in the popular SGNS algorithm popularized by word2vec the negative sampling
objective dominates, resulting in vectors that lie within a narrow cone.
Applications of Topic Models.
Jordan Boyd-Graber, Yuening Hu, and David Mimno.
Foundations and Trends in Information Retrieval, Now Publishers. 2017.
Comparing Grounded Theory and Topic Modeling: Extreme Divergence or Unlikely Convergence? Eric P. S. Baumer, David Mimno, Shion Guha, Emily Quan, Geri K. Gay. JASIST 68(6) 2017. PDF
We draw parallels between two methods for building high-level concepts that are grounded in
documents. Both provide a useful perspective on the other.
Pulling Out the Stops: Rethinking Stopword Removal for Topic Models.
Alexandra Schofield, Måns Magnusson, and David Mimno.
EACL 2017 PDF
Removing the dozen or so most frequent words in a corpus seems to have a big effect on topic models.
Beyond that, however, it's hard to tell the difference between models that had stopwords removed before
and after training.
The Tell-Tale Hat: Surfacing the Uncertainty in Folklore Classification.
Peter M. Broadwell, David Mimno and Timothy R. Tangherlini.
Cultural Analytics, February 2017.
HTML
Cats and Captions vs. Creators and the Clock: Comparing Multimodal Content to Context in Predicting Relative Popularity.
Hessel, Jack, Lillian Lee, and David Mimno.
WWW 2017
If you want to predict the popularity of online content, you need to consider social effects and the time of posting. In some cases a matter of seconds can be significant.
Beyond Exchangeability: The Chinese Voting Process.
Moontae Lee, Seok Hyun Jin, and David Mimno.
NIPS 2016
Helpfulness votes in Amazon and StackExchange forums are sensitive to social pressure. Accounting for forum-specific biases helps identify over- and under-rated content, and quantifies forum culture.
Comparing Apples to Apple: The Effects of Stemmers on Topic Models.
Alexandra Schofield and David Mimno.
Transactions of the Association for Computational Linguistics 4 (2016): 287-300. PDF
Using a stemmer is unlikely to improve topic model results. If you're worried about displaying
small variations of the same word over and over, use the stemmer after training to group words together for display.
Machine Learning and Grounded Theory Method: Convergence, Divergence, and Combination.
Michael Muller, Shion Guha, Eric P. S. Baumer, David Mimno, and N. Sadat Shami.
GROUP 2016.
PDF
Missing Photos, Suffering Withdrawal, or Finding Freedom? How Experiences of Social Media Non-Use Influence the Likelihood of Reversion.
Eric P. S. Baumer, Shion Guha, Emily Quan, David Mimno, Geri K. Gay. Social Media and Society. November 2015.
HTML
Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure. David Mimno, David M Blei, Barbara E Engelhardt. Proceedings of the National Academy of Sciences. 112(26):E3341–50, 2015.
PNAS arXiv preprint
Admixture models represent the alleles in a genome as a combination of latent ancestral populations. We apply posterior predictive checks to evaluate the quality of fit with respect to functions of interest to population biologists.
How Social Media Non-use Influences the Likelihood of Reversion: Self Control, Being Surveilled, Feeling Freedom, and Socially Connecting. Eric P. S. Baumer, Shion Guha, Emily Quan, David Mimno, and Geri Gay. Social Media and Society, 2015.
Robust Spectral Inference for Joint Stochastic Matrix Factorization. Moontae Lee, David Bindel, and David Mimno. NIPS, 2015, Montreal, QC, Canada.
Evaluation methods for unsupervised word embeddings. Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. EMNLP, 2015, Lisbon, Portugal. PDF Data
Word embeddings appear to represent meaning with geometry. But evaluating
this property has been difficult. Getting humans to rate the semantic similarity
of words is error-prone and time consuming. We use a simpler method to collect
human judgments based on pairwise comparisons and odd-one-out detection tasks.
Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements Jordan Boyd-Graber, David Mimno, and David Newman. In Handbook of Mixed Membership Models and Their Applications, CRC/Chapman Hall, 2014.
PDF
A description of common pre-processing steps and model checking diagnostics.
Low-dimensional Embeddings for Interpretable Anchor-based Topic Inference Moontae Lee and David Mimno. EMNLP, 2014, Doha, Qatar. (Selected for oral presentation).
PDF
In this work we trade an approximate solution to an exact problem for an exact solution to an approximation. We use a proven method in data visualization, the t-SNE projection, to compress a high-dimensional word co-occurrence space to a visualizable two- or three-dimensional space, and then find an exact convex hull. The corners of this convex hull become the anchor words for topics.
We find better topics with more salient anchor words, while also improving the interpretability of the algorithm.
Significant Themes in 19th-Century Literature Matthew L. Jockers, David Mimno. Poetics, Dec 2013.
Preprint
Models of literature are usually used for exploratory data analysis, but they can also be used to evaluate specific conjectures. We use permutation tests, bootstrap tests, and posterior predictive checks to test some hypotheses about associations between gender, anonymity, and literary themes.
Random Projections for Anchor-based Topic Inference David Mimno. NIPS workshop on Randomized Methods, 2013.
PDF
Random projections allow us to scale anchor-based topic finding algorithms to large vocabularies. Projections with structured sparsity, for example holding the number of non-zeros in each row of the random projection fixed, produces better results.
A Practical Algorithm for Topic Modeling with Provable Guarantees
Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu.
ICML, 2013, Atlanta, GA. (Selected for long-form presentation).
PDF
Spectral algorithms for LDA have been useful for proving bounds on learning, but it hasn't been clear that they are useful.
This paper presents an algorithm that both maintains theoretical guarantees and also provides extremely fast inference.
We compared this new algorithm directly to standard MCMC methods on a number of metrics for synthetic and real data.
Scalable Inference of Overlapping Communities
Prem Gopalan, David Mimno, Sean Gerrish, Michael J. Freedman, David Blei.
NIPS, 2012, Lake Tahoe, NV. (Selected for spotlight presentation)
Sparse stochastic inference for latent Dirichlet allocation
David Mimno, Matthew Hoffman and David Blei.
ICML, 2012, Edinburgh, Scotland. (Selected for long-form presentation).
PDF
Gibbs sampling can be fast if data is sparse, but doesn't scale because it requires us to keep a state variable for every data point. Online stochastic inference can be fast and uses constant memory, but doesn't scale because it can't leverage sparsity. We present a method that uses Gibbs sampling in the local step of a stochastic variational algorithm. The resulting method can process a 33 billion word corpus of 1.2 million books with thousands of topics on a single CPU.
Computational Historiography: Data Mining in a Century of Classics Journals David Mimno.
ACM J. of Computing in Cultural Heritage. 5, 1, Article 3 (April 2012), 19 pages.
PDF
Topic Models for Taxonomies Anton Bakalov, Andrew McCallum, Hanna Wallach, and David Mimno.
Joint Conference on Digital Libraries (JCDL) 2012, Washington, DC.
PDF
Database of NIH grants using machine-learned categories and graphical clustering
Edmund M Talley, David Newman, David Mimno, Bruce W Herr II, Hanna M Wallach, Gully A P C Burns, A G Miriam Leenders and Andrew McCallum, Nature Methods, Volume 8(7), June 2011, pp. 443--444. HTML PDF
What does the NIH fund, and how are scientific disciplines divided between institutes? In this paper we created a visualization of 100,000 accepted proposals and the 200,000 journal publications associated with those grants.
Reconstructing Pompeian Households David Mimno.
UAI, 2011, Barcelona, Spain. (selected for oral presentation)
PDF
Houses in Pompeii have several architecturally distinct types of rooms,
but it's not always clear what the function of these rooms was, or even
if there was a consistent pattern of use across different houses.
This work uses statistical models to predict the artifacts found
in different rooms.
Bayesian Checking for Topic Models David Mimno, David Blei.
EMNLP, 2011, Edinburgh, Scotland. (selected for oral presentation)
PDF
This paper measures the degree to which data fits the assumptions of topic models.
Topic models represent documents as mixtures of simple, static multinomial
distributions over words.
We know that real documents exhibit "burstiness": when a word occurs in a document, it tends to occur many times.
In this paper, we use a method from Bayesian model checking, posterior predictive checks, to measure the difference between the burstiness we observed and the expectation of the model.
We then use this method to search for clusterings of documents based on time or other observed groupings, that best explain the observed burstiness.
Optimizing Semantic Coherence in Topic Models David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, Andrew McCallum.
EMNLP, 2011, Edinburgh, Scotland. (selected for oral presentation)
PDF
We introduce a metric for detecting semantic errors in topic models and develop a completely unsupervised model that specifically tries to improve this metric.
Topic models provide a useful method for organizing large document collections into a small number of meaningful word clusters.
In practice, however, many topics contain obvious semantic errors that may not reduce predictive power, but significantly weaken user confidence.
We find that measuring the probability that lower-ranked words in a topic co-occur in documents with higher-ranked words beats all current methods for detecting a large class of low-quality topics.
Measuring Confidence in Temporal Topic Models with Posterior Predictive Checks
David Mimno, David Blei.
NIPS Workshop on Computational Social Science and the Wisdom of Crowds, 2010, Whistler, BC.
Rethinking LDA: Why Priors Matter
Hanna Wallach, David Mimno and Andrew McCallum.
NIPS, 2009, Vancouver, BC.
PDF
Supplementary Material
Empirically, we have found that optimizing Dirichlet hyperparameters
for document-topic distributions in topic models makes a huge difference:
topics are not dominated by very common words and topics are more stable
as the number of topics increases. In this paper we explore the effects
of Dirichlet priors on topic models. The best structure seems to be
an asymmetric prior over document-topic distributions and a symmetric
prior over topic-word distributions, currently implemented in the
MALLET toolkit.
Reconstructing Pompeian Households
David Mimno. Applications of Topic Models Workshop, NIPS 2009, Whistler, BC.
House data
Artifact data
PDF (selected for oral presentation)
Pompeii provides a unique view into daily life in a Roman city, but
the evidence is noisy and incomplete. This work applies statistical
data mining methods originating in text analysis to
a database of artifacts found in 30 houses in Pompeii.
Polylingual Topic Models
David Mimno, Hanna Wallach, Jason Naradowsky, David Smith and Andrew McCallum.
EMNLP, 2009, Singapore.
PDF
Standard statistical topic models do not handle multiple languages well,
but many important corpora -- particularly outside scientific publications --
contain a mix of many languages. We show that with simple modifications,
topic models can leverage not only direct translations but also comparable
collections like Wikipedia articles. We demonstrate the system on European
parliament proceedings in 12 languages and comparable Wikipedia articles
in 14 languages.
Code is available in the
cc.mallet.topics.PolylingualTopicModel class in the
MALLET toolkit.
Evaluation Methods for Topic Models
Hanna Wallach, Iain Murray, Ruslan Salakhutdinov and David Mimno.
ICML, 2009, Montreal, Quebec.
PDF
Held-out likelihood experiments provide an important complement to
task-specific evaluations in topic models. We evaluate several methods
for calculating held-out likelihoods. Several previously used methods,
especially the harmonic mean method, show poor accuracy and high variance
compared to a "Chib-style" method and a particle filter-inspired method.
Efficient Methods for Topic Model Inference on Streaming Document Collections
Limin Yao, David Mimno and Andrew McCallum.
KDD, 2009, Paris, France.
PDF
slides on fast sampling
Statistical topic modeling has become popular in text processing,
but remains computationally intensive. It is often impossible to
run standard inference methods on collections because of limited space
(eg large IR corpora) and time (eg streaming corpora). In this paper
we evaluate a number of methods for lightweight online topic inference,
based on models trained from computationally expensive offline processes.
In addition, we present SparseLDA, a new data structure and algorithm
for Gibbs sampling in multinomial mixture models (such as LDA) that
offers substantial improvements in speed and memory usage. A parallelized
version of this algorithm is implemented in
MALLET.
Error: in section 3.4, the statement "The constant s only changes when we
update the hyperparameters α" is incorrect, as the number of words in
the old topic and the new topic change by one. In fact, s
must be updated before and after sampling a topic for each token, but this update
takes a constant number of operations, regardless of the number of topics.
This problem was only in the paper — the MALLET implementation has always been correct.
Polylingual Topic Models
David Mimno, Hanna Wallach, Limin Yao, Jason Naradowsky and Andrew McCallum.
Snowbird Learning Workshop, 2009, Clearwater, FL.
Classics in the Million Book Library
Gregory Crane, Alison Babeu, David Bamman, Thomas Breuel, Lisa Cerrato, Daniel Deckers, Anke Lüdeling, David Mimno, Rashmi Singhal, David A. Smith, Amir Zeldes. Digital Humanities Quarterly 3(1), Winter 2009.
HTML
In October 2008, Google announced a settlement that will provide access to seven million scanned books while the number of books freely available under an open license from the Internet Archive exceeded one million. The collections and services that classicists have created over the past generation place them in a strategic position to exploit the potential of these collections. This paper concludes with research topics relevant to all humanists on converting page images to text, one language to another, and raw text into machine actionable data.
Gibbs Sampling for Logistic Normal Topic Models
with Graph-Based Priors
David Mimno, Hanna Wallach and Andrew McCallum.
NIPS Workshop on Analyzing Graphs, 2008, Whistler, BC. (one of five
out of 22 papers selected for oral presentation)
PDF
Dirichlet distributions are a mathematically tractable prior distribution
for mixing proportions in Bayesian mixture models, but their convenience
comes at the cost of flexibility and expressiveness. Previous work has
suggested alternative priors such as logistic normal distributions, extending
topic mixture models with covariance matrices and dynamic linear models,
but this work has been limited to variational approximations.
This paper presents a method for simple, robust
Gibbs sampling in logistic normal topic models using an auxiliary variable
scheme. Using this method, we extend previous models over linear chains
to Gaussian Markov random field priors with arbitrarily structured graphs.
Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression
David Mimno and Andrew McCallum.
UAI, 2008 (selected for plenary presentation)
PDF data
Text documents are usually accompanied by metadata, such as the authors,
the publication venue, the date, and any references. Work in topic modeling
that has taken such information into account, such as Author-Topic,
Citation-Topic, and Topic-over-Time models, has generally focused on
constructing specific models that are suited only for one particular type
of metadata. This paper presents a simple, unified model for learning
topics from documents given arbitrary non-textual features, which can be
discrete, categorical, or continuous.
Modeling Career Path Trajectories
David Mimno and Andrew McCallum.
University of Massachusetts, Amherst Technical Report #2007-69, 2007.
PDF
Descriptions of previous work experience in resumes are a valuable source of
information about the structure of the job market and the economy. There is,
however, a high degree of variability in these documents.
Job titles are a particular problem, as they are often either overly sparse
or overly general:
85% of job titles in our corpus occur only once, while the most common titles, such as "Consultant", are so broad as to be virtually meaningless.
We use a hierarchical hidden state model to discover
clusters of words that correspond to distinct skills, clusters of skills
that correspond to jobs, and transition patterns between jobs.
Community-based Link Prediction with Text
David Mimno, Hanna Wallach, and Andrew McCallum.
Statistical Network Modeling Workshop, NIPS, 2007, Whistler, BC.
Expertise Modeling for Matching Papers with Reviewers
David Mimno and Andrew McCallum.
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2007, San Jose, CA.
PDF
Data
Science depends on peer review, but matching papers with reviewers is a
challenging and time consuming task. We compare several automatic methods for
measuring the similarity between a submitted abstract and papers previously
written by reviewers. These include a novel topic model that automatically
divides an author's papers into topically coherent "personas".
Probabilistic Representations for Integrating Unreliable Data Sources
David Mimno, Andrew McCallum and Gerome Miklau.
IIWeb workshop at AAAI 2007, Vancouver, BC, Canada.
PDF
Mixtures of Hierarchical Topics with Pachinko Allocation.
David Mimno, Wei Li and Andrew McCallum.
International Conference on Machine Learning (ICML) 2007, Corvallis, OR.
PDF
The four-level pachinko allocation model
(PAM) (Li & McCallum, 2006) represents
correlations among topics using a DAG structure. It does not, however, represent a
nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more
specic topics. This paper presents hierarchical PAM — an enhancement that explicitly represents a topic hierarchy. This model
can be seen as combining the advantages of
hLDA's topical hierarchy representation with
PAM's ability to mix multiple leaves of the
topic hierarchy. Experimental results show
improvements in likelihood of held-out documents, as well as mutual information between
automatically-discovered topics and human-generated categories such as journals.
Mining a digital library for influential authors.
David Mimno and Andrew McCallum.
Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada.
PDF
Most digital libraries let you search for documents, but we often want to
search for people as well. We extract and disambiguate author names from
online research papers, weight papers using PageRank on the citation graph,
and expand queries using a topic model. We evaluate the system by comparing
people returned for the query "information retrieval" to recipients of
major awards in IR.
Organizing the OCA: Learning faceted subjects from a library of digital books.
David Mimno and Andrew McCallum.
Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada.
PDF
The Open Content Alliance is one of several large-scale digitization projects
currently producing huge numbers of digital books. Statistical topic models
are a natural choice for organizing and describing such large text corpora,
but scalability becomes a problem when we are dealing with multi-billion
word corpora. This paper presents a new method for topic modeling, DCM-LDA.
In this model, we train an independent topic model for every book, using
pages as "documents". We then gather the topics discovered, cluster them,
and then fit a Dirichlet prior for each topic cluster. Finally, we retrain
the individual book topic models using these new shared topics.
Beyond Digital Incunabula: Modeling the Next Generation
of Digital Libraries.
Gregory Crane, David Bamman, Lisa Cerrato, Alison Jones, David Mimno, Adrian Packel, D. Sculley, and Gabriel Weaver.
European Conference on Digital Libraries (ECDL) 2006, Alicante, Spain.
PDF
Several groups are currently embarking on large scale digitization projects,
but are they producing anything more than lots of raw text? This paper argues
that such an investment in digitization will be more valuable if accompanied
by a parallel investment in highly structured resources such as dictionaries.
Several examples, including some I worked on while at Perseus, illustrate
this effect.
Bibliometric Impact Measures Leveraging Topic Analysis.
Gideon Mann, David Mimno and Andrew McCallum.
Joint Conference on Digital Libraries (JCDL) 2006, Chapel Hill, NC.
PDF
Powerpoint
When evaluating the impact of research papers, it's important to compare
similar papers: a massively influential paper in Mathematics may be as
well cited as a middling paper in Molecular Biology. We present a system
that combines automatic citation analysis on spidered research papers
with a new automatic topic model that is aware of multi-word terms. This
system is capable of finding fine-grained sub-fields while scaling to the
exponential increase in open-access publishing. We evaluate papers from the
Rexa digital library using both
traditional bibliometric statistics (substituting topics for journals) as
well as several new metrics.
Hierarchical Catalog Records: Implementing a FRBR Catalog.
David Mimno, Alison Jones and Gregory Crane.
DLib, October 2005. HTML
Finding a Catalog: Generating Analytical Catalog Records from Well-structured Digital Texts.
David Mimno, Alison Jones and Gregory Crane.
Joint Conference on Digital Libraries (JCDL) 2005, Denver, CO.
PDF.
Services for a Customizable Authority Linking Environment.
Mark Patton and David Mimno.
demonstration at Joint Conference on Digital Libraries (JCDL) 2004, Tucson, AZ.
Towards a Cultural Heritage Digital Library.
Gregory Crane, Clifford E. Wulfman, Lisa M. Cerrato, Anne Mahoney,
Thomas L. Milbank, David Mimno, Jeffrey A. Rydberg-Cox, David A.
Smith, and Christopher York. Joint Conference on Digital Libraries (JCDL) 2003, Houston, TX.