Many people have found topic modeling a useful way to explore large text collections. Unfortunately, running your own models usually requires installing statistical tools like R or Mallet. The goals of this project are to (a) make running topic models easy for anyone with a modern web browser, (b) demonstrate the potential of statistical computing in JavaScript, and (c) allow tighter integration between models and web-based visualizations.
Instructions:
When you open the page it will load a file containing documents and a file containing stopwords. The default is a corpus of paragraphs from US State of the Union speeches. It is large enough to get interesting results but small enough to train quickly.
All words have initially been assigned randomly to topics. Click the "Run 50 iterations" button to start training. The iteration count will increase each time the algorithm passes through the dataset.
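To make "random initial assignment" and "one pass through the dataset" concrete, here is a minimal sketch of one way such a training loop can work. It assumes a collapsed Gibbs sampler, which is a common choice for this kind of model; the text above does not say which algorithm is used. The names `initializeModel` and `sweep`, and the smoothing values `alpha` and `beta`, are hypothetical and chosen only for illustration.

```javascript
// Assign every word token to a random topic and build the count tables.
// docs is an array of documents, each an array of word strings.
function initializeModel(docs, numTopics, random = Math.random) {
  const wordTopicCounts = new Map(); // word -> per-topic counts
  const tokensPerTopic = new Array(numTopics).fill(0);
  const state = docs.map((words) => {
    const topicCounts = new Array(numTopics).fill(0);
    const tokens = words.map((word) => {
      const topic = Math.floor(random() * numTopics);
      if (!wordTopicCounts.has(word)) {
        wordTopicCounts.set(word, new Array(numTopics).fill(0));
      }
      wordTopicCounts.get(word)[topic] += 1;
      tokensPerTopic[topic] += 1;
      topicCounts[topic] += 1;
      return { word, topic };
    });
    return { tokens, topicCounts };
  });
  return { docs: state, wordTopicCounts, tokensPerTopic, numTopics, iterations: 0 };
}

// One iteration: resample the topic of every token, given all other tokens.
// alpha and beta are assumed smoothing values, not settings from the tool.
function sweep(model, alpha = 0.1, beta = 0.01, random = Math.random) {
  const { docs, wordTopicCounts, tokensPerTopic, numTopics } = model;
  const vocabSize = wordTopicCounts.size;
  const weights = new Array(numTopics);
  for (const doc of docs) {
    for (const token of doc.tokens) {
      const wordCounts = wordTopicCounts.get(token.word);
      // Remove the token's current assignment from all counts.
      wordCounts[token.topic] -= 1;
      tokensPerTopic[token.topic] -= 1;
      doc.topicCounts[token.topic] -= 1;
      // Weight each topic by its use in this document and its affinity for this word.
      let total = 0;
      for (let t = 0; t < numTopics; t++) {
        weights[t] = ((doc.topicCounts[t] + alpha) * (wordCounts[t] + beta)) /
          (tokensPerTopic[t] + beta * vocabSize);
        total += weights[t];
      }
      // Draw a new topic in proportion to the weights.
      let sample = random() * total;
      let newTopic = 0;
      while (newTopic < numTopics - 1 && sample >= weights[newTopic]) {
        sample -= weights[newTopic];
        newTopic += 1;
      }
      // Record the new assignment.
      token.topic = newTopic;
      wordCounts[newTopic] += 1;
      tokensPerTopic[newTopic] += 1;
      doc.topicCounts[newTopic] += 1;
    }
  }
  model.iterations += 1;
}

// Usage: the equivalent of clicking "Run 50 iterations" once.
const exampleModel = initializeModel([["tax", "budget", "tax"], ["war", "peace"]], 2);
for (let i = 0; i < 50; i++) {
  sweep(exampleModel);
}
console.log(exampleModel.iterations);
```

Each sweep only moves tokens between topics, so the total number of tokens in the count tables stays constant while the topics themselves gradually sharpen.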
The topics on the right side of the page should now look more interesting. Run more iterations if you would like -- there's probably still a lot of room for improvement after only 50 iterations.
Once you're satisfied with the model, you can click on a topic from the list on the right to sort documents in descending order by their use of that topic. Proportions are weighted so that longer documents will come first. You can also explore correlations between topics by clicking the "Topic Correlations" tab. Pairs of topics that are correlated will appear as blue circles; pairs that are anti-correlated will appear as red circles.
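One way to weight topic proportions so that longer documents come first is to add a constant to the denominator of each proportion. The sketch below illustrates that idea only; the function name `rankDocumentsByTopic`, the document shape, and the smoothing value of 10 are assumptions, not details taken from the tool.

```javascript
// Rank documents by their use of one topic.
// Each doc is { id, length, topicCounts }, where topicCounts[t] is the number
// of tokens in the document assigned to topic t (a hypothetical shape).
// Adding a constant to the denominator penalizes short documents: a 5-word
// document entirely about the topic scores 5 / 15, not 1.0.
function rankDocumentsByTopic(docs, topic, smoothing = 10) {
  return docs
    .map((doc) => ({
      id: doc.id,
      score: doc.topicCounts[topic] / (doc.length + smoothing),
    }))
    .sort((a, b) => b.score - a.score); // descending by score
}

// Usage with made-up counts: "long" has 90 of 100 tokens in topic 0,
// "short" has 5 of 5.
const ranked = rankDocumentsByTopic(
  [
    { id: "short", length: 5, topicCounts: [5, 0] },
    { id: "long", length: 100, topicCounts: [90, 10] },
  ],
  0
);
console.log(ranked.map((r) => r.id)); // [ 'long', 'short' ]
```

Without the smoothing term the short document would rank first with a proportion of 1.0, even though it offers much less evidence about the topic.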
Using your own documents:
If you would like to explore your own collection, you can upload documents and stopword list files directly to the browser. No data is sent over the internet. Remember that "document" really means "segment of text". A few hundred words is a good length; longer passages tend to shift their topical focus, making inference more difficult. The format for the documents file is one document per line, with each line consisting of
[doc ID] [tab] [label] [tab] [text...]
(this is the default format for Mallet). The values in the "label" field are treated as a sequence of categories, which are shown in the "timeseries" tab in the order they appear in the documents file.
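If you are preparing or checking a documents file, the format above can be read with a few lines of code. This is a minimal sketch, not the tool's own loader; `parseDocumentLine` and `parseDocuments` are hypothetical names, and skipping malformed lines is an assumption about how you might want to handle them.

```javascript
// Parse one line of the form: [doc ID] <tab> [label] <tab> [text...]
function parseDocumentLine(line) {
  const fields = line.split("\t");
  if (fields.length < 3) {
    return null; // assumption: ignore lines that lack all three fields
  }
  const [id, label, ...rest] = fields;
  // Rejoin the remainder in case the text itself contains tab characters.
  return { id, label, text: rest.join("\t") };
}

// Parse a whole file, one document per line, skipping blank lines.
function parseDocuments(fileContents) {
  return fileContents
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map(parseDocumentLine)
    .filter((doc) => doc !== null);
}

// Usage with two made-up lines; the label here is a year.
const parsedDocs = parseDocuments(
  "doc1\t1790\tfirst example paragraph\n" +
  "doc2\t1791\tsecond example paragraph\n"
);
console.log(parsedDocs.map((d) => d.label)); // [ '1790', '1791' ]
```

Because the labels are kept in file order, sorting the lines of the file by label before loading gives a sensible sequence of categories.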
The format for stopwords is one word per line. The "Vocabulary" tab allows you to dynamically add and remove stopwords, and shows which words appear in many topics and which are more specific. Unicode is supported, so most languages that have meaningful whitespace (i.e., not CJK) should work.
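As an illustration of why whitespace-delimited languages work and CJK text does not, here is a sketch of Unicode-aware tokenization with stopword removal. It is an assumed approach, not the tool's actual tokenizer; `parseStopwords` and `tokenize` are hypothetical names, and lowercasing everything is a choice made for this example.

```javascript
// Read a stopword file (one word per line) into a Set for fast lookup.
function parseStopwords(fileContents) {
  return new Set(
    fileContents
      .split(/\r?\n/)
      .map((word) => word.trim().toLowerCase())
      .filter((word) => word !== "")
  );
}

// Split text into lowercase word tokens and drop stopwords.
// \p{L} matches a letter in any script and \p{M} matches combining marks;
// both require the "u" flag. Digits and punctuation act as separators.
function tokenize(text, stopwords) {
  const tokens = text.toLowerCase().match(/[\p{L}\p{M}]+/gu) || [];
  return tokens.filter((token) => !stopwords.has(token));
}

// Usage with a made-up German phrase and a one-word stopword list.
const exampleStopwords = parseStopwords("die\n");
console.log(tokenize("Über die Brücke, 2024!", exampleStopwords));
```

A run of Chinese or Japanese characters with no spaces would come out of this pattern as a single long token, which is why such text needs word segmentation before it can be modeled.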
To save data from a trained model, go to the downloads tab. The links on this page generate files in your browser; again, no data is sent over the internet.
jsLDA works in Chrome, Safari, and Firefox.