I’ve been using this blog as a more philosophical platform, this is going to be about some new features in the machine learning package that I work on, Mallet. One of these, LabeledLDA, is some code that I’ve had lying around for a few years. The other, stop patterns, is a simple addition that may be useful in vocabulary curation for text mining. You’ll need to grab the latest development version from GitHub to run these.
LabeledLDA
The first tool, LabeledLDA, is a simplified topic model for use when documents have labels, and labels correspond to topics. When each document in a text collection has exactly one label, you can estimate a Naïve Bayes model by counting all the words in the subset of documents that have a given label value. When you have no labels, a topic model like Latent Dirichlet Allocation can find semantically coherent “topics”. But what if you have multiple labels per document? For example, consider a 5-star review of a Chinese restaurant. We can think of this as a document with three labels: 5-star, Chinese, and Restaurant. Some of the words in the review might be about being a great restaurant, others might be about Chinese cuisine, and others about restaurants in general. LabeledLDA (Ramage et al., EMNLP 2009; see also McCallum, Multi-label text classification…, 1999) is a way of measuring this kind of connections between words and labels.
LabeledLDA asserts a one-to-one relationship between labels and topics. Every document is a mixture of the words associated with its labels, and nothing else. This assumption simplifies inference because in a given document we only have to assign each word token to one of the labels for that document, rather than the entire set of documents. An additional advantage is that each topic has, by definition, an existing label, so we can skip the “squinting at it” stage of fully unsupervised topic model analysis. The cost of that assumption is that we may miss emergent patterns or language that’s not well modeled by the labels. There has been work on models that make looser connections between labels and topics, but that’s a question for another day.
Running LabeledLDA requires that we have documents with multiple labels. The easiest way to do this in Mallet is with the new “—label-as-features” option in the “import-file” command. The input for this command is a text file with one line per document and three columns: an ID, a set of labels, and the text. By default, Mallet interprets the first two whitespace-delimited fields of each line as the ID and label. That’s a problem because we’d like to use some whitespace to delimit our multiple labels, but it’s easy to fix by changing the line-parsing regular expression to separate columns with tabs, and thus allow spaces to exist in the label field. Here’s a full command line example:
bin/mallet import-file --input yelp-short.txt --output yelp-short.seq --stoplist-file yelp.stops --label-as-features --keep-sequence --line-regex '([^\t]+)\t([^\t]+)\t(.*)'
As an example I’m going to use the fourth round Yelp dataset —- it’s freely available, but please go to the source. As labels or “features”, I’m using the tags, city and region for each business, and the review stars. From a subset of 100,000 reviews, this results in 553 labels. Here’s the command to run a model and save “topic keys” to a file:
bin/mallet run cc.mallet.topics.LabeledLDA --input yelp-short.seq --output-topic-keys yelp-llda.keys
The command supports most of the input and output options from standard Mallet LDA, in many cases with the exact same code. Use the “—help” option to find out more. Sharing code makes it easy to add functionality, but sometimes causes problems —- for example the held-out probability evaluator and topics-for-new-documents inferencer don’t pay attention to document labels.
The “keys” file contains one line per label/topic. The first column is the topic ID number, corresponding to the original order that each label first appeared in the training data. The second column is the label string. The third column is the total number of tokens assigned to that topic at the particular Gibbs sampling state where we stopped. The last column is a space-delimited list of 20 words in descending order by probability in the topic. Topics with very few total words may have less than 20 words.
Here are some examples of high-frequency topics:
10 Restaurants 727089 salad sauce meal cheese table dinner wasn't side server bit both flavor taste took served tasted something tasty hot though
232 Sushi_Bars 38441 sushi roll rolls fish happy hour tuna spicy sashimi salmon favorite sakana quality nigiri tempura chefs places japanese special rice
29 Hotels_&_Travel 30017 room hotel stay desk rooms front stayed day bed marriott check pool property booked area clean airport car checked resort
32 Hotels 29163 room hotel stay breakfast rooms free clean pool bed stayed comfortable bathroom area shower inn suites desk hotels hot water
16 Chinese 51834 chinese rice fried soup beef egg sauce hot dishes pork shrimp sour spicy dish crab orange noodles delivery china mein
Moderate-frequency topics are also pretty good, although there are some specific words mixed in like “Zoe’s” for “Southern”:
477 Dim_Sum 730 dim sum c-fu dimsum carts phoenix cart variety dim-sum feet palace balls items xmas tripe tradition leaf lotus steamed tarts
298 Office_Equipment 728 office staples max ink officemax print printer cartridge toner printing cartridges supplies printed router copy computer file copies depot computers
113 Videos_&_Video_Game_Rental 724 movies video movie blockbuster netflix films rent paradise releases rentals selection section games title rental foreign dvd titles film adult
468 Apache_Junction_AZ 721 apache junction elvira's lake canyon patio bloody mechanic girlfriend mary gringos lglg&c trail hiking miles beautiful truck hacker's locos cantina
155 Home_Decor 659 tile candles boone granite holland slab stone rocks kitchen slabs wood pewter minerals stones rock world decorate molding marling linen
234 Southern 657 zoe's sandwich cake salad zoes feta chocolate slaw pasta grilled healthy tuna potato kabobs dry limeade chips pimento sandwiches roll-ups
471 Gelato 648 gelato angel sweet flavors chocolate dark cotta panna italy hazelnut peanut flavor super texture taste creamy size dairy sweet's gelatos
Rare topics are increasingly noisy:
423 Interior_Design 8 copehaigen whatzits registries adhd distain untrained charlotte step
410 Airlines 8 unshaven sweatshirt csa eqipment scruffy hooded fbo vending
380 Russian 8 fsu embarrasses semi-salted latvia selection.i yearning jewish
552 Divorce_&_Family_Law 6 pontoni attornies results.there jea retained
506 Motorcycle_Rental 6 registration breach brags corky harley
504 Tutoring_Centers 6 peripherals really-really-smart clock-mini-speaker reassurances pro's
The star-rating labels are particularly interesting.
4 5_star 255897 amazing years every recommend awesome family excellent highly everything favorite wonderful times prices many can't fantastic feel most everyone day
6 4_star 204279 times prices usually area location bit its most price many every find years lot though favorite enjoy you're recommend excellent
11 3_star 92213 decent nothing bad bit prices stars though times average its area overall price however lot location most wasn't probably give
5 2_star 85203 minutes asked told another table took bad wasn't times server last manager finally waited not
7 1_star 213432 told asked minutes manager customer another rude called him took worst bad money business horrible left his call give finally
To me these look great. Five-star reviews have lots of positive adjectives, one-star reviews are narratives about employees who were rude or slow, and three-star words seem to capture the essence of ho-hum averageness. But they’re also not that different from the overall sub-corpus frequencies. The top words for five-star reviews (minus stopwords) are amazing, years, every, recommend, awesome, favorite, everything, and day. The most significant difference seems to be family, which is the 19th most common word and occurs 25% less often than everything, but ranks higher within this five-star topic. This result is stable —- running the algorithm again with a random initialization results in essentially the same order with a few word pairs swapped.
Stop Patterns
There was a request on the topic list for a way to remove patterns of words rather than just specific stopwords. I’ve added code to do that.
Here’s a sample regular expression file. Note that I’m using the Java matches
function, so a pattern must account for the entire token —- thus the trailing .*.
.*ly
photog.*
.*\.{2,}.*
The first is an attempt to remove English adverbs, but may not be ideal for people named Kelly or my friends at Yummly. The second looks for a prefix. The third looks for multiple dots, which seems to be a common pattern in some Yelp reviews. If there is a problem with a regular expression, Mallet will print an error and will ignore all subsequent patterns, but will not stop processing. Here’s a sample file with examples:
1 X this is a terribly useful test for....things involving photographs, photographers, and photography called test.txt
Here’s what happens when I run the standard method:
% bin/mallet import-file --input test.txt --print-output
name: 1
target: X
input: this(0)=1.0
terribly(1)=1.0
useful(2)=1.0
test(3)=1.0
for....things(4)=1.0
involving(5)=1.0
photographs(6)=1.0
photographers(7)=1.0
and(8)=1.0
photography(9)=1.0
called(10)=1.0
test.txt(11)=1.0
Notice that the new default token pattern ignores one- and two-letter words (is, a), but includes tokens that contain punctuation, which picks up URLs, apostrophes, and hyphens. Keep in mind these interactions with the tokenization pattern when designing regular expressions. Here’s what the output looks like afterwards.
% bin/mallet import-file --input test.txt --stop-pattern-file stoplists/patterns.txt --print-output
name: 1
target: X
input: this(0)=1.0
useful(1)=1.0
test(2)=1.0
involving(3)=1.0
and(4)=1.0
called(5)=1.0
test.txt(6)=1.0
Using stop patterns involves creating a Matcher object for each token and each pattern, and checking that pattern against the token string. In contrast, static stopword removal is a hash lookup. I haven’t measured performance yet, but I would guess that stop patterns should be used sparingly.