18. September 2014

Labels and Patterns

I’ve been using this blog as a more philosophical platform, but this post is going to be about some new features in Mallet, the machine learning package that I work on. One of these, LabeledLDA, is some code that I’ve had lying around for a few years. The other, stop patterns, is a simple addition that may be useful in vocabulary curation for text mining. You’ll need to grab the latest development version from GitHub to run these.

19. August 2014

Data carpentry

The New York Times has an article titled “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”. Mostly I really like it. The fact that raw data is rarely usable for analysis without significant work is a point I try hard to make with my students. I told them “do not underestimate the difficulty of data preparation”. When they turned in their projects, many of them reported that they had underestimated the difficulty of data preparation. Recognizing this as a hard problem is great.

What I’m less thrilled about is calling this “janitor work”. For one thing, it’s not particularly respectful of custodians, whose work I really appreciate. But it also mischaracterizes what this type of work is about. I’d like to propose a different analogy that I think fits a lot better: data carpentry.

14. August 2014

A useful word

I was reading a paper the other day and came across the word aleatory. This turns out to be an excellent word. It comes from the Latin alea for “dice”, as in alea jacta est, which is what you say when you’re Julius Caesar and you cross the Rubicon. It means random, or subject to chance. It seems to come up mainly in legal contexts: an aleatory contract is one whose terms depend on future events, like an insurance policy. This got me thinking about other words for the property of randomness.
