--momentum

This setting is currently ignored; I'm going to remove it in the next update.

--num-topics

The total number of topics, as with standard LDA.

--num-samples
--sample-burn-in

These two settings control what we do with each document. When we look at the words in a document, we estimate the probability that each word is assigned to each topic, given the current setting of the topics. We use that probability to update the topics. Theoretically, if there are N word tokens in the document and K topics, that should mean we need N K-dimensional distributions -- one probability value for every topic given every word. Even writing down all those numbers would take a long time if K gets large. Worse, most of those numbers will be very, very close to zero.

We can do a much better job by giving each word token a specific topic assignment and "sweeping" through the word tokens a small number of times, resampling each token's topic as we pass through, conditioned on the state of all the other tokens. Once we're done, we count how many times word w was assigned to topic k, divide by the number of sweeps, and use that approximate probability to update the topics.

The first option, --num-samples, controls how many times we "sweep" through the tokens in a document, resampling a topic assignment for each token. With more sweeps we get a better estimate of the probability of topics given words, but we also spend more time. Five sweeps seems reasonable.

This number of sweeps includes an initialization pass. We expect that the assignments in that first sweep won't be as good as later ones, so it's usual to discard the first few sweeps. The --sample-burn-in option controls how many sweeps are discarded before we start recording topic assignments. (A short code sketch of this procedure appears at the end of this page.)

--batch-size

The online algorithm alternates between (1) selecting a random subset of the corpus (the "mini-batch") and inferring topic assignments for the words in those documents given the current setting of the topics, and (2) updating the topics given the topic assignments from the mini-batch. This option is the number of documents we select randomly between each update to the topics. Performance seems fairly insensitive to this setting; 100--200 is probably about right.

--learning-rate
--learning-rate-exponent

After we process each mini-batch, we collect the topic-word assignments and use them to update the topics. How far we step in the direction of the gradient depends on the learning rate. As we see more data and the model starts to get better, we want to take smaller steps. If the learning rate is too big, we may go too far in one direction and reach an unstable configuration. If we take steps that are too small, we'll never improve.

The standard formula is (t_0 + t)^(-kappa). The option --learning-rate specifies t_0 (it's really more of an offset than a learning rate, so the name of the option may change soon). If this number is 0, the model will take large steps at first (i.e. 1.0, 0.5, 0.33, ...) but quickly start taking smaller steps. The initial iterations will make a big difference, and later iterations will be less significant. If the number is larger, the model will start with smaller steps, but each iteration will have more even importance.

The option --learning-rate-exponent is kappa in the formula above. It also controls how quickly the learning rate changes. Values between 0.5 and 1.0 are theoretically justified. If the value is 1.0, the learning rate gets small fast, while if it is 0.5, the learning rate stays flatter.
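To make the interaction of these two options concrete, here is a small Python sketch of the schedule. The function name and the example settings are purely illustrative; they are not part of the tool itself.

    # Step size applied after mini-batch t, with t_0 from --learning-rate
    # and kappa from --learning-rate-exponent.
    def learning_rate(t, t0, kappa):
        return (t0 + t) ** -kappa

    # t0 = 0, kappa = 1.0: big early steps that shrink quickly (1.0, 0.5, 0.33, ...).
    # A larger t0, or kappa = 0.5, keeps the steps smaller and more evenly weighted.
    for t0, kappa in [(0, 1.0), (64, 1.0), (64, 0.5)]:
        print(t0, kappa, [round(learning_rate(t, t0, kappa), 3) for t in range(1, 6)])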
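Finally, here is the sketch of the per-document sweep procedure described under --num-samples and --sample-burn-in. The function name, the smoothing constant, and the simplified sampling distribution are assumptions made for illustration, not the actual implementation; the counts are divided by the number of recorded (post-burn-in) sweeps.

    import random

    def sweep_estimate(doc_tokens, topic_word_probs, num_topics,
                       num_samples=5, sample_burn_in=2):
        """Approximate p(topic | token) for one document by sweeping.

        doc_tokens: list of word ids
        topic_word_probs[k][w]: current probability of word w under topic k
        """
        # Initialization pass: give every token a random topic assignment.
        assignments = [random.randrange(num_topics) for _ in doc_tokens]
        doc_topic_counts = [0] * num_topics
        for z in assignments:
            doc_topic_counts[z] += 1

        # How often each token position was assigned to each topic,
        # counting only the sweeps kept after burn-in.
        recorded = [[0] * num_topics for _ in doc_tokens]

        for sweep in range(num_samples):
            for i, w in enumerate(doc_tokens):
                # Drop this token's assignment, then resample its topic
                # conditioned on the state of all the other tokens.
                doc_topic_counts[assignments[i]] -= 1
                weights = [topic_word_probs[k][w] * (doc_topic_counts[k] + 0.1)
                           for k in range(num_topics)]  # 0.1: stand-in smoothing prior
                k = random.choices(range(num_topics), weights=weights)[0]
                assignments[i] = k
                doc_topic_counts[k] += 1
                if sweep >= sample_burn_in:
                    recorded[i][k] += 1

        kept = max(num_samples - sample_burn_in, 1)
        # Divide by the number of recorded sweeps to get approximate probabilities.
        return [[c / kept for c in row] for row in recorded]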