Online logistic regression
I like Bob Carpenter's paper, "Lazy sparse online logistic regression ...". In addition to being a nice overview of logistic regression, it describes online training of logistic regression by stochastic gradient descent under various parameter priors.
Another cool feature is that when the feature dimensionality is large but the examples are sparse, only the parameters corresponding to the non-zero features of the current example need to be touched; the regularization updates for all other parameters are deferred until their features next appear (this is the lazy part). It is super easy to implement (a few hundred lines of C for an svm_light-style stand-alone application) and trains very fast, as attested to by Leon Bottou.
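To make the idea concrete, here is a minimal sketch of what such a lazy sparse update could look like in C. The struct layout, field names and the function name update_example are mine for illustration, not code from Carpenter's paper, and a constant learning rate and constant L2 discount are assumed for simplicity.

#include <math.h>

/* One sparse training example: parallel arrays of feature indices and values. */
typedef struct {
    int     nnz;   /* number of non-zero features */
    int    *idx;   /* feature indices, length nnz */
    double *val;   /* feature values,  length nnz */
    int     y;     /* label, +1 or -1             */
} Example;

typedef struct {
    int     dim;     /* feature dimensionality                  */
    double *w;       /* weight vector, length dim               */
    long   *last;    /* example count at last touch of w[j]     */
    long    n_seen;  /* examples processed so far               */
    double  eta;     /* learning rate                           */
    double  lambda;  /* strength of the Gaussian (L2) prior     */
} Model;

/* Process one example: first catch up on the L2 shrinkage each touched
 * weight missed while its feature was zero (the "lazy" part), then take a
 * stochastic gradient step on the logistic loss.  Cost per example is O(nnz). */
static void update_example(Model *m, const Example *ex)
{
    double dot = 0.0;
    for (int k = 0; k < ex->nnz; k++) {
        int j = ex->idx[k];
        long skipped = m->n_seen - m->last[j];
        if (skipped > 0)   /* deferred regularization, applied in one shot */
            m->w[j] *= pow(1.0 - m->eta * m->lambda, (double)skipped);
        m->last[j] = m->n_seen;
        dot += m->w[j] * ex->val[k];
    }
    /* gradient of log(1 + exp(-y * w.x)) w.r.t. w, restricted to the non-zeros */
    double g = ex->y / (1.0 + exp(ex->y * dot));
    for (int k = 0; k < ex->nnz; k++)
        m->w[ex->idx[k]] += m->eta * g * ex->val[k];
    m->n_seen++;
}

Calling update_example once per incoming example is all the training loop needs; no pass over the full weight vector ever happens.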
There is one subtlety regarding the regularization discount in a truly online setting, where there is no "end of epoch"; Carpenter discusses this and suggests keeping the discount constant, which, as he points out, corresponds to steadily decreasing the variance of the prior as the number of examples grows.
In my implementation I used 1/(N_INIT + NumExamplesSeenThusFar), where N_INIT is some constant (say 100). The effect is that as the dataset becomes large the prior is ignored, as it should be; however, earlier examples then contribute less to the parameter estimates than later ones.
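In code the schedule is a one-liner; the name reg_discount and the value of N_INIT below are just illustrative, and it would replace the constant discount in the sketch above.

#define N_INIT 100.0   /* prior "pseudo-count"; something like 100 */

/* Per-example regularization discount: the prior's pull shrinks as more
 * examples are seen, so with enough data it is effectively ignored. */
static double reg_discount(long n_seen)
{
    return 1.0 / (N_INIT + (double)n_seen);
}

/* In the lazy update one would then shrink with
 *   m->w[j] *= 1.0 - m->eta * m->lambda * reg_discount(m->n_seen);
 * per skipped example, instead of using a constant factor. */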
To allow for better representational capability given sufficient data, I implemented the degree-2 polynomial kernel via an explicit higher-dimensional feature map; this is either cumbersome or impossible for other kernels. I will discuss a more general kernelization of the method in a later post.
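Roughly, the expansion can be done per example before it is fed to the update; the function expand_degree2 and the index scheme below (quadratic terms offset past the dim original features, so the weight vector grows to dim + dim*dim entries) are illustrative choices of mine, not necessarily what my implementation does.

/* Expand a sparse example into an explicit degree-2 polynomial feature map:
 * the original features plus all pairwise products x_a * x_b (a <= b).
 * The caller provides out_idx / out_val with room for
 * nnz + nnz*(nnz+1)/2 entries; the return value is the expanded nnz. */
static int expand_degree2(const Example *ex, int dim,
                          int *out_idx, double *out_val)
{
    int n = 0;
    for (int a = 0; a < ex->nnz; a++) {
        out_idx[n] = ex->idx[a];               /* keep the linear term */
        out_val[n] = ex->val[a];
        n++;
        for (int b = a; b < ex->nnz; b++) {
            /* quadratic term, given its own slot past the original features */
            out_idx[n] = dim + ex->idx[a] * dim + ex->idx[b];
            out_val[n] = ex->val[a] * ex->val[b];
            n++;
        }
    }
    return n;
}

For very large dim one would hash the pair indices into a fixed-size table instead of reserving dim*dim extra slots, but the idea is the same.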
(Update April 04 2010. Some slides for a basic tutorial on logistic regression, online learning, kernelization and sequence classification.)
Thanks for the pointer to Leon Bottou; I plan to check out his Lush language, it sounds interesting.