Sparse online kernel logistic regression

In a previous post, I talked about an idea for sparsifying kernel logistic regression by using random prototypes. I also showed how the prototypes themselves (as well as the kernel parameters) can be updated. (Update Apr 2010. Slides for a tutorial on this stuff.)

(As a brief aside, I note that an essentially identical approach was used to sparsify Gaussian Process Regression by Snelson and Ghahramani. For GPR they use gradient ascent on the log-likelihood to learn the prototypes and labels, which is akin to learning the prototypes and betas for logistic regression. The set of prototypes and labels generated by their algorithm can be thought of as a pseudo training set.)

I recently (with the help of my super-competent Java developer colleague Hiroko Bretz) implemented the sparse kernel logistic regression algorithm. The learning is done in an online fashion (i.e., using stochastic gradient descent).

It seems to perform reasonably well on large datasets. Below I'll show its behavior on some pseudo-randomly generated classification problems.

All the pictures below are for logistic regression with the Gaussian RBF kernel. All data sets have 1000 examples from three classes, each of which is a mixture of Gaussians in 2D (shown in red, blue and green). The left panel shows the training data and the right panel shows the predictions on the same data set by the learned logistic regression classifier. The prototypes are shown as black squares.

Example 1 (using 3 prototypes)


[Figure: After first iteration]

[Figure: After second iteration]

[Figure: After about 10 iterations]

Although the classifier changes considerably from iteration to iteration, the prototypes do not seem to change much.

Example 2 (five prototypes)


[Figure: After first iteration]

[Figure: After 5 iterations]

Example 3 (five prototypes)


[Figure: After first iteration]

The rightmost panel shows the first two "transformed features", i.e., the kernel values of the examples to the first two prototypes.

[Figure: After second iteration]

Implementation details and discussion

The algorithm runs through the whole data set to update the betas (fixing everything else), then runs over the whole data set again to update the prototypes (fixing the betas and the kernel parameter), and then a third time to update the kernel parameter (fixing the betas and the prototypes). These three update steps are repeated until convergence.
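To make the update schedule concrete, here is a minimal NumPy sketch of the three alternating passes. It is not the Java implementation mentioned above. It assumes a multiclass softmax model on the kernel features with no bias term, and the function names, learning rates, and the fixed number of rounds (standing in for a convergence check) are all illustrative choices of mine.

```python
import numpy as np

def features(x, U, theta):
    """Kernel values of x to every prototype, plus the squared distances."""
    d2 = np.sum((U - x) ** 2, axis=1)
    return np.exp(-np.exp(theta) * d2), d2

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def mean_nll(X, y, B, U, theta):
    """Average negative log-likelihood over the data set."""
    losses = [-np.log(softmax(B @ features(x, U, theta)[0])[c])
              for x, c in zip(X, y)]
    return float(np.mean(losses))

def sgd_pass(X, y, B, U, theta, block, lr, rng):
    """One pass over the data that updates only the named block."""
    for i in rng.permutation(len(X)):
        x, c = X[i], y[i]
        phi, d2 = features(x, U, theta)
        err = softmax(B @ phi)
        err[c] -= 1.0                  # gradient of the loss w.r.t. the scores
        dphi = B.T @ err               # gradient w.r.t. the kernel features
        gamma = np.exp(theta)
        if block == "betas":
            B -= lr * np.outer(err, phi)
        elif block == "prototypes":
            U -= lr * (2.0 * gamma * dphi * phi)[:, None] * (x - U)
        elif block == "theta":
            theta -= lr * np.sum(dphi * phi * (-gamma * d2))
    return B, U, theta

def fit(X, y, n_classes, m, rounds=5, seed=0):
    """Alternate the three passes; step sizes here are assumed values."""
    rng = np.random.default_rng(seed)
    U = X[rng.choice(len(X), size=m, replace=False)].copy()  # random prototypes
    B = np.zeros((n_classes, m))
    theta = 0.0
    schedule = (("betas", 0.5), ("prototypes", 0.02), ("theta", 0.005))
    for _ in range(rounds):
        for block, lr in schedule:
            B, U, theta = sgd_pass(X, y, B, U, theta, block, lr, rng)
    return B, U, theta
```

Each pass touches only one block of parameters, so the other two act as constants inside that pass, which is what "fixing everything else" amounts to.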

As an indication of the speed, it takes about 10 minutes until convergence with 50 prototypes, on a data set with a quarter million examples and about 7000 binary features (about 20 non-zero features/example).

I had to make some approximations to make the algorithm fast -- the prototypes are updated lazily (i.e., for each example, only the coordinates whose features are ON in that example are updated), and the RBF kernel is computed using the distance only along the subspace of the ON features.
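A rough sketch of these two approximations for binary features follows, assuming each example is stored as an array of its ON indices and the prototypes as a dense matrix. The names `sparse_rbf` and `lazy_prototype_step` are hypothetical, and `dphi` stands for the gradient of the loss with respect to the kernel values, computed elsewhere.

```python
import numpy as np

def sparse_rbf(on, U, theta):
    """Kernel values using the distance along the ON coordinates only.

    on    : integer array of feature indices that are 1 in this example
    U     : (m, d) dense matrix of prototypes
    theta : log of the RBF width parameter
    """
    diff = 1.0 - U[:, on]              # x is 1 on every ON coordinate
    d2 = np.sum(diff ** 2, axis=1)
    return np.exp(-np.exp(theta) * d2), diff

def lazy_prototype_step(on, dphi, U, theta, lr):
    """Gradient step that touches only the ON columns of U (in place)."""
    phi, diff = sparse_rbf(on, U, theta)
    coef = 2.0 * np.exp(theta) * dphi * phi
    U[:, on] -= lr * coef[:, None] * diff
    return U
```

With about 20 ON features out of 7000, each step reads and writes roughly 20 columns per prototype instead of all of them, at the price of ignoring the OFF coordinates in the distance.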

The kernel parameter updating worked best when the RBF kernel was re-parametrized as $latex K(x,u) = \exp(-\exp(\theta) \|x-u\|^2)$.
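One likely reason this parametrization behaves well is that the width exp(theta) stays positive no matter what value an unconstrained gradient step gives theta. The derivative is also simple: dK/dtheta = -exp(theta) ||x-u||^2 K(x,u). A small sketch, with function names of my own choosing:

```python
import numpy as np

def rbf(x, u, theta):
    """RBF kernel with width exp(theta), which is positive for any theta."""
    return np.exp(-np.exp(theta) * np.sum((x - u) ** 2))

def rbf_grad_theta(x, u, theta):
    """Derivative of the kernel value with respect to theta."""
    d2 = np.sum((x - u) ** 2)
    return -np.exp(theta) * d2 * rbf(x, u, theta)
```

Comparing `rbf_grad_theta` against a central finite difference of `rbf` is a quick way to check the formula.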

The learning rate for the betas was annealed, but those for the prototypes and the kernel parameter were fixed at constant values.
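I have not spelled out the annealing schedule here, so the following is only one common choice, an inverse-time decay, shown as an assumption. The function name and the parameters `eta0` and `tau` are hypothetical.

```python
def annealed_rate(eta0, t, tau=1000.0):
    """Inverse-time decay: starts at eta0 and halves by step t = tau."""
    return eta0 / (1.0 + t / tau)

# Example use: a decaying rate for the betas, constants for the other blocks.
beta_rates = [annealed_rate(0.5, t) for t in range(0, 3000, 1000)]
```

Under this schedule the beta step size shrinks as more examples are seen, while the prototype and kernel parameter steps would keep whatever constant value was chosen.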

Finally, and importantly, I did not play much with the initial choice of the prototypes. I just picked a random subset from the training data. I think more clever ways of initialization will likely lead to much better classifiers. Even a simple approach like K-means will probably be very effective.
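For reference, a K-means initialization could look like the sketch below. I have not tried this with the classifier, so it is only an illustration of the idea: it runs plain Lloyd iterations on dense data, starting from the same kind of random subset I used, and `kmeans_prototypes` and `quantization_error` are hypothetical names.

```python
import numpy as np

def quantization_error(X, centers):
    """Mean squared distance from each example to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())

def kmeans_prototypes(X, m, n_iter=20, seed=0):
    """Lloyd's algorithm, started from a random subset of the data."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=m, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(m):
            members = X[assign == j]
            if len(members) > 0:       # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers
```

The returned centers would replace the random subset as the starting prototypes. The pairwise distance array has size n x m x d, so this dense version only suits small problems.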
