An effective kernelization of logistic regression

I will present a sparse kernelization of logistic regression where the prototypes are not necessarily from the training data.

Traditional sparse kernel logistic regression

Consider an $latex M$-class logistic regression model given by

$latex P(y|x)\propto\mbox{exp}(\beta_{y0} + \sum_{j=1}^{d}\beta_{yj}x_j)$ for $latex y=1,\ldots,M$

where $latex j$ indexes the $latex d$ features.

Fitting the model to a data set $latex D = \{x_i, y_i\}_{i=1,\ldots,N}$ involves estimating the betas to maximize the likelihood of $latex D$.

The above logistic regression model is quite simple (because the classifier is a linear function of the features of the example), and in some circumstances we might want a classifier that can produce a more complex decision boundary. One way to achieve this is by kernelization. We write

$latex P(y|x) \propto \mbox{exp}(\beta_{y0} + \sum_{i=1}^N \beta_{yi} k(x,x_i))$ for $latex y=1,\ldots,M$

where $latex k(.,.)$ is a kernel function.
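To make the kernelized model concrete, here is a minimal sketch, assuming a Gaussian r.b.f. kernel and NumPy. The function names and the array layout (`beta[i, y]` holding $latex \beta_{yi}$) are illustrative choices, not part of any particular library.

```python
import numpy as np

def rbf_kernel(X, U, lam=1.0):
    """Return K with K[i, j] = exp(-lam * ||X[i] - U[j]||^2)."""
    # Broadcasting: (N, 1, d) - (1, n, d) -> (N, n, d), then sum over d.
    sq_dist = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-lam * sq_dist)

def kernel_logreg_proba(X_test, X_train, beta0, beta, lam=1.0):
    """P(y|x) for the full kernelized model.

    beta0 has shape (M,) and beta has shape (N, M): one coefficient
    per training example per class.
    """
    scores = beta0 + rbf_kernel(X_test, X_train, lam) @ beta
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)
```

Note that `kernel_logreg_proba` needs the whole of `X_train` just to score one test example.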

To use this classifier at run-time we have to store all the training feature vectors as part of the model, because we need to compute the kernel value of the test example against every one of them. This is highly inefficient, and with one coefficient per training example per class the model will also over-fit the training data severely.

The solution to both the test-time efficiency and the over-fitting problems is to enforce sparsity. That is, we somehow make sure that $latex \beta_{yi} =0$ for all but a few examples $latex x_i$ from the training data. The import vector machine does this by greedily picking some $latex n < N$ examples so that the reduced $latex n$-example model best approximates the full model.

Sparsification by randomized prototype selection

The sparsified kernel logistic regression therefore looks like

$latex P(y|x) \propto \mbox{exp}(\beta_{y0} + \sum_{i=1}^n\beta_{yi} k(x,u_i))$ for $latex y=1,\ldots,M$

where the feature vectors $latex u_i$ are from the training data set. We can see that all we are doing is a vanilla logistic regression on a transformed feature space. The original $latex d$ dimensional feature vector has been transformed into an $latex n$ dimensional vector, where each dimension measures the kernel value of our test example $latex x$ to a prototype vector (or reference vector) $latex u_i$.

What happens if we just selected these $latex n$ prototypes randomly instead of greedily as in the import vector machine?

Avrim Blum showed the following. Suppose the training data distribution is such that the two classes can be linearly separated with a margin $latex \gamma$ in the feature space induced by the kernel function. Then, if we pick a sufficient number of prototypes randomly, the classes can with high probability be linearly separated with margin $latex \gamma/2$ and low error in the transformed feature space.

That's a mouthful, but basically we can use Blum's method for kernelizing logistic regression as follows. Just pick $latex n$ random vectors from your dataset (in fact they need not be labeled), compute the kernel value of an example to these $latex n$ points and use these as $latex n$ features to describe the example. We can then learn a straightforward logistic regression model on this $latex n$ dimensional feature space.
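The recipe above can be sketched as follows. This is a minimal illustration, assuming a Gaussian r.b.f. kernel, a made-up toy data set, and plain batch gradient ascent for the logistic regression step. All names here are illustrative.

```python
import numpy as np

def rbf_features(X, U, lam=1.0):
    """Map each row of X to its kernel values against the prototypes U."""
    sq_dist = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-lam * sq_dist)

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)
    P = np.exp(S)
    return P / P.sum(axis=1, keepdims=True)

def mean_log_likelihood(Z, Y, beta0, beta):
    P = softmax(beta0 + Z @ beta)
    return np.mean(np.log(P[np.arange(len(Y)), Y]))

def fit_logreg(Z, Y, n_classes, lr=0.1, n_steps=500):
    """Plain gradient ascent on the mean log-likelihood."""
    N, n = Z.shape
    beta0, beta = np.zeros(n_classes), np.zeros((n, n_classes))
    T = np.eye(n_classes)[Y]                 # one-hot targets, shape (N, M)
    for _ in range(n_steps):
        R = T - softmax(beta0 + Z @ beta)    # I(y = y_i) - P(y | x_i)
        beta0 += lr * R.mean(axis=0)
        beta += lr * (Z.T @ R) / N
    return beta0, beta

rng = np.random.default_rng(0)
# Toy data: label 1 inside the unit circle, 0 outside (not linearly separable).
X = rng.uniform(-2, 2, size=(400, 2))
Y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)

n = 10
# Random prototypes drawn from the data; their labels are never used.
U = X[rng.choice(len(X), size=n, replace=False)]
Z = rbf_features(X, U, lam=1.0)              # the new n-dimensional features
beta0, beta = fit_logreg(Z, Y, n_classes=2)
```

Any off-the-shelf logistic regression package could replace `fit_logreg`; the only kernel-specific step is the call to `rbf_features`.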

As Blum notes, $latex k(.,.)$ need not even be a valid kernel for this method to be used. Any reasonable similarity function would work, although the theoretical guarantee above no longer holds.

Going a step further -- Learning the reference vectors

A key point to note is that there is no reason for the prototypes $latex \{u_1, u_2,\ldots,u_n\}$ to be part of the training data. Any reasonable reference points in the original feature space would work. We just need to pick them so as to enable the resulting classifier to separate the classes well.

Therefore I propose kernelizing logistic regression by maximizing the log-likelihood with respect to the betas as well as the reference points. We can do this by gradient ascent on the log-likelihood (equivalently, gradient descent on its negative), starting from $latex n$ random points from our data set. The only requirement is that the kernel function be differentiable with respect to the reference point $latex u$. (Note: learning vector quantization is a somewhat related idea.)

Because of obvious symmetries (permuting the reference vectors along with their betas leaves the model unchanged), the log-likelihood function is non-convex with respect to the reference vectors. However, the local optima close to the randomly selected reference points are, in terms of training log-likelihood, no worse than the random reference points themselves.

The gradient with respect to a reference vector

Let us derive the gradient of the log-likelihood function with respect to a reference vector. First let us denote $latex k(x_i, u_j)$, i.e., the kernel value of the $latex i^{th}$ feature vector with the $latex j^{th}$ prototype by $latex z_{ij}$.

The log-likelihood of the data is given by

$latex L = \sum_{i=1}^N \sum_{y=1}^M \mbox{log}P(y|x_i) I(y=y_i)$

where $latex I(.)$ is the usual indicator function. The gradient of $latex L$ with respect to the parameters $latex \beta$ can be found in any textbook on logistic regression. The derivative of $latex P(y|x_i)$ with respect to the reference vector $latex u_l$ is

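$latex \frac{\partial}{\partial u_l} P(y|x_i) = P(y|x_i) \left[\beta_{yl} - \sum_{y'=1}^M \beta_{y'l} P(y'|x_i)\right] \frac{\partial}{\partial u_l} z_{il}$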

Putting it all together we have

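$latex \frac{\partial}{\partial u_l} L = \sum_{i=1}^N \left[\sum_{y=1}^M \beta_{yl} I(y=y_i) - \sum_{y=1}^M \beta_{yl} P(y|x_i)\right] \frac{\partial}{\partial u_l} z_{il}$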

That's it. We can update all the reference vectors in the direction given by the above gradient by an amount that is controlled by the learning rate.
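Here is a minimal sketch of such an update, assuming the Gaussian r.b.f. kernel $latex z_{il} = \mbox{exp}(-\lambda||x_i-u_l||^2)$ and the array layout `beta[l, y]` for $latex \beta_{yl}$. The function names are illustrative. The gradient is derived directly from the model definition, so it is worth verifying against finite differences before trusting it.

```python
import numpy as np

def rbf_features(X, U, lam):
    sq_dist = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-lam * sq_dist)

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)
    P = np.exp(S)
    return P / P.sum(axis=1, keepdims=True)

def log_likelihood(X, Y, U, beta0, beta, lam):
    """L = sum_i log P(y_i | x_i); beta has shape (n, M)."""
    P = softmax(beta0 + rbf_features(X, U, lam) @ beta)
    return np.sum(np.log(P[np.arange(len(Y)), Y]))

def grad_reference_vectors(X, Y, U, beta0, beta, lam):
    """dL/dU, shape (n, d), for the Gaussian r.b.f. kernel."""
    Z = rbf_features(X, U, lam)                               # z_il, (N, n)
    R = np.eye(beta.shape[1])[Y] - softmax(beta0 + Z @ beta)  # I(y=y_i) - P(y|x_i)
    W = (R @ beta.T) * Z      # W[i, l] = z_il * sum_y beta_yl * R[i, y]
    # dz_il/du_l = 2 * lam * (x_i - u_l) * z_il, accumulated over i.
    return 2.0 * lam * (W.T @ X - W.sum(axis=0)[:, None] * U)

def ascent_step(X, Y, U, beta0, beta, lam, lr=0.01):
    """One joint gradient ascent step on the betas and the reference vectors."""
    Z = rbf_features(X, U, lam)
    R = np.eye(beta.shape[1])[Y] - softmax(beta0 + Z @ beta)
    g_U = grad_reference_vectors(X, Y, U, beta0, beta, lam)
    return U + lr * g_U, beta0 + lr * R.sum(axis=0), beta + lr * (Z.T @ R)
```

Calling `ascent_step` repeatedly, starting with `U` set to randomly chosen rows of `X`, gives the batch version of the procedure. The learning rate `lr` here is an arbitrary placeholder.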

Checking our sums

Let us check what happens if there is only one reference vector $latex u_1$ and $latex z_{i1} = k(x_i, u_1) = \langle x_i, u_1 \rangle$. That is, we use a linear kernel. We have

$latex \frac{\partial}{\partial u_1} z_{i1} = x_i$ and therefore

$latex \frac{\partial}{\partial u_1} L = \sum_{i=1}^N x_i\left[\sum_{y=1}^M \beta_{y1} I(y=y_i) - \sum_{y=1}^M \beta_{y1} P(y|x_i)\right]$



which is very similar to the gradient of $latex L$ with respect to the $latex \beta$ parameters. This is reasonable because with a linear kernel we are essentially learning a logistic regression classifier on the original feature space, where $latex \beta_{y1} u_1$ takes the place of $latex \beta_y$.
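A quick numerical check of this correspondence, as a sketch with made-up random values: the scores of the one-prototype linear-kernel model coincide with those of an ordinary linear model whose per-class weight vectors are $latex \beta_{y1} u_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M = 6, 3, 4
X = rng.normal(size=(N, d))
u1 = rng.normal(size=d)            # the single reference vector
beta0 = rng.normal(size=M)         # beta_{y0}
beta1 = rng.normal(size=M)         # beta_{y1}, one coefficient per class

# Kernelized scores: beta_{y0} + beta_{y1} * <x_i, u_1>
z = X @ u1                                      # z_{i1}, shape (N,)
scores_kernel = beta0 + np.outer(z, beta1)      # shape (N, M)

# Ordinary linear scores with weight vectors w_y = beta_{y1} * u_1
W = np.outer(beta1, u1)                         # shape (M, d)
scores_linear = beta0 + X @ W.T

same = np.allclose(scores_kernel, scores_linear)
```

One caveat the check makes visible: with a single reference vector every class weight vector is a multiple of $latex u_1$, so `W` has rank one. This is a restricted form of linear logistic regression.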

If our kernel is the Gaussian radial basis function we have

$latex \frac{\partial}{\partial u_l} z_{il} = \frac{\partial}{\partial u_l} \mbox{exp}(-\lambda||x_i-u_l||^2) = 2\lambda (x_i - u_l) z_{il}$
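Kernel derivatives like this one are easy to get wrong by a sign or a factor of two, so a finite-difference check is cheap insurance. A minimal sketch follows. The helper names are illustrative.

```python
import numpy as np

def rbf(x, u, lam):
    return np.exp(-lam * np.sum((x - u) ** 2))

def rbf_grad_u(x, u, lam):
    """Analytic derivative of the kernel value with respect to u."""
    return 2.0 * lam * (x - u) * rbf(x, u, lam)

def numeric_grad_u(x, u, lam, eps=1e-6):
    """Central finite differences, one coordinate of u at a time."""
    g = np.zeros_like(u)
    for k in range(len(u)):
        step = np.zeros_like(u)
        step[k] = eps
        g[k] = (rbf(x, u + step, lam) - rbf(x, u - step, lam)) / (2 * eps)
    return g
```

The same pattern works for any other differentiable kernel: swap in its value and derivative and compare.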

Learning the kernel parameters

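Of course, gradient descent can be used to update the parameters of the kernel as well. For example, we can initialize the parameter $latex \lambda$ of the Gaussian r.b.f. kernel to a reasonable value and optimize it to maximize the log-likelihood as well. The expression for the gradient with respect to the kernel parameter is

$latex \frac{\partial}{\partial \lambda} L = \sum_{i=1}^N \sum_{l=1}^n \left[\sum_{y=1}^M \beta_{yl} I(y=y_i) - \sum_{y=1}^M \beta_{yl} P(y|x_i)\right] \frac{\partial}{\partial \lambda} z_{il}$

where, for the Gaussian r.b.f. kernel, $latex \frac{\partial}{\partial \lambda} z_{il} = -||x_i - u_l||^2 z_{il}$.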
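For the Gaussian r.b.f. kernel, the gradient with respect to $latex \lambda$ can be sketched as below. This assumes the array layout `beta[l, y]` for $latex \beta_{yl}$ and uses illustrative function names. The formula is derived from the model definition and is best verified against finite differences.

```python
import numpy as np

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)
    P = np.exp(S)
    return P / P.sum(axis=1, keepdims=True)

def log_likelihood(X, Y, U, beta0, beta, lam):
    sq_dist = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    P = softmax(beta0 + np.exp(-lam * sq_dist) @ beta)
    return np.sum(np.log(P[np.arange(len(Y)), Y]))

def grad_lambda(X, Y, U, beta0, beta, lam):
    """dL/dlam for z_il = exp(-lam * ||x_i - u_l||^2)."""
    sq_dist = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    Z = np.exp(-lam * sq_dist)
    R = np.eye(beta.shape[1])[Y] - softmax(beta0 + Z @ beta)  # I(y=y_i) - P(y|x_i)
    # dz_il/dlam = -||x_i - u_l||^2 * z_il, summed over all i and l.
    return np.sum((R @ beta.T) * (-sq_dist * Z))
```

Since $latex \lambda$ has to stay positive, one option (an assumption on my part, not something tested here) is to optimize $latex \mbox{log}\lambda$ instead. By the chain rule, this simply multiplies the gradient above by $latex \lambda$.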

Going online

The optimization of the reference vectors can be done in an online fashion by stochastic gradient descent à la Bob Carpenter.
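A minimal sketch of one such online step, assuming the Gaussian r.b.f. kernel and a simultaneous update of the betas and the reference vectors from a single example. The function name, the in-place update style, and the learning rate are all illustrative choices.

```python
import numpy as np

def sgd_step(x, y, U, beta0, beta, lam, lr=0.05):
    """Update U, beta0 and beta in place from one example (x, y).

    U: (n, d) reference vectors, beta0: (M,), beta: (n, M), all float arrays.
    Returns log P(y | x) evaluated before the update.
    """
    diff = x - U                                    # rows are x - u_l, (n, d)
    z = np.exp(-lam * np.sum(diff ** 2, axis=1))    # kernel values, (n,)
    s = beta0 + z @ beta                            # class scores, (M,)
    p = np.exp(s - s.max())
    p /= p.sum()
    r = -p
    r[y] += 1.0                                     # I(y' = y) - P(y' | x)
    w = (beta @ r) * z                              # per-prototype weight, (n,)
    g_U = 2.0 * lam * w[:, None] * diff             # gradient w.r.t. each u_l
    g_beta = np.outer(z, r)
    # All gradients were computed from the old parameters; now apply them.
    U += lr * g_U
    beta0 += lr * r
    beta += lr * g_beta
    return np.log(p[y])
```

Looping `sgd_step` over a stream of examples gives the online algorithm. Freezing one group of parameters just means skipping the corresponding update line.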

Is it better to update all the parameters of the model (betas, reference vectors, kernel parameters) at the same time or wait for one set (say the betas) to converge before updating the next set (reference vectors)?

Miscellany

1. Since conditional random fields are just generalized logistic regression classifiers, we can use the same approach to kernelize them. Even if all the features are binary, the reference vectors can be allowed to be continuous.

2. My colleague Ken Williams suggests keeping the model small by sparsifying the reference vectors themselves. The reference vectors can be encouraged to be sparse by imposing a Laplace prior on them, which amounts to an L1 penalty.

3. The complexity of the resulting classifier can be controlled by the choice of the kernel and the number of reference vectors. I don't have a good intuition about the effect of the two choices. For a linear kernel it seems obvious that any number of reference points should lead to the same classifier. What happens with a fixed degree polynomial kernel as the number of reference points increases?

4. Since the reference points can be moved around in the feature space, it seems extravagant to learn the betas as well. What happens when we fix the betas to random values uniformly distributed in [-1,1] and just learn the reference vectors? For what kernels do we obtain the same model as if we learned the betas as well?

5. I wonder if a similar thing can be done for support vector machines where a user specifies the kernel and the number of support vectors and the learning algorithm picks the required number of support vectors (not necessarily from the data set) such that the margin (on the training data) is maximized.

6. Ken pointed me to Archetypes, which is another related idea. In archetypal analysis the problem is to find a specified number of archetypes (reference vectors) such that all the points in the data set can be approximated as closely as possible by convex combinations of the archetypes. It does not relate directly to classification.
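As an illustration of point 2, here is one way the L1 penalty on the reference vectors could be handled: a gradient step followed by soft-thresholding (a proximal gradient update). This particular technique is an assumption chosen for the sketch, not something worked out above, and the names `sparse_update` and `l1_strength` are hypothetical.

```python
import numpy as np

def soft_threshold(U, t):
    """Proximal operator of t * ||U||_1: shrink every entry toward zero by t."""
    return np.sign(U) * np.maximum(np.abs(U) - t, 0.0)

def sparse_update(U, grad_U, lr, l1_strength):
    """Gradient ascent step on the log-likelihood, then L1 shrinkage."""
    return soft_threshold(U + lr * grad_U, lr * l1_strength)
```

Entries of a reference vector whose magnitude stays below the threshold are set exactly to zero, which is what makes the stored model smaller.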

Comments

  1. There are errors in the equations. Consequence of not actually trying the algorithm out.

  2. Thanks for this article.

    There is an typo in equation 2: it should be summed from i = 1 to d, not N


  3. kinslover :
    Thanks for this article.
    There is an typo in equation 2: it should be summed from i = 1 to d, not N


    My bad... I misunderstood it...
