Incremental complexity support vector machine
One of the problems with using complex kernels with support vector machines is that they tend to produce classification boundaries that are odd, like the ones below.
(I generated them using a java SVM applet from here, whose reliability I cannot swear to, but have no reason to doubt.) Both SVM boundaries are with Gaussian RBF kernels: the first with $latex \sigma = 1$ and the second with $latex \sigma = 10$ on two different data sets.
Note the segments of the boundary to the east of the blue examples in the bottom figure, and those to the south and to the north-east of the blue examples in the top figure. They seem to violate intuition.
The reason for these anomalous boundaries is of course the large complexity of the function class induced by the RBF kernel with large $latex \sigma$, which gives the classifier a propensity to make subtle distinctions even in regions of somewhat low example density.
A possible solution: using complex kernels only where they are needed
We propose to build a cascaded classifier, which we will call Incremental Complexity SVM (ICSVM), as follows.
We are given a sequence of kernels $latex K_1, K_2,\ldots,K_m$ of increasing complexity. For example the sequence is of polynomial kernels, where $latex K_i$ is the polynomial kernel with degree $latex i$.
The learning algorithm first learns an SVM classifier $latex \psi_1$ with kernel $latex K_1$, that classifies a reasonable portion of the examples with a large margin $latex \lambda_1$. This can be accomplished by setting the SVM cost parameter $latex C$ to some low value.
Now all the examples outside the margin are thrown out, and another SVM classifier $latex \psi_2$ with kernel $latex K_2$ is learned, so that a reasonable portion of the remaining examples are classified with some large margin $latex \lambda_2$.
This procedure is continued until all the examples are classified outside the margin or the set of kernels is exhausted. The final classifier is a combination of all the classifiers $latex \psi_i$.
A test example can be classified as follows. We first apply classifier $latex \psi_1$ to the test example, and if it is classified with margin $latex \geq \lambda_1$, we output the assigned label and stop. If not we classify it with classifier $latex \psi_2$ in a similar fashion, and so on...
Such a scheme will avoid anomalous boundaries as those in the pictures above.
Discussion
1. With all the work that has been done on SVMs it is very likely that this idea or something very similar has been thought of, but I haven't come across it.
2. There is some work on kernel learning where a convex combination of kernels is learned but I think that is a different idea.
3. One nice thing about such a classification scheme is that at run-time it will expend less computational resources on easier examples and more on more difficult ones. As my thesis supervisor used to say, it is silly for most classifiers to insist on acting exactly the same way on both easy and hard cases.
4. The choices of the cost parameters $latex C$ for the SVMs is critical for the accuracy of the final classifier. Is there a way of formulating the choice of the parameters in terms of minimizing some overall upper bound on the generalization error from statistical learning theory?
5. Is there a one-shot SVM formulation with the set of kernels that exactly or approximately acts like our classifier?
6. The weird island-effect and what Ken calls the lava-lamp problem in the boundaries above are not just artifacts of SVMs. We would expect a sparse kernel logistic regression to behave similarly. It would be interesting to do a similar incremental kernel thing with other kernel-based classifiers.
(I generated them using a java SVM applet from here, whose reliability I cannot swear to, but have no reason to doubt.) Both SVM boundaries are with Gaussian RBF kernels: the first with $latex \sigma = 1$ and the second with $latex \sigma = 10$ on two different data sets.
Note the segments of the boundary to the east of the blue examples in the bottom figure, and those to the south and to the north-east of the blue examples in the top figure. They seem to violate intuition.
The reason for these anomalous boundaries is of course the large complexity of the function class induced by the RBF kernel with large $latex \sigma$, which gives the classifier a propensity to make subtle distinctions even in regions of somewhat low example density.
A possible solution: using complex kernels only where they are needed
We propose to build a cascaded classifier, which we will call Incremental Complexity SVM (ICSVM), as follows.
We are given a sequence of kernels $latex K_1, K_2,\ldots,K_m$ of increasing complexity. For example the sequence is of polynomial kernels, where $latex K_i$ is the polynomial kernel with degree $latex i$.
The learning algorithm first learns an SVM classifier $latex \psi_1$ with kernel $latex K_1$, that classifies a reasonable portion of the examples with a large margin $latex \lambda_1$. This can be accomplished by setting the SVM cost parameter $latex C$ to some low value.
Now all the examples outside the margin are thrown out, and another SVM classifier $latex \psi_2$ with kernel $latex K_2$ is learned, so that a reasonable portion of the remaining examples are classified with some large margin $latex \lambda_2$.
This procedure is continued until all the examples are classified outside the margin or the set of kernels is exhausted. The final classifier is a combination of all the classifiers $latex \psi_i$.
A test example can be classified as follows. We first apply classifier $latex \psi_1$ to the test example, and if it is classified with margin $latex \geq \lambda_1$, we output the assigned label and stop. If not we classify it with classifier $latex \psi_2$ in a similar fashion, and so on...
Such a scheme will avoid anomalous boundaries as those in the pictures above.
Discussion
1. With all the work that has been done on SVMs it is very likely that this idea or something very similar has been thought of, but I haven't come across it.
2. There is some work on kernel learning where a convex combination of kernels is learned but I think that is a different idea.
3. One nice thing about such a classification scheme is that at run-time it will expend less computational resources on easier examples and more on more difficult ones. As my thesis supervisor used to say, it is silly for most classifiers to insist on acting exactly the same way on both easy and hard cases.
4. The choices of the cost parameters $latex C$ for the SVMs is critical for the accuracy of the final classifier. Is there a way of formulating the choice of the parameters in terms of minimizing some overall upper bound on the generalization error from statistical learning theory?
5. Is there a one-shot SVM formulation with the set of kernels that exactly or approximately acts like our classifier?
6. The weird island-effect and what Ken calls the lava-lamp problem in the boundaries above are not just artifacts of SVMs. We would expect a sparse kernel logistic regression to behave similarly. It would be interesting to do a similar incremental kernel thing with other kernel-based classifiers.
I recently met Alexandros Kalousis at ECML 2009. Together with his group he was presenting this paper:
ReplyDeletehttp://cui.unige.ch/~kalousis/papers/complex/HyenEtAl_MKL_ECML2009.pdf
which could be of interest for you. As far as I remember it is about a MKL technique based on bounds together with an explicit look to the minimum size of the sphere that contains the training instances.
I wonder about other variations on this theme. For instance, consider the artificial data set generated by the following R code:
ReplyDeleteps <- data.frame(x=rnorm(2500), y=rnorm(2500))
isgreen <- ps$x <= -.5 | (ps$x = 0)
plot(ps, col=c("green","red")[1+isgreen])
http://limnus.com/~ken/Rplot.png
A perfect double-classifier for this set would be a linear classifier centered on the y-axis with margin 1, followed by a linear classifier centered on the x axis. However, it seems a bit difficult to me to make a learning procedure that would learn this.
Hmm, this blog must be HTML-aware. =) I meant something like this:
ReplyDeleteisgreen <- ps$x <= -.5 | (ps$x <= .5 & ps$y >= 0)
Ken, my procedure will have inductive bias like any other classifier, so will possibly end up with something other than the Bayes optimal classifier on your problem. But we hope some of the estimation error resulting from complex kernels will be reduced at very little increase in approximation error (at least for nice distributions, which are more common in real life).
ReplyDeleteYour point about the possibility of using a linear kernel for the second classifier is interesting. Perhaps if $latex K_1, K_2,\ldots$ etc. are all linear, and we just make a piece-wise linear approximation of the boundary, it would just as well as if we picked a set of kernels with increasing complexity.