
Showing posts with the label Classification

Regularized Minimax on Synthetic Data

First I would like to mention that, since my last post, I came across the 2005 paper on Robust Supervised Learning by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately rather than by type, but the details are essentially identical. Experiments on Synthetic Data I tried the algorithm on some synthetic data and a linear logistic regression model. The results are shown in the figures below. In both examples, there are examples from two classes (red and blue). Each class is drawn from a mixture of two normal distributions (i.e., there are two types per class). The types are shown as red squares and red circles, and blue diamonds and blue triangles. Class-conditionally the types have a skewed distribution: there are 9 times as many red squares as red circles, and 9 times as many blue diamonds as blue triangles. We would expect a plain logistic ...
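
A minimal sketch of this synthetic setup, together with the plain (unweighted) logistic regression baseline the excerpt refers to; the means, variances, and sample sizes below are illustrative assumptions, not the values from the original experiment:

```python
# Two classes, each a mixture of two Gaussian "types" with a 9:1 skew,
# followed by a plain linear logistic regression fit.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_class(mean_major, mean_minor, n=1000, skew=0.9):
    """Draw one class as a skewed mixture of two Gaussian types."""
    n_major = int(skew * n)
    major = rng.normal(mean_major, 1.0, size=(n_major, 2))
    minor = rng.normal(mean_minor, 1.0, size=(n - n_major, 2))
    return np.vstack([major, minor])

# Red class: squares (majority type) and circles (minority type);
# blue class: diamonds (majority type) and triangles (minority type).
X_red = sample_class(mean_major=[-2.0, 0.0], mean_minor=[0.0, 3.0])
X_blue = sample_class(mean_major=[2.0, 0.0], mean_minor=[0.0, -3.0])

X = np.vstack([X_red, X_blue])
y = np.concatenate([np.zeros(len(X_red)), np.ones(len(X_blue))])

# A plain logistic regression fit mostly caters to the majority types;
# the regularized minimax version instead reweights the types.
clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)
```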

Regularized Minimax for Robust Learning

This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e., learning that is robust to data drift. Cost Sensitive Loss Functions Given a training data set $latex D = \{x_i, y_i\}_{i=1,\ldots,N}$, most learning algorithms learn a classifier $latex \phi$, parametrized by a vector $latex w$, by minimizing a loss function that combines the per-example losses with a complexity penalty, where $latex l(x_i, y_i, w)$ is the loss on example $latex i$ and $latex f(w)$ is some function that penalizes complexity. For example, for logistic regression the loss takes this form with the logistic log-loss and a regularizer scaled by some $latex \lambda > 0$. If, in addition, the examples came with costs $latex c_i$ (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost-sensitive learning by over/under-sampling the training data or by minimizing a cost-weighted loss function (see this paper by Zadrozny et al.). We further constrain $latex \sum_{i=1}^{N} c_i...
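
The display equations are not reproduced in this excerpt; a plausible reconstruction under the usual conventions (assuming an $latex \ell_2$ complexity penalty for the logistic example) is

$latex L(w) = \sum_{i=1}^{N} l(x_i, y_i, w) + f(w)$

and, for logistic regression,

$latex L(w) = -\sum_{i=1}^{N} \log P(y_i|x_i, w) + \lambda \|w\|^2$

for some $latex \lambda > 0$. The cost-weighted variant then scales each per-example loss by its cost: $latex L_c(w) = \sum_{i=1}^{N} c_i\, l(x_i, y_i, w) + f(w)$.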

Sparse online kernel logistic regression

In a previous post, I talked about an idea for sparsifying kernel logistic regression by using random prototypes. I also showed how the prototypes themselves (as well as the kernel parameters) can be updated. (Update Apr 2010. Slides for a tutorial on this stuff.) (As a brief aside, I note that an essentially identical approach was used to sparsify Gaussian Process Regression by Snelson and Ghahramani. For GPR they use gradient ascent on the log-likelihood to learn the prototypes and labels, which is akin to learning the prototypes and betas for logistic regression. The set of prototypes and labels generated by their algorithm can be thought of as a pseudo training set.) I recently (with the help of my super-competent Java developer colleague Hiroko Bretz) implemented the sparse kernel logistic regression algorithm. The learning is done in an online fashion (i.e., using stochastic gradient descent). It seems to perform reasonably well on large datasets. Below I'll show its behav...
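
A minimal sketch of the online update for a binary problem with an RBF kernel; the prototype count, learning rate, and regularization constant are illustrative assumptions, and the updates to the prototypes and kernel parameters mentioned above are omitted for brevity:

```python
# Online (SGD) sparse kernel logistic regression with random prototypes.
import numpy as np

class SparseKernelLogisticRegression:
    def __init__(self, prototypes, gamma=1.0, lr=0.1, l2=1e-4):
        self.prototypes = prototypes               # (P, d) prototype matrix
        self.gamma = gamma                         # RBF kernel width parameter
        self.lr, self.l2 = lr, l2
        self.beta = np.zeros(len(prototypes) + 1)  # bias plus one beta per prototype

    def _features(self, x):
        # Map x to [1, k(x, p_1), ..., k(x, p_P)] with an RBF kernel.
        d2 = np.sum((self.prototypes - x) ** 2, axis=1)
        return np.concatenate([[1.0], np.exp(-self.gamma * d2)])

    def predict_proba(self, x):
        return 1.0 / (1.0 + np.exp(-(self._features(x) @ self.beta)))

    def partial_fit(self, x, y):
        # One stochastic gradient step on the L2-regularized log-loss.
        phi = self._features(x)
        err = y - self.predict_proba(x)
        self.beta += self.lr * (err * phi - self.l2 * self.beta)

# Usage with random prototypes and a random stream, just to show the API.
rng = np.random.default_rng(0)
model = SparseKernelLogisticRegression(prototypes=rng.normal(size=(20, 2)))
for x, y in zip(rng.normal(size=(1000, 2)), rng.integers(0, 2, size=1000)):
    model.partial_fit(x, y)
```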

Incremental complexity support vector machine

One of the problems with using complex kernels with support vector machines is that they tend to produce classification boundaries that are odd, like the ones below. (I generated them using a Java SVM applet from here, whose reliability I cannot swear to but have no reason to doubt.) Both SVM boundaries are with Gaussian RBF kernels: the first with $latex \sigma = 1$ and the second with $latex \sigma = 10$, on two different data sets. Note the segments of the boundary to the east of the blue examples in the bottom figure, and those to the south and to the north-east of the blue examples in the top figure; they seem to violate intuition. The reason for these anomalous boundaries is, of course, the large complexity of the function class induced by the RBF kernel with large $latex \sigma$, which gives the classifier a propensity to make subtle distinctions even in regions of somewhat low example density. A possible solution: using complex kernels only where they are needed We propose to b...
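
A quick way to play with this effect in scikit-learn rather than the applet; note that sklearn parametrizes the RBF kernel by $latex \gamma$ rather than $latex \sigma$, so the data and the kernel settings below are illustrative and not the applet's:

```python
# Fit RBF-kernel SVMs of increasing complexity on the same small data set.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1.0, 0.0], 0.7, size=(40, 2)),
               rng.normal([1.0, 0.0], 0.7, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

for gamma in (0.1, 10.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    # With a very local kernel the induced function class is rich, and the
    # boundary can develop counterintuitive segments in low-density regions.
    print(gamma, clf.n_support_)
```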

Training data bias caused by active learning

As opposed to the traditional supervised learning setting, where the labeled training data is generated (we hope) independently and identically, in active learning the learner is allowed to select the points for which labels are requested. Because it is often impossible to construct the equivalent real-world object from its feature values, almost universally, active learning is pool-based. That is, we start with a large pool of unlabeled data and the learner (usually sequentially) picks the objects from the pool for which the labels are requested. One unavoidable effect of active learning is that we end up with a biased training data set. If the true data distribution is $latex P(x,y)$, we have data drawn from some distribution $latex \hat{P}(x,y)$ (as always, $latex x$ is the feature vector and $latex y$ is the class label). We would like to correct for this bias so that it does not lead to learning an incorrect classifier. Furthermore, we want to use this biased data set to accurately evalu...
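
The excerpt stops before the correction itself, but the standard identity such corrections typically build on (a general fact, not necessarily the exact scheme the full post develops) is

$latex E_{P}[l(x, y, w)] = E_{\hat{P}}\left[\frac{P(x,y)}{\hat{P}(x,y)}\, l(x, y, w)\right]$,

so each labeled example can be reweighted by the ratio $latex P(x,y)/\hat{P}(x,y)$, both when training the classifier and when estimating its error, provided $latex \hat{P}$ puts mass wherever $latex P$ does.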

The redundancy of view-redundancy for co-training

Blum and Mitchell's co-training is a (very deservedly) popular semi-supervised learning algorithm that relies on class-conditional feature independence and view-redundancy (or view-agreement). I will argue that the view-redundancy assumption is unnecessary, and along the way show how surrogate learning can be plugged into co-training (which is not all that surprising, considering that both are multi-view semi-supervised algorithms that rely on class-conditional view-independence). I'll first explain co-training with an example. Co-training - The setup Consider a $latex y \in \{0,1\}$ classification problem on the feature space $latex \mathcal{X}=\mathcal{X}_1 \times \mathcal{X}_2$. I.e., a feature vector $latex x$ can be split into two as $latex x = [x_1, x_2]$. We make the rather restrictive assumption that $latex x_1$ and $latex x_2$ are class-conditionally independent for both classes. I.e., $latex P(x_1, x_2|y) = P(x_1|y) P(x_2|y)$ for $latex y \in ...
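
A bare-bones sketch of the Blum-Mitchell co-training loop for this two-view setup; the classifier choice, confidence criterion, and number of examples added per round are illustrative assumptions:

```python
# Each view's classifier pseudo-labels the unlabeled examples it is most
# confident about, and those examples are added to the shared labeled pool.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, per_round=5):
    clf1 = clf2 = None
    for _ in range(rounds):
        clf1 = GaussianNB().fit(X1_l, y_l)   # view-1 classifier
        clf2 = GaussianNB().fit(X2_l, y_l)   # view-2 classifier
        if len(X1_u) == 0:
            break
        new_idx, new_lab = [], []
        for clf, X_u in ((clf1, X1_u), (clf2, X2_u)):
            proba = clf.predict_proba(X_u)
            picks = np.argsort(-proba.max(axis=1))[:per_round]  # most confident
            new_idx.extend(picks.tolist())
            new_lab.extend(clf.classes_[proba.argmax(axis=1)[picks]].tolist())
        # (Duplicate picks across the two views are not deduplicated in this sketch.)
        new_idx = np.array(new_idx)
        X1_l = np.vstack([X1_l, X1_u[new_idx]])
        X2_l = np.vstack([X2_l, X2_u[new_idx]])
        y_l = np.concatenate([y_l, np.array(new_lab)])
        mask = np.ones(len(X1_u), dtype=bool)
        mask[new_idx] = False                 # drop them from the unlabeled pool
        X1_u, X2_u = X1_u[mask], X2_u[mask]
    return clf1, clf2
```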

An effective kernelization of logistic regression

I will present a sparse kernelization of logistic regression where the prototypes are not necessarily from the training data. Traditional sparse kernel logistic regression Consider a multiclass logistic regression model over classes $latex y = 0,1,\ldots,M$, given by $latex P(y|x)\propto\mbox{exp}(\beta_{y0} + \sum_{j=1}^{d}\beta_{yj}x_j)$, where $latex j$ indexes the $latex d$ features. Fitting the model to a data set $latex D = \{x_i, y_i\}_{i=1,\ldots,N}$ involves estimating the betas to maximize the likelihood of $latex D$. The above logistic regression model is quite simple (because the classifier is a linear function of the features of the example), and in some circumstances we might want a classifier that can produce a more complex decision boundary. One way to achieve this is by kernelization. We write $latex P(y|x) \propto \mbox{exp}(\beta_{y0} + \sum_{i=1}^N \beta_{yi} k(x,x_i))$ for $latex y=0,1,\ldots,M$, where $latex k(.,.)$ is a kernel function. In order to be able to use this c...
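
One way to see the kernelized model in code: map each example to its vector of kernel evaluations against a prototype set and fit an ordinary multinomial logistic regression on top. Using the full training set as the prototype set gives the traditional kernelization above; the data, kernel width, and prototype count below are illustrative assumptions.

```python
# Kernel logistic regression via an explicit kernel feature map.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

def kernel_logistic_fit(X, y, prototypes, gamma=1.0):
    K = rbf_kernel(X, prototypes, gamma=gamma)   # N x P matrix of k(x_i, p_j)
    return LogisticRegression(max_iter=1000).fit(K, y)

def kernel_logistic_predict(clf, X, prototypes, gamma=1.0):
    return clf.predict(rbf_kernel(X, prototypes, gamma=gamma))

# Illustrative usage with random data and randomly chosen prototypes.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.integers(0, 3, size=200)
protos = X[rng.choice(len(X), size=10, replace=False)]
clf = kernel_logistic_fit(X, y, protos)
print(kernel_logistic_predict(clf, X[:5], protos))
```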

A surrogate learning mystery

I'll present an application of the surrogate learning idea in the previous post. It is mildly surprising at first blush, which I'll contrive to make more mysterious. Murders she induced For readers who appreciate this sort of thing, here's the grandmother of all murder mysteries. In this one, Miss Marple solves a whole bunch of unrelated murders all at once. Miss Marple, having finally grown wise to the unreliability of feminine intuition in solving murder mysteries, spent some time learning statistics. She then convinced one of her flat-footed friends at Scotland Yard to give her the files on all the unsolved murders that were just sitting around gathering dust, waiting for someone with her imagination and statistical wiles. She came home with the massive pile of papers and sat down to study them. The file on each murder carefully listed all the suspects, with their possible motives, their accessibility to the murder weapon, psychological characteristics, previous ...

Surrogate learning with mean independence

In this paper we showed that if we had a feature $latex x_1$ that was class-conditionally statistically independent of the rest of the features, denoted $latex x_2$, learning a classifier between the two classes $latex y=0$ and $latex y=1$ can be transformed into learning a predictor of $latex x_1$ from $latex x_2$ and another of $latex y$ from $latex x_1$. Since the first predictor can be learned on unlabeled examples and the second is a classifier on a 1-D space, the learning problem becomes easy. In a sense, $latex x_1$ acts as a surrogate for $latex y$. Similar ideas can be found in Ando and Zhang '07, Quadrianto et al. '08, Blitzer et al. '06, and others. Derivation from mean-independence I'll now derive a similar surrogate learning algorithm from mean independence rather than full statistical independence. Recall that the random variable $latex U$ is mean-independent of the r.v. $latex V$ if $latex E[U|V] = E[U]$. Albeit weaker than full independence, mean...
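
A sketch of where the mean-independence assumption plausibly leads (my reconstruction, not necessarily the post's exact steps): if $latex E[x_1|y, x_2] = E[x_1|y]$ for both classes, the tower property gives

$latex E[x_1|x_2] = P(y=0|x_2)\, E[x_1|y=0] + P(y=1|x_2)\, E[x_1|y=1]$

and, provided the two class-conditional means differ, solving for the posterior yields

$latex P(y=1|x_2) = \frac{E[x_1|x_2] - E[x_1|y=0]}{E[x_1|y=1] - E[x_1|y=0]}$.

So a regressor of $latex x_1$ on $latex x_2$ (learnable from unlabeled data) plus the two class-conditional means of $latex x_1$ is enough to recover the classifier.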

Online logistic regression

I like Bob Carpenter's paper -- "Lazy sparse online logistic regression ...". In addition to being a nice overview of logistic regression, it describes online training for logistic regression by stochastic gradient descent under various parameter priors. Another cool feature is that if the feature dimensionality is large but the examples are sparse, only the parameters corresponding to the features that are non-zero (for the current example) need to be updated (this is the lazy part). It is super easy to implement (a few hundred lines in C, for an svm_light-like stand-alone application) and trains very fast, as attested to by Leon Bottou. There is one issue about the regularization discount in a truly online setting where there is no "end of epoch", which was discussed by Carpenter. He suggests leaving it at a constant, which, as he points out, corresponds to steadily decreasing the variance of the prior with the number of examples. In my implementation I u...
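
A minimal sketch of the lazy, sparse update for a binary model with an L2 (Gaussian prior) penalty: the regularization shrinkage for a weight is only applied, retroactively, when its feature next appears in an example. The learning rate, penalty, and catch-up rule are illustrative assumptions, not Carpenter's exact scheme.

```python
# Lazy sparse online logistic regression (binary) via SGD.
import math
from collections import defaultdict

class LazySparseLogisticRegression:
    def __init__(self, lr=0.1, l2=1e-4):
        self.lr, self.l2 = lr, l2
        self.w = defaultdict(float)            # feature id -> weight
        self.last_update = defaultdict(int)    # feature id -> step of last update
        self.t = 0                             # global example counter

    def _catch_up(self, f):
        # Apply the shrinkage this weight "missed" while its feature was absent.
        skipped = self.t - self.last_update[f]
        if skipped > 0:
            self.w[f] *= (1.0 - self.lr * self.l2) ** skipped
        self.last_update[f] = self.t

    def update(self, x, y):
        """x: dict {feature id: value} for a sparse example; y in {0, 1}."""
        self.t += 1
        for f in x:
            self._catch_up(f)
        z = sum(self.w[f] * v for f, v in x.items())
        p = 1.0 / (1.0 + math.exp(-z))
        for f, v in x.items():
            self.w[f] += self.lr * (y - p) * v   # gradient step on the log-loss

# Usage with hypothetical sparse examples.
model = LazySparseLogisticRegression()
model.update({"word:offer": 1.0, "bias": 1.0}, 1)
model.update({"word:hello": 1.0, "bias": 1.0}, 0)
```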