
Showing posts from August, 2009

The redundancy of view-redundancy for co-training

Blum and Mitchell's co-training is a (very deservedly) popular semi-supervised learning algorithm that relies on class-conditional feature independence and view-redundancy (or view-agreement). I will argue that the view-redundancy assumption is unnecessary, and along the way show how surrogate learning can be plugged into co-training (which is not all that surprising, considering that both are multi-view semi-supervised algorithms that rely on class-conditional view-independence). I'll first explain co-training with an example.

Co-training - The setup

Consider a $latex y \in \{0,1\}$ classification problem on the feature space $latex \mathcal{X}=\mathcal{X}_1 \times \mathcal{X}_2$. I.e., a feature vector $latex x$ can be split into two as $latex x = [x_1, x_2]$. We make the rather restrictive assumption that $latex x_1$ and $latex x_2$ are class-conditionally independent for both classes, i.e., $latex P(x_1, x_2|y) = P(x_1|y) P(x_2|y)$ for $latex y \in \{0,1\}$.
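For concreteness, here is a minimal sketch of a co-training loop (a simplified variant, not Blum and Mitchell's exact procedure; the two-view arrays, the scikit-learn naive Bayes classifiers, and the confidence-based selection rule are all my own illustrative choices):

    # Simplified co-training sketch: one classifier per view, each round the
    # most confidently labeled pool examples are promoted to the labeled set.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def cotrain(X1_lab, X2_lab, y_lab, X1_pool, X2_pool, rounds=10, per_round=5):
        # X1_*/X2_* are the same examples seen through view 1 and view 2.
        for _ in range(rounds):
            c1 = GaussianNB().fit(X1_lab, y_lab)
            c2 = GaussianNB().fit(X2_lab, y_lab)
            if len(X1_pool) == 0:
                break
            # Each view scores the unlabeled pool; keep the most confident ones.
            p1 = c1.predict_proba(X1_pool).max(axis=1)
            p2 = c2.predict_proba(X2_pool).max(axis=1)
            pick = np.argsort(-np.maximum(p1, p2))[:per_round]
            y_new = np.where(p1[pick] >= p2[pick],
                             c1.predict(X1_pool[pick]),
                             c2.predict(X2_pool[pick]))
            # Promote the newly labeled examples and shrink the pool.
            X1_lab = np.vstack([X1_lab, X1_pool[pick]])
            X2_lab = np.vstack([X2_lab, X2_pool[pick]])
            y_lab = np.concatenate([y_lab, y_new])
            keep = np.setdiff1d(np.arange(len(X1_pool)), pick)
            X1_pool, X2_pool = X1_pool[keep], X2_pool[keep]
        return c1, c2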

An effective kernelization of logistic regression

I will present a sparse kernelization of logistic regression where the prototypes are not necessarily from the training data.

Traditional sparse kernel logistic regression

Consider a logistic regression model over classes $latex y = 0, 1, \ldots, M$ given by $latex P(y|x)\propto\mbox{exp}(\beta_{y0} + \sum_{j=1}^{d}\beta_{yj}x_j)$, where $latex j$ indexes the $latex d$ features. Fitting the model to a data set $latex D = \{x_i, y_i\}_{i=1,\ldots,N}$ involves estimating the betas to maximize the likelihood of $latex D$. The above logistic regression model is quite simple (because the classifier is a linear function of the features of the example), and in some circumstances we might want a classifier that can produce a more complex decision boundary. One way to achieve this is by kernelization. We write $latex P(y|x) \propto \mbox{exp}(\beta_{y0} + \sum_{i=1}^N \beta_{yi} k(x,x_i))$ for $latex y=0,1,\ldots,M$, where $latex k(.,.)$ is a kernel function. In order to be able to use this c
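To make the prototype idea concrete, here is one minimal sketch under my own assumptions (RBF kernel, prototypes chosen by k-means rather than taken from the training set, scikit-learn for the fitting); it illustrates the flavor of the construction, not the specific method of the post:

    # Sparse kernel logistic regression sketch: the expansion uses a small set
    # of prototypes (k-means centroids), so it has n_prototypes terms rather
    # than N, and the prototypes need not be training points.
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics.pairwise import rbf_kernel

    def fit_prototype_klr(X, y, n_prototypes=20, gamma=1.0):
        prototypes = KMeans(n_clusters=n_prototypes, n_init=10).fit(X).cluster_centers_
        K = rbf_kernel(X, prototypes, gamma=gamma)   # N x n_prototypes kernel features
        clf = LogisticRegression(max_iter=1000).fit(K, y)
        return prototypes, clf

    def predict_proba_klr(prototypes, clf, X_new, gamma=1.0):
        return clf.predict_proba(rbf_kernel(X_new, prototypes, gamma=gamma))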

A surrogate learning mystery

I'll present an application of the surrogate learning idea from the previous post. It is mildly surprising at first blush, which I'll contrive to make more mysterious.

Murders she induced

For readers who appreciate this sort of thing, here's the grandmother of all murder mysteries. In this one Miss Marple solves a whole bunch of unrelated murders all at once. Miss Marple, having finally grown wise to the unreliability of feminine intuition in solving murder mysteries, spent some time learning statistics. She then convinced one of her flat-footed friends at Scotland Yard to give her the files on all the unsolved murders that were just sitting around gathering dust, waiting for someone with her imagination and statistical wiles. She came home with the massive pile of papers and sat down to study them. The file on each murder carefully listed all the suspects, with their possible motives, their access to the murder weapon, psychological characteristics, previous

Surrogate learning with mean independence

In this paper we showed that if we have a feature $latex x_1$ that is class-conditionally statistically independent of the rest of the features, denoted $latex x_2$, then learning a classifier between the two classes $latex y=0$ and $latex y=1$ can be transformed into learning a predictor of $latex x_1$ from $latex x_2$ and another of $latex y$ from $latex x_1$. Since the first predictor can be learned on unlabeled examples and the second is a classifier on a 1-D space, the learning problem becomes easy. In a sense, $latex x_1$ acts as a surrogate for $latex y$. Similar ideas can be found in Ando and Zhang '07, Quadrianto et al. '08, Blitzer et al. '06, and others.

Derivation from mean-independence

I'll now derive a similar surrogate learning algorithm from mean independence rather than full statistical independence. Recall that the random variable $latex U$ is mean-independent of the r.v. $latex V$ if $latex E[U|V] = E[U]$. Albeit weaker than full independence, mean
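One way mean independence can be exploited (my own illustration, not necessarily the derivation in the post) is via the assumption $latex E[x_1|x_2,y] = E[x_1|y]$, which gives $latex E[x_1|x_2] = P(y=1|x_2)E[x_1|y=1] + P(y=0|x_2)E[x_1|y=0]$ and can be solved for $latex P(y=1|x_2)$. The sketch below regresses $latex x_1$ on $latex x_2$ using unlabeled data and estimates the two class-conditional means of $latex x_1$ from a small labeled sample:

    # Surrogate-learning sketch under mean independence: E[x1 | x2, y] = E[x1 | y].
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def surrogate_posterior(x1_unlab, X2_unlab, x1_lab, y_lab, X2_test):
        # Step 1: learn E[x1 | x2] from unlabeled data (any regressor would do).
        reg = LinearRegression().fit(X2_unlab, x1_unlab)
        # Step 2: class-conditional means of the surrogate x1 from labeled data.
        m0 = x1_lab[y_lab == 0].mean()
        m1 = x1_lab[y_lab == 1].mean()
        # Step 3: E[x1|x2] = P(y=1|x2)*m1 + P(y=0|x2)*m0  =>  solve for P(y=1|x2).
        p = (reg.predict(X2_test) - m0) / (m1 - m0)
        return np.clip(p, 0.0, 1.0)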

Online logistic regression

I like Bob Carpenter's paper -- "Lazy sparse online logistic regression ...". In addition to being a nice overview of logistic regression, it describes online training for logistic regression by stochastic gradient descent under various parameter priors. Another cool feature is that if the feature dimensionality is large but the examples are sparse, only the parameters corresponding to the features that are non-zero for the current example need to be updated (this is the lazy part). It is super easy to implement (a few hundred lines in C for an svm_light-like stand-alone application) and trains very fast, as attested to by Leon Bottou. There is one issue with the regularization discount in a truly online setting where there is no "end of epoch", which Carpenter discusses. He suggests leaving it at a constant, which, as he points out, corresponds to steadily decreasing the variance of the prior with the number of examples. In my implementation I u
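As a minimal sketch of the lazy idea (binary case, L2/Gaussian prior, fixed learning rate; my own condensation, not Carpenter's code or his exact update), the regularization for a weight is applied only when its feature next appears in an example:

    # Lazy sparse SGD for binary logistic regression: each update touches only
    # the non-zero features of the current example; the L2 shrinkage a weight
    # "missed" while its feature was zero is applied when the feature reappears.
    import math
    from collections import defaultdict

    class LazyLogReg:
        def __init__(self, rate=0.1, l2=1e-4):
            self.rate, self.l2 = rate, l2
            self.w = defaultdict(float)         # weight per feature id
            self.last_seen = defaultdict(int)   # example count at last update
            self.t = 0                          # examples processed so far

        def _catch_up(self, f):
            # Apply the shrinkage this weight missed since it was last touched.
            skipped = self.t - self.last_seen[f]
            if skipped > 0:
                self.w[f] *= (1.0 - self.rate * self.l2) ** skipped
            self.last_seen[f] = self.t

        def predict(self, x):                   # x: dict of feature id -> value
            for f in x:
                self._catch_up(f)
            z = sum(self.w[f] * v for f, v in x.items())
            z = max(min(z, 35.0), -35.0)        # keep exp() well-behaved
            return 1.0 / (1.0 + math.exp(-z))

        def update(self, x, y):                 # y in {0, 1}
            p = self.predict(x)                 # also catches up touched weights
            self.t += 1
            for f, v in x.items():
                self.w[f] -= self.rate * self.l2 * self.w[f]   # this example's penalty
                self.w[f] += self.rate * (y - p) * v           # data gradient
                self.last_seen[f] = self.t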

Estimation of a distribution from i.i.d. sums

Here's an estimation problem that I ran into not long ago while working on entity co-reference resolution in natural language documents. Let $latex X$ be a random variable taking on values in $latex \{0,1,2,\ldots\}$. We are given data $latex D=\{(N_1,S_1), (N_2,S_2),\ldots,(N_k,S_k)\}$, where $latex S_i$ is the sum of $latex N_i$ independent draws of $latex X$, for $latex i = 1, 2,\ldots, k$. We are required to estimate the distribution of $latex X$ from $latex D$. For some distributions of $latex X$ we can use the method of moments. For example, if $latex X \sim \mbox{Poisson}(\lambda)$, we know that the mean of $latex X$ is $latex \lambda$. We can therefore estimate $latex \lambda$ as the sample mean, i.e., $latex \hat{\lambda}=\frac{S_1+S_2+\ldots+S_k}{N_1+N_2+\ldots+N_k}$. Because of the nice additive property of the parameters for sums of i.i.d. Poisson random variables, the maximum likelihood estimate also turns out to be the same as $latex \hat{\lambda}$. The probl
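The Poisson estimate above is a one-liner in code; here's a small sketch (the numbers in the usage example are made up):

    # Method-of-moments (and ML) estimate of a Poisson rate from (N_i, S_i)
    # pairs, where S_i is the sum of N_i i.i.d. draws.
    def poisson_lambda_hat(data):
        # data: iterable of (N_i, S_i) pairs
        total_draws = sum(n for n, _ in data)
        total_sum = sum(s for _, s in data)
        return total_sum / total_draws

    # Made-up example: three batches of 10, 5 and 20 draws.
    print(poisson_lambda_hat([(10, 27), (5, 12), (20, 55)]))   # -> about 2.686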

Probability density estimation as classification

Perhaps it has always been obvious to exalted statistical minds that density estimation can be viewed as classification and (perhaps) done using a classifier. Assume that we have samples $latex \{x_i\}_{i=1,\ldots,N}$ of a random vector $latex X$ whose distribution has bounded support. In fact, without loss of generality, let the support be the unit hypercube $latex [0,1]^d$. We are required to estimate $latex P_X(x)$, the density of $latex X$. Now assume that we generate a bunch of samples $latex \{z_i\}_{i=1,\ldots,M}$ uniformly distributed in $latex [0,1]^d$. We assign a label $latex y = 1$ to all the samples $latex x_i$ and a label $latex y = 0$ to all the $latex z_i$, and build a classifier $latex \psi$ between the two sample sets. In other words, we construct an estimate $latex P_\psi(y=1|x)$ of the posterior class probability $latex P(y=1|x)$ $latex \forall x \in [0,1]^d$. Now, we know that $latex P(y=1|x) = \frac{N P_X(x)}{N P_X(x) + M U(x)}$, where $latex U(x) = 1$ is the uniform density over the unit hypercube. The above equation c
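A minimal sketch of this recipe (my own illustration; the choice of classifier is arbitrary, and it assumes the predicted probabilities are reasonably calibrated): label the real samples $latex y=1$ and $latex M$ uniform samples $latex y=0$, train, and invert the posterior to get $latex \hat{P}_X(x) = \frac{M}{N}\,\frac{P_\psi(y=1|x)}{1-P_\psi(y=1|x)}$, using $latex U(x)=1$.

    # Density estimation via classification: real samples get y=1, uniform
    # samples get y=0; inverting P(y=1|x) = N p(x) / (N p(x) + M) recovers p(x).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def density_estimate(X, X_query, M=None, seed=0):
        # X: N x d samples of the target density, assumed to lie in [0, 1]^d.
        N, d = X.shape
        M = M if M is not None else N
        rng = np.random.default_rng(seed)
        Z = rng.uniform(size=(M, d))                  # uniform "background" sample
        data = np.vstack([X, Z])
        labels = np.concatenate([np.ones(N), np.zeros(M)])
        clf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(data, labels)
        p1 = np.clip(clf.predict_proba(X_query)[:, 1], 1e-6, 1.0 - 1e-6)
        return (M / N) * p1 / (1.0 - p1)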