Posts

Showing posts from August, 2009

The redundancy of view-redundancy for co-training

Blum and Mitchell's co-training is a (very deservedly) popular semi-supervised learning algorithm that relies on class-conditional feature independence and view-redundancy (or view-agreement). I will argue that the view-redundancy assumption is unnecessary, and along the way show how surrogate learning can be plugged into co-training (which is not all that surprising, considering that both are multi-view semi-supervised algorithms that rely on class-conditional view-independence). I'll first explain co-training with an example. Co-training - The setup Consider a $latex y \in \{0,1\}$ classification problem on the feature space $latex X = X_1 \times X_2$. I.e., a feature vector $latex x$ can be split into two as $latex x = [x_1, x_2]$. We make the rather restrictive assumption that $latex x_1$ and $latex x_2$ are class-conditionally independent for both classes, i.e., $latex P(x_1, x_2 | y) = P(x_1 | y) P(x_2 | y)$ for $latex y \in ...
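For concreteness, here is a minimal sketch of a standard co-training loop under the two-view setup above. This is a generic Blum-Mitchell-style procedure, not the argument of the post; the classifier choice, function names, and confidence heuristic are all illustrative.

```python
# Generic co-training sketch (assumed, illustrative): two views, two classifiers,
# each round the most confidently predicted unlabeled examples are pseudo-labeled.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=10, per_round=5):
    clf1, clf2 = GaussianNB(), GaussianNB()
    X1_l, X2_l, y_l = X1_lab.copy(), X2_lab.copy(), y_lab.copy()
    X1_u, X2_u = X1_unlab.copy(), X2_unlab.copy()
    for _ in range(rounds):
        if len(X1_u) == 0:
            break
        clf1.fit(X1_l, y_l)
        clf2.fit(X2_l, y_l)
        # Confidence of each view's classifier on the unlabeled pool.
        p1 = clf1.predict_proba(X1_u).max(axis=1)
        p2 = clf2.predict_proba(X2_u).max(axis=1)
        pick = np.argsort(-np.maximum(p1, p2))[:per_round]
        # The more confident view supplies the pseudo-label for each picked example.
        labels = np.where(p1[pick] >= p2[pick],
                          clf1.predict(X1_u[pick]),
                          clf2.predict(X2_u[pick]))
        X1_l = np.vstack([X1_l, X1_u[pick]])
        X2_l = np.vstack([X2_l, X2_u[pick]])
        y_l = np.concatenate([y_l, labels])
        keep = np.setdiff1d(np.arange(len(X1_u)), pick)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return clf1, clf2
```

Each classifier sees only its own view; the class-conditional independence assumption is what makes one view's confidently self-labeled examples informative, rather than redundant, for the other.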

An effective kernelization of logistic regression

I will present a sparse kernelization of logistic regression where the prototypes are not necessarily from the training data. Traditional sparse kernel logistic regression Consider an $latex M$-class logistic regression model given by $latex P(y|x) \propto \exp\left(\beta_{y0} + \sum_{j=1}^{d} \beta_{yj} x_j\right)$ for $latex y = 0, 1, \ldots, M$, where $latex j$ indexes the $latex d$ features. Fitting the model to a data set $latex D = \{x_i, y_i\}_{i=1,\ldots,N}$ involves estimating the betas to maximize the likelihood of $latex D$. The above logistic regression model is quite simple (because the classifier is a linear function of the features of the example), and in some circumstances we might want a classifier that can produce a more complex decision boundary. One way to achieve this is by kernelization. We write $latex P(y|x) \propto \exp\left(\beta_{y0} + \sum_{i=1}^{N} \beta_{yi} k(x, x_i)\right)$ for $latex y = 0, 1, \ldots, M$, where $latex k(\cdot, \cdot)$ is a kernel function. In order to be able to use this c...
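As a concrete (hypothetical) illustration of the traditional route, here is a sketch that expands each example into the kernel features $latex k(x, x_i)$ against all $latex N$ training points and fits an L1-penalized logistic regression on top, so that sparsity in the betas selects prototypes from the training data. The scikit-learn classes and parameter choices are assumptions, not the post's implementation.

```python
# Sketch: kernel logistic regression via explicit kernel features k(x, x_i)
# over the training points, with an L1 penalty for sparsity in the betas.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

class KernelLogisticRegression:
    def __init__(self, gamma=1.0, C=1.0):
        self.gamma, self.C = gamma, C

    def fit(self, X, y):
        self.prototypes_ = X                      # every training point is a candidate prototype
        K = rbf_kernel(X, self.prototypes_, gamma=self.gamma)
        # L1 penalty drives many beta_{yi} to exactly zero -> a sparse set of prototypes.
        self.lr_ = LogisticRegression(penalty="l1", solver="saga",
                                      C=self.C, max_iter=5000)
        self.lr_.fit(K, y)
        return self

    def predict_proba(self, X):
        K = rbf_kernel(X, self.prototypes_, gamma=self.gamma)
        return self.lr_.predict_proba(K)
```

The post's opening sentence is exactly about relaxing the last design choice above: the prototypes need not be training points at all.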

A surrogate learning mystery

I'll present an application of the surrogate learning idea from the previous post. It is mildly surprising at first blush, and I'll contrive to make it more mysterious. Murders she induced For readers who appreciate this sort of thing, here's the grandmother of all murder mysteries. In this one, Miss Marple solves a whole bunch of unrelated murders all at once. Miss Marple, having finally grown wise to the unreliability of feminine intuition in solving murder mysteries, spent some time learning statistics. She then convinced one of her flat-footed friends at Scotland Yard to give her the files on all the unsolved murders that were just sitting around gathering dust, waiting for someone with her imagination and statistical wiles. She came home with the massive pile of papers and sat down to study them. The file on each murder carefully listed all the suspects, with their possible motives, their accessibility to the murder weapon, psychological characteristics, previous ...

Surrogate learning with mean independence

In this paper we showed that if we had a feature $latex x_1$ that was class-conditionally statistically independent of the rest of the features, denoted $latex x_2$, learning a classifier between the two classes $latex y=0$ and $latex y=1$ can be transformed into learning a predictor of $latex x_1$ from $latex x_2$ and another of $latex y$ from $latex x_1$. Since the first predictor can be learned on unlabeled examples and the second is a classifier on a 1-D space, the learning problem becomes easy. In a sense, $latex x_1$ acts as a surrogate for $latex y$. Similar ideas can be found in Ando and Zhang '07, Quadrianto et al. '08, Blitzer et al. '06, and others. Derivation from mean-independence I'll now derive a similar surrogate learning algorithm from mean independence rather than full statistical independence. Recall that the random variable $latex U$ is mean-independent of the r.v. $latex V$ if $latex E[U|V] = E[U]$. Albeit weaker than full independence, mean...
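To make the surrogate idea concrete, here is the kind of identity such a derivation yields; this is a sketch assuming $latex x_1$ is scalar and the class-conditional means $latex E[x_1|y=0] \neq E[x_1|y=1]$ are known or separately estimable. If $latex x_1$ is class-conditionally mean-independent of $latex x_2$, i.e., $latex E[x_1|y, x_2] = E[x_1|y]$, then $latex E[x_1|x_2] = \sum_y E[x_1|y, x_2]\, P(y|x_2) = E[x_1|y=0]\, P(y=0|x_2) + E[x_1|y=1]\, P(y=1|x_2)$, and solving for the posterior gives $latex P(y=1|x_2) = \frac{E[x_1|x_2] - E[x_1|y=0]}{E[x_1|y=1] - E[x_1|y=0]}$. The regressor $latex E[x_1|x_2]$ can be fit on unlabeled data, which is the sense in which $latex x_1$ stands in for $latex y$.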

Online logistic regression

I like Bob Carpenter's paper -- "Lazy sparse online logistic regression ...". In addition to being a nice overview of logistic regression, it describes online training for logistic regression by stochastic gradient descent under various parameter priors. Another cool feature is that if the feature dimensionality is large but the examples are sparse, only the parameters corresponding to the features that are non-zero for the current example need to be updated (this is the lazy part). It is super easy to implement (a few hundred lines in C for an svm_light-like stand-alone application) and trains very fast, as attested to by Leon Bottou. There is one issue with the regularization discount in a truly online setting where there is no "end of epoch", which Carpenter discusses. He suggests leaving it at a constant, which, as he points out, corresponds to steadily decreasing the variance of the prior with the number of examples. In my implementation I u...
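For the flavor of the lazy update, here is a minimal sketch of online (SGD) training for binary logistic regression with a Gaussian (L2) prior, where a feature's regularization shrinkage is applied only when that feature next occurs in an example. This is the general pattern rather than Carpenter's exact algorithm, and all names and constants are illustrative.

```python
# Lazy, sparse online logistic regression sketch (assumed, not the paper's code).
import math
from collections import defaultdict

class LazyOnlineLogReg:
    def __init__(self, eta=0.1, lam=1e-4):
        self.eta, self.lam = eta, lam      # learning rate, regularization strength
        self.w = defaultdict(float)        # weights, keyed by feature id
        self.last_seen = defaultdict(int)  # example count at last update per feature
        self.t = 0                         # number of examples seen so far

    def _lazy_regularize(self, j):
        # Apply the L2 shrinkage this feature "missed" while it was absent.
        skipped = self.t - self.last_seen[j]
        if skipped > 0:
            self.w[j] *= (1.0 - self.eta * self.lam) ** skipped
        self.last_seen[j] = self.t

    def update(self, x, y):
        """x: dict {feature_id: value} for one sparse example; y in {0, 1}."""
        for j in x:
            self._lazy_regularize(j)
        z = sum(self.w[j] * v for j, v in x.items())
        p = 1.0 / (1.0 + math.exp(-z))
        # Gradient step only on the features present in this example.
        for j, v in x.items():
            self.w[j] += self.eta * (y - p) * v
        self.t += 1
        for j in x:
            self.last_seen[j] = self.t
```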

Estimation of a distribution from i.i.d. sums

Here's an estimation problem that I ran into not long ago while working on a problem in entity co-reference resolution in natural language documents. Let $latex X$ be a random variable taking on values in $latex \{0, 1, 2, \ldots\}$. We are given data $latex D = \{(N_1, S_1), (N_2, S_2), \ldots, (N_k, S_k)\}$, where $latex S_i$ is the sum of $latex N_i$ independent draws of $latex X$, for $latex i = 1, 2, \ldots, k$. We are required to estimate the distribution of $latex X$ from $latex D$. For some distributions of $latex X$ we can use the method of moments. For example, if $latex X \sim \mathrm{Poisson}(\lambda)$, we know that the mean of $latex X$ is $latex \lambda$. We can therefore estimate $latex \lambda$ by the sample mean, i.e., $latex \hat{\lambda} = \frac{S_1 + S_2 + \cdots + S_k}{N_1 + N_2 + \cdots + N_k}$. Because of the nice additive property of the parameters for sums of i.i.d. Poisson random variables, the maximum likelihood estimate also turns out to be the same as $latex \hat{\lambda}$. The probl...
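The Poisson case is simple enough to state as code; this tiny sketch just computes $latex \hat{\lambda}$ from the $latex (N_i, S_i)$ pairs (the numbers in the example call are made up).

```python
# Moment / maximum-likelihood estimate of lambda for the Poisson case:
# a sum of N_i i.i.d. Poisson(lambda) draws is Poisson(N_i * lambda),
# so lambda_hat = sum(S_i) / sum(N_i).
def poisson_lambda_hat(data):
    """data: iterable of (N_i, S_i) pairs."""
    total_S = sum(S for _, S in data)
    total_N = sum(N for N, _ in data)
    return total_S / total_N

# Example with made-up counts: three batches of draws.
print(poisson_lambda_hat([(10, 23), (5, 9), (20, 41)]))  # about 2.09
```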

Probability density estimation as classification

Perhaps it has always been obvious to exalted statistical minds that density estimation can be viewed as classification and (perhaps) done using a classifier. Assume that we have samples $latex \{x_i\}_{i=1,\ldots,N}$ of a random vector $latex X$ whose distribution has bounded support. In fact, without loss of generality, let the support be the unit hypercube $latex [0,1]^d$. We are required to estimate $latex P_X(x)$, the density of $latex X$. Now assume that we generate a bunch of samples $latex \{z_i\}_{i=1,\ldots,M}$ uniformly distributed in $latex [0,1]^d$. We assign a label $latex y=1$ to all the samples $latex x_i$ and a label $latex y=0$ to all the $latex z_i$, and build a classifier $latex \psi$ between the two sample sets. In other words, we construct an estimate $latex P_\psi(y=1|x)$ of the posterior class probability $latex P(y=1|x)$ for all $latex x \in [0,1]^d$. Now, we know that $latex P(y=1|x) = \frac{N P_X(x)}{N P_X(x) + M U(x)}$, where $latex U(x)=1$ is the uniform density over the unit hypercube. The above equation c...
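Inverting that relation gives $latex \hat{P}_X(x) = \frac{M}{N}\,\frac{P_\psi(y=1|x)}{1 - P_\psi(y=1|x)}$ on the hypercube (using $latex U(x)=1$), which suggests a sketch like the following. The choice of classifier and all names here are illustrative assumptions, not the post's recipe.

```python
# Sketch of density estimation by classification: fit a probabilistic classifier
# between the data sample (y=1) and uniform draws on [0,1]^d (y=0), then convert
# its posterior into a density estimate via p_hat(x) = (M/N) * p / (1 - p).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def density_via_classifier(X, M=None, seed=0):
    N, d = X.shape
    M = M or N
    rng = np.random.default_rng(seed)
    Z = rng.uniform(size=(M, d))                    # uniform "background" sample
    Xy = np.vstack([X, Z])
    y = np.concatenate([np.ones(N), np.zeros(M)])
    clf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(Xy, y)

    def p_hat(x):
        p1 = clf.predict_proba(np.atleast_2d(x))[:, 1]
        p1 = np.clip(p1, 1e-6, 1 - 1e-6)            # guard against division by zero
        return (M / N) * p1 / (1 - p1)
    return p_hat
```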