Surrogate learning with mean independence

In this paper we showed that if we have a feature $latex x_1$ that is class-conditionally statistically independent of the rest of the features, denoted $latex x_2$, then learning a classifier between the two classes $latex y=0$ and $latex y=1$ can be transformed into learning a predictor of $latex x_1$ from $latex x_2$ and another of $latex y$ from $latex x_1$. Since the first predictor can be learned on unlabeled examples and the second is a classifier on a 1-D space, the learning problem becomes easy. In a sense, $latex x_1$ acts as a surrogate for $latex y$.

Similar ideas can be found in Ando and Zhang '07, Quadrianto et al. '08, Blitzer et al. '06, and others.

Derivation from mean-independence

I'll now derive a similar surrogate learning algorithm from mean-independence rather than full statistical independence. Recall that the random variable $latex U$ is mean-independent of the r.v. $latex V$ if $latex E[U|V] = E[U]$. Although weaker than full independence, mean-independence is still a fairly strong assumption; in particular, it is stronger than mere lack of correlation.

We assume that the feature space consists of a single feature $latex x_1$ together with the rest of the features $latex x_2$. We are still in a two-class setting, i.e., $latex y \in \{0,1\}$. We further assume

1. $latex x_1$ is at least somewhat useful for classification, or in other words, $latex E[x_1|y=0] \neq E[x_1|y=1]$.

2. $latex x_1$ is class-conditionally mean-independent of $latex x_2$, i.e., $latex E[x_1|y,x_2] = E[x_1|y]$ for $latex y \in \{0,1\}$.
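As a concrete illustration of assumption 2 (a toy example of mine, not from the paper): suppose $latex x_1 = a_y + (1 + \|x_2\|)\epsilon$, where $latex a_0 \neq a_1$ and $latex \epsilon$ is zero-mean noise independent of $latex (y, x_2)$. Then

$latex E[x_1|y,x_2] = a_y + (1 + \|x_2\|)E[\epsilon] = a_y = E[x_1|y]$

so both assumptions hold, even though $latex x_1$ is not class-conditionally independent of $latex x_2$ (its conditional variance still depends on $latex x_2$).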

Now let us consider the quantity $latex E[x_1|x_2]$. We have

$latex E[x_1|x_2] = E[x_1|x_2,y=0] P(y=0|x_2) + E[x_1|x_2,y=1] P(y=1|x_2)$

$latex = E[x_1|y=0] P(y=0|x_2) + E[x_1|y=1] P(y=1|x_2)$

where the first equality is the law of total expectation and the second uses assumption 2.

Notice that $latex E[x_1|x_2]$ is a convex combination of $latex E[x_1|y=0]$ and $latex E[x_1|y=1]$.

Now, using the fact that $latex P(y=0|x_2) + P(y=1|x_2) = 1$, a little algebra shows that

$latex P(y=1|x_2)=\frac{E[x_1|x_2]-E[x_1|y=0]}{E[x_1|y=1]-E[x_1|y=0]}\;\;\;\;\;\;(1)$
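For completeness, the algebra is short: substituting $latex P(y=0|x_2) = 1 - P(y=1|x_2)$ into the expression above gives

$latex E[x_1|x_2] = E[x_1|y=0] + P(y=1|x_2)\left(E[x_1|y=1]-E[x_1|y=0]\right)$

and solving for $latex P(y=1|x_2)$ yields Equation (1). Assumption 1 guarantees that the denominator is nonzero.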

We have succeeded in decoupling $latex y$ and $latex x_2$ on the right-hand side, which yields a simple semi-supervised classification method: we just need the class-conditional means of $latex x_1$ and a regressor (which can be learned on unlabeled data) to compute $latex E[x_1|x_2]$. Again $latex x_1$, acting as a surrogate for $latex y$, is what we predict from $latex x_2$.
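To make this concrete, here is a minimal sketch in Python with scikit-learn. The regressor choice (gradient boosting) and all names (fit_surrogate_classifier, X2_unlab, x1_lab, etc.) are mine for illustration, not from the paper; any regression method for $latex E[x_1|x_2]$ would do.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_surrogate_classifier(X2_unlab, x1_unlab, x1_lab, y_lab):
    # Regressor for E[x1 | x2], trained entirely on unlabeled examples.
    reg = GradientBoostingRegressor().fit(X2_unlab, x1_unlab)

    # Class-conditional means of x1, estimated from a small labeled sample.
    m0 = x1_lab[y_lab == 0].mean()
    m1 = x1_lab[y_lab == 1].mean()

    def predict_proba(X2):
        # Equation (1): P(y=1|x2) = (E[x1|x2] - m0) / (m1 - m0)
        p = (reg.predict(X2) - m0) / (m1 - m0)
        return np.clip(p, 0.0, 1.0)  # noisy estimates can push the ratio outside [0, 1]

    return reg, predict_proba

Note that the labeled sample contributes only the two scalar means; everything that relates $latex x_2$ to the prediction is learned from unlabeled data.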

As opposed to the formulation in the paper, this formulation easily accommodates a continuous-valued $latex x_1$.

Discussion

1. The first thing to note is that we are only able to write an expression for $latex P(y|x_2)$ but not for $latex P(y|x_1, x_2)$. That is, we weaken the independence assumption to mean-independence at the cost of "wasting" the feature $latex x_1$.

Of course, if we have full class-conditional independence we can still make use of $latex x_1$, via Equation (1) and the fact that, under independence,

$latex P(y|x_1, x_2) \propto P(y|x_2) P(x_1|y)$,

which follows from Bayes' rule and $latex P(x_1|y, x_2) = P(x_1|y)$.

2. Assume (without loss of generality) that $latex E[x_1|y=1] > E[x_1|y=0]$. Because $latex E[x_1|x_2]$ lies between $latex E[x_1|y=0]$ and $latex E[x_1|y=1]$, Equation (1) says that $latex P(y=1|x_2)$ is a monotonically increasing function of $latex E[x_1|x_2]$.

This means that $latex E[x_1|x_2]$ itself can be used as the classifier, and labeled examples are needed only to determine a threshold for trading off precision against recall. The classifier (or perhaps we should call it a ranker) is therefore built entirely from unlabeled samples.
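Continuing the hypothetical sketch above (again, the function and the precision-targeting rule are my own illustration): the scores are computed without any labels, and a small labeled set only fixes the operating point.

import numpy as np
from sklearn.metrics import precision_recall_curve

def rank_and_threshold(reg, X2_test, X2_lab, y_lab, target_precision=0.9):
    # E[x1 | x2] itself is the ranking score; no labels are involved here.
    scores = reg.predict(X2_test)

    # A handful of labeled examples only chooses the precision/recall trade-off.
    prec, _, thr = precision_recall_curve(y_lab, reg.predict(X2_lab))
    feasible = np.where(prec[:-1] >= target_precision)[0]
    threshold = thr[feasible[0]] if len(feasible) else thr[-1]
    return scores, scores >= threshold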

I'll post a neat little application of this method soon.


