Surrogate learning with mean independence
In this paper we showed that if we have a feature $latex x_1$ that is class-conditionally statistically independent of the rest of the features, denoted $latex x_2$, then learning a classifier between the two classes $latex y=0$ and $latex y = 1$ can be transformed into learning a predictor of $latex x_1$ from $latex x_2$ and another of $latex y$ from $latex x_1$. Since the first predictor can be learned from unlabeled examples and the second is a classifier on a 1-D space, the learning problem becomes easy. In a sense, $latex x_1$ acts as a surrogate for $latex y$.
Similar ideas can be found in Ando and Zhang '07, Quadrianto et al. '08, Blitzer et al. '06, and others.
Derivation from mean-independence
I'll now derive a similar surrogate learning algorithm from mean independence rather than full statistical independence. Recall that the random variable $latex U$ is mean-independent of the r.v. $latex V$ if $latex E[U|V] = E[U]$. Albeit weaker than full independence, mean-independence is still a pretty strong assumption. In particular it is stronger than the lack of correlation.
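To see the gap concretely (a standard example, not from the paper): suppose that, given $latex V$, we set $latex U = V\varepsilon$ where $latex \varepsilon$ is zero-mean noise independent of $latex V$. Then $latex E[U|V] = V E[\varepsilon] = 0 = E[U]$, so $latex U$ is mean-independent of $latex V$, yet $latex \mathrm{Var}(U|V) = V^2 \mathrm{Var}(\varepsilon)$ depends on $latex V$, so the two variables are not independent. In the other direction, mean-independence gives $latex \mathrm{Cov}(U,V) = E[V\,E[U|V]] - E[U]E[V] = 0$, while zero correlation alone does not imply mean-independence.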
We assume that the feature space contains the single feature $latex x_1$ and the rest of the features $latex x_2$. We are still in a two-class situation, i.e., $latex y \in \{0,1\}$. We further assume
1. $latex x_1$ is at least somewhat useful for classification, or in other words, $latex E[x_1|y=0] \neq E[x_1|y=1]$.
2. $latex x_1$ is class-conditionally mean-independent of $latex x_2$, i.e., $latex E[x_1|y,x_2] = E[x_1|y]$ for $latex y \in \{0,1\}$.
Now let us consider the quantity $latex E[x_1|x_2]$. We have
$latex E[x_1|x_2] = E[x_1|x_2,y=0] P(y=0|x_2) + E[x_1|x_2,y=1] P(y=1|x_2)$
$latex = E[x_1|y=0] P(y=0|x_2) + E[x_1|y=1] P(y=1|x_2)$
where the second equality uses the class-conditional mean-independence assumption (assumption 2).
Notice that $latex E[x_1|x_2]$ is a convex combination of $latex E[x_1|y=0]$ and $latex E[x_1|y=1]$.
Now, using the fact that $latex P(y=0|x_2) + P(y=1|x_2) = 1$, we can substitute $latex P(y=0|x_2) = 1 - P(y=1|x_2)$ above and solve for $latex P(y=1|x_2)$ to obtain
$latex P(y=1|x_2)=\frac{E[x_1|x_2]-E[x_1|y=0]}{E[x_1|y=1]-E[x_1|y=0]}\;\;\;\;\;\;(1)&s=1$
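As a quick sanity check: if $latex x_1$ happens to be a binary 0/1 feature that equals $latex y$ exactly, then $latex E[x_1|y=0]=0$ and $latex E[x_1|y=1]=1$, and Equation (1) collapses to $latex P(y=1|x_2) = E[x_1|x_2] = P(x_1=1|x_2)$, as expected.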
We have succeeded in decoupling $latex y$ and $latex x_2$ on the right hand side, which results in a simple semi-supervised classification method. To compute $latex E[x_1|x_2]$ we just need a regressor of $latex x_1$ on $latex x_2$, which can be learned on unlabeled data, together with the two class-conditional means of $latex x_1$. Again, $latex x_1$, which is predicted from $latex x_2$, acts as a surrogate for $latex y$.
Unlike the formulation in the paper, this formulation easily accommodates a continuous-valued $latex x_1$.
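To make the recipe concrete, here is a minimal sketch in Python. Everything in it is illustrative rather than part of the formulation above: the synthetic data-generating process, the use of scikit-learn's GradientBoostingRegressor for the regression step, and the assumption that a small labeled subset is available to estimate the two class-conditional means.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic data satisfying the assumptions (illustrative only):
# given y, x1 has mean mu_y; its noise scale depends on x2, so x1 is
# mean-independent of x2 given y without being fully independent of it.
n = 5000
y = rng.integers(0, 2, size=n)
x2 = (2.0 * y + rng.normal(scale=1.5, size=n)).reshape(-1, 1)   # "rest of the features"
x1 = np.where(y == 1, 1.0, 0.0) + 0.5 * np.abs(x2[:, 0]) * rng.normal(size=n)

# Pretend only a handful of examples carry labels.
labeled = rng.choice(n, size=20, replace=False)

# Step 1: regress x1 on x2 using *unlabeled* data to estimate E[x1 | x2].
reg = GradientBoostingRegressor().fit(x2, x1)
e_x1_given_x2 = reg.predict(x2)

# Step 2: estimate the class-conditional means of x1 from the labeled subset.
m0 = x1[labeled][y[labeled] == 0].mean()  # estimate of E[x1 | y = 0]
m1 = x1[labeled][y[labeled] == 1].mean()  # estimate of E[x1 | y = 1]

# Step 3: Equation (1), clipped to [0, 1] because the plug-in estimates
# need not respect the bounds exactly.
p_y1_given_x2 = np.clip((e_x1_given_x2 - m0) / (m1 - m0), 0.0, 1.0)

print("accuracy at threshold 0.5:", ((p_y1_given_x2 > 0.5) == y).mean())
```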
Discussion
1. The first thing to note is that we are only able to write an expression for $latex P(y|x_2)$ but not $latex P(y|x_1, x_2)$. That is, we are able to weaken the independence to mean-independence at the expense of "wasting" feature $latex x_1$.
Of course, if we have full class-conditional independence we can still use $latex x_1$, by combining Equation (1) with the fact that, under independence, we have
$latex P(y|x_1, x_2) \propto P(y|x_2) P(x_1|y)$.
2. If (without loss of generality) we assume that $latex E[x_1|y=1] > E[x_1|y=0]$, because $latex E[x_1|x_2]$ lies somewhere between $latex E[x_1|y=0]$ and $latex E[x_1|y=1]$, Equation (1) says that $latex P(y=1|x_2)$ is a monotonically increasing function of $latex E[x_1|x_2]$.
This means that $latex E[x_1|x_2]$ itself can be used as the classifier, and labeled examples are needed only to determine a threshold for trading off precision vs. recall. The classifier (or perhaps we should call it a ranker) therefore is built entirely from unlabeled samples.
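For example (again just a sketch, with stand-in numbers): scikit-learn's precision_recall_curve can be used to pick that threshold from whatever small labeled sample is available, while the scores themselves are just the regressor outputs $latex E[x_1|x_2]$ learned from unlabeled data.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Stand-in values: `scores` are regressor outputs E[x1 | x2] on the few
# labeled examples, `labels` are their true classes.
scores = np.array([0.10, 0.85, 0.40, 0.75, 0.20, 0.70, 0.35, 0.60])
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Labels enter only here, to choose an operating point on the P/R curve;
# the ranking itself came entirely from unlabeled data.
precision, recall, thresholds = precision_recall_curve(labels, scores)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = f1[:-1].argmax()  # thresholds has one fewer entry than precision/recall
print("threshold:", thresholds[best],
      "precision:", precision[best], "recall:", recall[best])
```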
I'll post a neat little application of this method soon.