The redundancy of view-redundancy for co-training

Blum and Mitchell's co-training is a (very deservedly) popular semi-supervised learning algorithm that relies on class-conditional feature independence and view-redundancy (or view-agreement).

I will argue that the view-redundancy assumption is unnecessary, and along the way show how surrogate learning can be plugged into co-training (which is not all that surprising considering that both are multi-view semi-supervised algorithms that rely on class-conditional view-independence).

I'll first explain co-training with an example.

Co-training - The setup

Consider a $latex y \in \{0,1\}$ classification problem on the feature space $latex \mathcal{X}=\mathcal{X}_1 \times \mathcal{X}_2$. I.e., a feature vector $latex x$ can be split into two as $latex x = [x_1, x_2]$.

We make the rather restrictive assumption that $latex x_1$ and $latex x_2$ are class-conditionally independent for both classes. I.e., $latex P(x_1, x_2|y) = P(x_1|y) P(x_2|y)$ for $latex y \in \{0,1\}$.

(Note that unlike surrogate learning with mean-independence, both $latex \mathcal{X}_1$  and $latex \mathcal{X}_2$ are allowed to be multi-dimensional.)

Co-training makes the additional assumption that either view is sufficient for classification. This view-redundancy assumption basically states that the region of the feature space where the Bayes optimal classifiers on the two views disagree with each other has zero probability mass.

(The original co-training paper actually relaxes this assumption in the epilogue, but it is unnecessary to begin with, and the assumption has proliferated in later manifestations of co-training.)

We are given some labeled data (or a weak classifier on one of the views) and a large supply of unlabeled data. We are now ready to proceed with co-training to construct a Bayes optimal classifier.

Co-training - The algorithm

The algorithm is very simple. We use our weak classifier, say $latex h_1(x_1)$ (which we were given, or which we constructed using the measly labeled data), on one view ($latex x_1$) to classify all the unlabeled data. We select the examples classified with high confidence, and use these as labeled examples (with the labels assigned by the weak classifier) to train a classifier $latex h_2(x_2)$ on the other view ($latex x_2$).

We now classify the unlabeled data with $latex h_2(x_2)$ to similarly generate labeled data to retrain $latex h_1(x_1)$. This back-and-forth procedure is repeated until exhaustion.

Under the above assumptions (and with "sufficient" unlabeled data) $latex h_1$ and $latex h_2$ converge to the Bayes optimal classifiers on the respective feature views. Since either view is enough for classification, we just pick one of the classifiers and release it into the wild.
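A minimal sketch of this loop in Python, assuming both views are given as feature matrices and a scikit-learn-style probabilistic classifier is used on each view. The function and variable names (`cotrain`, `threshold`, and so on) are mine, not from the original paper; this is only an illustration of the back-and-forth described above.

```python
from sklearn.linear_model import LogisticRegression

def cotrain(X1, X2, X1_lab, y_lab, n_iters=10, threshold=0.9):
    """Alternately train a classifier on each view using the other view's
    confident predictions on the unlabeled pool as pseudo-labels."""
    h1 = LogisticRegression().fit(X1_lab, y_lab)   # weak starting classifier on view 1
    h2 = None
    for _ in range(n_iters):
        # h1 labels the unlabeled pool; keep only the confident examples
        p1 = h1.predict_proba(X1)[:, 1]
        keep = (p1 > threshold) | (p1 < 1 - threshold)
        h2 = LogisticRegression().fit(X2[keep], (p1[keep] > 0.5).astype(int))

        # h2 returns the favour and produces pseudo-labels for view 1
        p2 = h2.predict_proba(X2)[:, 1]
        keep = (p2 > threshold) | (p2 < 1 - threshold)
        h1 = LogisticRegression().fit(X1[keep], (p2[keep] > 0.5).astype(int))
    return h1, h2
```

In practice one would also grow the confident set gradually and guard against degenerate pseudo-label sets (e.g., all one class), but that bookkeeping is omitted here.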

Co-training - Why does it work?

I'll try to present an intuitive explanation of co-training using the example depicted in the following figure. Please focus on it intently.

[Figure: the co-training example described below]

The feature vector $latex x$ in the example is 2-dimensional and both views $latex x_1$ and $latex x_2$ are 1-dimensional. The class-conditional distributions are uncorrelated and jointly Gaussian (which means independent) and are depicted by their equiprobability contours in the figure. The marginal class-conditional distributions are shown along the two axes. Class $latex y=0$ is shown in red and class $latex y=1$ is shown in blue. The picture also shows some unlabeled examples.
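Data with the same structure is easy to simulate; the class means, unit variances, and equal priors below are arbitrary choices of mine for illustration, not necessarily the ones in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
y = rng.binomial(1, 0.5, size=n)                            # class labels, equal priors

# Class-conditionally independent views: per class, a diagonal-covariance Gaussian.
mu = {0: np.array([1.0, 2.0]), 1: np.array([-1.0, -2.0])}   # arbitrary class means
x = np.stack([rng.normal(mu[c], 1.0) for c in y])           # unit variance in each view
x1, x2 = x[:, 0], x[:, 1]                                   # the two 1-dimensional views
```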

Assume we have a weak classifier $latex h_1(x_1)$ on the first view. If we extend the classification boundary for this classifier to the entire space $latex \mathcal{X}$, the boundary necessarily consists of lines parallel to the $latex x_2$ axis. Let's say there is only one such line, and all the examples below that line are assigned class $latex y=1$ while all the examples above are assigned class $latex y=0$.

We now ignore all the examples close to the classification boundary of $latex h_1$ (i.e., all the examples in the grey band) and project the rest of the points onto the $latex x_2$ axis.

How will these projected points be distributed along $latex x_2$?

Since the examples that were ignored (in the grey band) were selected based only on their $latex x_1$ values, owing to class-conditional independence, the marginal distribution along $latex x_2$ for either class will be exactly the same as if none of the samples were ignored. This is precisely what the class-conditional independence assumption buys us.
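A quick numerical check of this claim on the simulated data above (the location and width of the "grey band" here are arbitrary; any selection rule that depends only on $latex x_1$ would do):

```python
band = (x1 > -0.5) & (x1 < 0.5)      # pretend this is the low-confidence band of h1
kept = ~band

for c in (0, 1):
    full = x2[y == c]                # class-conditional x2 marginal, all examples
    filt = x2[(y == c) & kept]       # same marginal after dropping the band
    print(c, full.mean(), filt.mean(), full.std(), filt.std())
# Per class, the mean and spread of x2 are statistically unchanged:
# selecting on x1 alone does not disturb the class-conditional x2 marginals.
```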

The procedure has two subtle, but largely innocuous, consequences.

First, since we don't know how many class $latex 0$ and class $latex 1$ examples are in the grey band, the relative ratio of the examples of the two classes in the not-ignored set may not be the same as in the original full unlabeled sample set. If the class priors $latex P(y)$ are known, this can easily be corrected for when we learn $latex h_2(x_2)$. If the class priors are unknown, other assumptions on $latex h_1(x_1)$ are necessary.
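One way to make that correction, assuming the priors are known and the learner accepts per-example weights (as scikit-learn estimators do through `sample_weight`), is to reweight the pseudo-labeled examples so that their weighted class proportions match $latex P(y)$. A sketch:

```python
import numpy as np

def prior_weights(pseudo_y, priors=(0.5, 0.5)):
    """Per-example weights that restore the desired class proportions
    among the not-ignored, pseudo-labeled examples."""
    w = np.empty(len(pseudo_y), dtype=float)
    for c in (0, 1):
        frac = (pseudo_y == c).mean()            # observed fraction of class c
        w[pseudo_y == c] = priors[c] / max(frac, 1e-12)
    return w

# e.g. h2 = LogisticRegression().fit(X2[keep], pseudo_y,
#                                    sample_weight=prior_weights(pseudo_y))
```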

Second, when we project the unlabeled examples onto $latex x_2$, we assign them the labels given to them by $latex h_1$, which can be erroneous. In the figure above, there will be examples in the region indicated by A that are actually class $latex 1$ but have been assigned class $latex 0$, and examples in region B that were from class $latex 0$ but were called class $latex 1$.

Again, because of the class-conditional independence assumption, these erroneously labeled examples will be distributed according to the marginal class-conditional $latex x_2$ distributions. I.e., along the $latex x_2$ axis in the figure above we imagine a very low amplitude blue distribution with the same shape and location as the red distribution, and a very low amplitude red distribution with the same shape and location as the blue distribution. (Note: this is the $latex (\alpha, \beta)$ noise in the original co-training paper.)

This amounts to having a labeled training set with label errors, but with the errors generated independently of location in the feature space. That is, the number of errors in a region of the space is proportional to the number of examples in that region. These proportionally distributed errors are then washed out by the correctly labeled examples when we learn $latex h_2$.
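A small sanity check of this "washing out", again on the simulated data: flip a fraction of the labels uniformly at random, which is exactly what location-independent errors look like, and compare the decision boundary learned on $latex x_2$ with and without the noise. The 15% flip rate is an arbitrary choice.

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

flip = rng.random(n) < 0.15                      # 15% location-independent label errors
y_noisy = np.where(flip, 1 - y, y)

clean = LogisticRegression().fit(x2.reshape(-1, 1), y)
noisy = LogisticRegression().fit(x2.reshape(-1, 1), y_noisy)
print(-clean.intercept_ / clean.coef_.ravel(),   # threshold on x2 without label noise
      -noisy.intercept_ / noisy.coef_.ravel())   # threshold with noise: nearly identical
```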

To recap, co-training works because of the following fact. Starting from a weak classifier $latex h_1$ on $latex x_1$, we can generate very accurate and unbiased training data to train a classifier on $latex x_2$.

No need for view-redundancy

Notice that, in the above example, we made no appeal to any kind of view-redundancy (other than whatever we may get gratis from the independence assumption).

The vigilant reader may however level the following two objections against the above argument-by-example.

1. We build $latex h_1(x_1)$ and $latex h_2(x_2)$ separately. So when training is done, without view-redundancy we have not shown a way to pick which of the two to apply to new test data.

2. At every iteration we need to select unlabeled samples that were classified with high confidence by $latex h_1$ to feed to the trainer for $latex h_2$. Without view-redundancy, it may be that none of the samples are classified with high confidence.

The first objection is easy to respond to. We pick neither $latex h_1$ nor $latex h_2$ for new test data. Instead we combine them to obtain a classifier $latex h(x_1,x_2)$. This is well justified because, under class-conditional independence, $latex P(y|x_1,x_2) \propto P(y|x_1) P(y|x_2) / P(y)$ (with the proportionality over $latex y$).
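In terms of the per-view scores the two classifiers already produce, the combination is a one-liner. A sketch, assuming both classifiers output reasonably calibrated posteriors $latex P(y=1|x_1)$ and $latex P(y=1|x_2)$ and the prior is known:

```python
def combine(p1, p2, prior1=0.5):
    """Combine per-view posteriors under class-conditional independence:
    P(y|x1,x2) is proportional to P(y|x1) * P(y|x2) / P(y)."""
    odds = (p1 / (1 - p1)) * (p2 / (1 - p2)) * (1 - prior1) / prior1
    return odds / (1 + odds)
```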

We react to the second objection by dropping the requirement of classifying with high-confidence altogether.

Dropping the high-confidence requirement by surrogate learning

Instead of training $latex h_2(x_2)$ with examples that are classified with high confidence by $latex h_1(x_1)$, we train $latex h_2(x_2)$ with all the examples (using the scores assigned to them by $latex h_1(x_1)$).

At some iteration of co-training, define the random variable $latex z_1 = h_1(x_1)$. Since $latex x_1$ and $latex x_2$ are class-conditionally independent, $latex z_1$ and $latex x_2$ are also class-conditionally independent. In particular, $latex z_1$ is class-conditionally mean-independent of $latex x_2$. Furthermore, if $latex h_1$ is even a weakly useful classifier, then barring pathologies it will satisfy $latex E[z_1|y=0] \neq E[z_1|y=1]$.

We can therefore apply surrogate learning under mean-independence to learn the classifier on $latex x_2$. (This is essentially the same idea as Co-EM, which was introduced without much theoretical justification.)
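A sketch of that step, assuming the two class-conditional means $latex E[z_1|y=0]$ and $latex E[z_1|y=1]$ can be estimated (say from the small labeled set), and using a linear regressor as a stand-in for whatever estimates $latex E[z_1|x_2]$. Since, under class-conditional mean-independence, $latex E[z_1|x_2] = P(y=1|x_2)E[z_1|y=1] + P(y=0|x_2)E[z_1|y=0]$, the posterior on $latex x_2$ is recovered by a simple rescaling:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def surrogate_h2(X2_unlab, z1, m0, m1):
    """z1 = h1(x1) scores on the unlabeled pool; m0, m1 = E[z1|y=0], E[z1|y=1]
    (assumed estimated elsewhere). Returns a posterior estimator on the x2 view."""
    reg = LinearRegression().fit(X2_unlab, z1)       # stands in for E[z1 | x2]
    def h2(X2):
        p = (reg.predict(X2) - m0) / (m1 - m0)       # invert the mixture of means
        return np.clip(p, 0.0, 1.0)                  # estimate of P(y=1 | x2)
    return h2
```

Every unlabeled example contributes, confidently classified or not; no thresholding is needed.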

Discussion

Hopefully the above argument has convinced the reader that the class-conditional view independence assumption obviates the view-redundancy requirement.

A natural question to ask is whether the reverse is true. That is, if we are given view-redundancy, can we completely eliminate the requirement of class-conditional independence? We can immediately see that the answer is no.

For example, we can duplicate all the features for any classification problem so that view-redundancy holds trivially between the two replicates. Moreover, the second replicate will be statistically fully dependent on the first.

Now if we are given a weak classifier on the first view (or replicate) and try to use its predictions on an unlabeled data set to obtain training data for the second, it would be equivalent to feeding back the predictions of a classifier to retrain itself (because the two views are duplicates of one another).
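The collapse is easy to see mechanically on the simulated data from earlier, with the second "view" taken to be a literal copy of the first (the seed of 20 "labeled" points and the 0.9 threshold are arbitrary):

```python
from sklearn.linear_model import LogisticRegression

X1 = x1.reshape(-1, 1)
X2 = X1.copy()                                   # the "second view" is a duplicate

seed = slice(0, 20)                              # pretend only these points are labeled
h1 = LogisticRegression().fit(X1[seed], y[seed])

p1 = h1.predict_proba(X1)[:, 1]
keep = (p1 > 0.9) | (p1 < 0.1)
h2 = LogisticRegression().fit(X2[keep], (p1[keep] > 0.5).astype(int))
# X2[keep] is identical to X1[keep], so h2 is just h1 retrained on its own
# confident predictions: the "co-training" loop has collapsed into feeding a
# classifier's output back to itself.
```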

This type of procedure (an idea that is decades old) has been called, among other things, self-learning, self-correction, self-training, and decision-directed adaptation. The problem with these approaches is that the training set so generated is biased, and other assumptions are necessary for the feedback procedure to improve over the original classifier.

Of course this does not mean that the complete statistical independence assumption cannot be relaxed. The above argument only shows that at least some amount of independence is necessary.
