Training data bias caused by active learning

As opposed to the traditional supervised learning setting where the labeled training data is generated (we hope) independently and identically, in active learning the learner is allowed to select points for which labels are requested.

Because it is often impossible to construct the equivalent real-world object from its feature values, almost universally, active learning is pool-based. That is we start with a large pool of unlabeled data and the learner (usually sequentially) picks the objects from the pool for which the labels are requested.

One unavoidable effect of active learning is that we end up with a biased training data set. If the true data distribution is $latex P(x,y)$, we have data drawn from some distribution $latex \hat{P}(x,y)$ (as always $latex x$ is the feature vector and $latex y$ is the class label).

We would like to correct for this bias so it does not lead to learning an incorrect classifier. And furthermore we want to use this biased data set to accurately evaluate the classifier.

In general since $latex P(x,y)$ is unknown, if $latex \hat{P}(x,y)$ is arbitrarily different from it there is nothing that can be done. However, thankfully, the bias caused by active learning is more tame.

The type of bias

Assume that marginal feature distribution of the labeled points after active learning is given by $latex \hat{P}(x) = \sum_y\hat{P}(x,y)$. Therefore $latex \hat{P}(x)$ is the putative distribution from which we can assume the feature vectors with labels have been sampled from.

For every feature vector thus sampled from $latex \hat{P}(x)$ we request its label from the oracle which returns a label according to the conditional distribution $latex P(y|x) = \frac{P(x,y)}{\sum_y P(x,y)}$.  That is there is no bias in the conditional distribution. Therefore $latex \hat{P}(x,y) = \hat{P}(x) P(y|x)$. This type of bias has been called covariate shift.

The data

After actively sampling the labels $latex n$ times, let us say we have the following data -- a biased labeled training data set $latex \{x_i, y_i\}_{i=1,\ldots,n} \sim \hat{P}(x,y)$, where the feature vectors $latex x_i$ come from the original pool of $latex M$ unlabeled feature vectors  $latex \{x_i\}_{i=1,\ldots,M} \sim P(x)$

Let us define $latex \beta=\frac{P(x,y)}{\hat{P}(x,y)}=\frac{P(x)}{\hat{P}(x)}$. If $latex \beta$ is large we expect the feature vector $latex x$ to be under-represented in the labeled data set and if it is small it is over-represented.

Now define for each labeled example $latex \beta_i=\frac{P(x_i)}{\hat{P}(x_i)}$ for $latex i = 1,\ldots,n$. If we knew the values of $latex \{\beta_i\}$ we can correct for the bias during training and evaluation.

This paper by Huang et. al., and some of its references deal with the estimation of $latex \{\beta_i\}_{i=1,\ldots,n}$. Remark: This estimation needs take into account that $latex E_{\hat{P}(x)}[\beta_i] = 1$. This implies that the sample mean of the beta values on the labeled data set should be somewhere close to unity. This constraint is explicitly imposed in the estimation method of Huang et. al.

Evaluation of the classifier

We shall first look at bias-correction for evaluation. Imagine that we are handed a classifier $latex f()$, and we are asked to use the biased labeled data set to evaluate its accuracy. Also assume that we used the above method to estimate $latex \{\beta_i\}_{i=1,\ldots,n}$. Now fixing the bias for evaluation boils down to just using a weighted average of the errors, where the weights are given by $latex \{\beta_i\}$.

If the empirical loss on the biased sample is written as $latex R = \frac{1}{n} \sum_i l(f(x_i), y_i)$, we write the estimate of the loss on the true distribution as the weighted loss $latex R_c= \frac{1}{n} \sum_i \beta_i l(f(x_i), y_i)$.

Therefore we increase the contribution of the under-represented examples, and decrease that of the over-represented examples, to the overall loss.

Learning the classifier

How can the bias be accounted for during learning? The straightforward way is to learn the classifier parameters to minimize the weighted loss $latex R_c$ (plus some regularization term) as opposed to the un-weighted empirical loss on the labeled data set.

However, a natural question that can be raised is whether any bias correction is necessary. Note that the posterior class distribution $latex P(y|x)$ is unbiased in the labeled sample. This means that any Bayes-consistent diagnostic classifier on $latex P(x,y)$ will still converge to the Bayes error rate with examples drawn from $latex \hat{P}(x,y)$.

For example imagine constructing a $latex k$-Nearest Neighbor classifier on the biased labeled dataset.  If we let $latex k \rightarrow \infty$ and $latex \frac{k}{n} \rightarrow 0$, the classifier will converge to the Bayes-optimal classifier as $latex n \rightarrow \infty$, even if $latex \hat{P}(x)$ is biased. This is somewhat paradoxical and can be explained by looking at the case of finite $latex n$.

For finite $latex n$, the classifier trades off proportionally more errors in low density regions for fewer overall errors. This means that by correcting for the bias by optimizing the weighted loss, we can obtain a lower error rate. Therefore although both the bias-corrected and un-corrected classifiers converge to the Bayes error, the former converges faster.

Comments

Popular posts from this blog

Incremental complexity support vector machine

An effective kernelization of logistic regression

The Cult of Universality in Statistical Learning Theory