Training data bias caused by active learning

In the traditional supervised learning setting, the labeled training data is (we hope) drawn independently from an identical distribution. In active learning, by contrast, the learner is allowed to select the points for which labels are requested. Because it is often impossible to construct the equivalent real-world object from its feature values, active learning is almost universally pool-based. That is, we start with a large pool of unlabeled data, and the learner (usually sequentially) picks the objects from the pool for which labels are requested. One unavoidable effect of active learning is that we end up with a biased training data set: if the true data distribution is $latex P(x,y)$, our labeled data is instead drawn from some other distribution $latex \hat{P}(x,y)$ (as always, $latex x$ is the feature vector and $latex y$ is the class label). We would like to correct for this bias so that it does not lead to learning an incorrect classifier. Furthermore, we want to use this biased data set to accurately evalu...
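To make the bias concrete, here is a minimal sketch of pool-based selection on a one-dimensional pool. It assumes uncertainty sampling (querying the points closest to the current decision boundary) as the query strategy, which is only one common choice and is not specified by the post. The names `uncertainty_sample`, `boundary`, and `budget` are hypothetical and used purely for illustration.

```python
import random


def uncertainty_sample(pool, boundary, budget):
    """Return the `budget` pool points closest to the decision boundary."""
    return sorted(pool, key=lambda x: abs(x - boundary))[:budget]


def mean(values):
    return sum(values) / len(values)


def spread(values):
    """Population standard deviation."""
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5


random.seed(0)

# Unlabeled pool: feature values drawn from the true marginal P(x).
pool = [random.gauss(0.0, 1.0) for _ in range(10_000)]

boundary = 0.5  # assumed current decision boundary (illustrative only)
budget = 100    # number of labels we can afford to request

# Points an active learner would query versus an i.i.d. sample of equal size.
selected = uncertainty_sample(pool, boundary, budget)
uniform = random.sample(pool, budget)

selected_spread = spread(selected)
uniform_spread = spread(uniform)

print(f"i.i.d. sample:  mean={mean(uniform):.3f}  std={uniform_spread:.3f}")
print(f"active sample:  mean={mean(selected):.3f}  std={selected_spread:.3f}")
```

The actively selected points cluster tightly around the boundary, so their empirical distribution differs sharply from that of the pool. This is the sense in which the labeled set follows $latex \hat{P}(x,y)$ rather than $latex P(x,y)$.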