Posts

Showing posts from September, 2009

BWT for NLP (1)

The Burrows-Wheeler transform (BWT), the main step in the bzip2 compression algorithm, is a permutation transform on a string over an ordered alphabet. It is a clever idea that can also be useful for string processing in natural language processing, and I will present one such use. The BWT massages the original string into a form more amenable to compression. Of course, the transform doesn't alter the compressibility (entropy rate) of the original string; it only makes the string more compressible by the algorithms we know. The reason permuting the string with the BWT is useful (as opposed to, say, sorting it, which would make it extremely compressible) is that the reverse transform (undoing the permutation) requires very little additional information. Mark Nelson wrote a nice introduction to the transform. Moreover, the BWT essentially amounts to constructing the suffix array of the string, and can therefore be computed in time and space linear in the length of the string.
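For concreteness, here is a minimal Python sketch of the transform and its inverse, using the naive sorted-rotations construction rather than the linear-time suffix-array construction mentioned above; the sentinel character is an implementation choice made to keep the transform invertible.

```python
def bwt(s, sentinel="\x00"):
    """Burrows-Wheeler transform via sorted rotations (naive O(n^2 log n) sketch).
    Assumes the sentinel does not occur in s; it marks the end of the string
    so the transform can be undone."""
    s = s + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(bwt_string, sentinel="\x00"):
    """Invert the transform by repeatedly prepending the BWT column and re-sorting."""
    table = [""] * len(bwt_string)
    for _ in range(len(bwt_string)):
        table = sorted(bwt_string[i] + table[i] for i in range(len(bwt_string)))
    original = next(row for row in table if row.endswith(sentinel))
    return original.rstrip(sentinel)

if __name__ == "__main__":
    text = "banana"
    transformed = bwt(text)
    assert inverse_bwt(transformed) == text
    print(repr(transformed))  # identical characters tend to cluster into runs
```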

Incremental complexity support vector machine

One of the problems with using complex kernels with support vector machines is that they tend to produce classification boundaries that are odd, like the ones below. (I generated them using a Java SVM applet from here, whose reliability I cannot swear to, but have no reason to doubt.) Both SVM boundaries use Gaussian RBF kernels: the first with $\sigma = 1$ and the second with $\sigma = 10$, on two different data sets. Note the segments of the boundary to the east of the blue examples in the bottom figure, and those to the south and to the north-east of the blue examples in the top figure. They seem to violate intuition. The reason for these anomalous boundaries is, of course, the large complexity of the function class induced by the RBF kernel with large $\sigma$, which gives the classifier a propensity to make subtle distinctions even in regions of somewhat low example density. A possible solution: using complex kernels only where they are needed. We propose to b
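As a small illustration of how kernel width drives boundary complexity, here is a scikit-learn sketch on a toy data set. The correspondence between the applet's $\sigma$ and scikit-learn's gamma parameter is an assumption, and the support-vector count is only a rough proxy for how convoluted the decision boundary is.

```python
# Fit the same toy data with two Gaussian RBF kernels, one wide and one narrow.
# The mapping between the applet's sigma and scikit-learn's gamma
# (roughly gamma ~ 1 / (2 sigma^2)) is an assumption for illustration only.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for gamma in (0.5, 50.0):  # wide kernel vs. narrow kernel
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    # A narrower kernel typically needs many more support vectors,
    # a rough proxy for a more convoluted decision boundary.
    print(f"gamma={gamma}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy={clf.score(X, y):.2f}")
```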

Training data bias caused by active learning

As opposed to the traditional supervised learning setting, where the labeled training data is generated (we hope) independently and identically, in active learning the learner is allowed to select the points for which labels are requested. Because it is often impossible to construct the equivalent real-world object from its feature values, active learning is almost universally pool-based. That is, we start with a large pool of unlabeled data and the learner (usually sequentially) picks the objects from the pool for which labels are requested. One unavoidable effect of active learning is that we end up with a biased training data set. If the true data distribution is $P(x,y)$, we have data drawn from some distribution $\hat{P}(x,y)$ (as always, $x$ is the feature vector and $y$ is the class label). We would like to correct for this bias so it does not lead to learning an incorrect classifier. Furthermore, we want to use this biased data set to accurately evalu
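As a small illustration of the pool-based setting, here is a Python sketch of a learner that sequentially queries the pool points it is least certain about. The uncertainty criterion and the logistic-regression learner are illustrative assumptions, not the bias-correction method discussed in the post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_pool, y_pool = make_classification(n_samples=500, n_features=10, random_state=0)

# Seed the labeled set with a few examples of each class so a classifier can be fit.
labeled = [int(i) for i in np.flatnonzero(y_pool == 0)[:5]] + \
          [int(i) for i in np.flatnonzero(y_pool == 1)[:5]]
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

for _ in range(40):  # query budget
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    probs = clf.predict_proba(X_pool[unlabeled])[:, 1]
    query = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]  # most uncertain pool point
    labeled.append(query)   # "request the label" for this object
    unlabeled.remove(query)

# The labeled set now over-represents points near the decision boundary:
# it is a draw from \hat{P}(x, y), not from the underlying P(x, y).
print(f"{len(labeled)} labeled examples selected from a pool of {len(X_pool)}")
```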