Showing posts with the label Natural Language Processing

Regularized Minimax for Robust Learning

This post is about using minimax estimation for robust learning when the test data distribution is expected to differ from the training data distribution, i.e., learning that is robust to data drift.

Cost Sensitive Loss Functions

Given a training data set $latex D = \{x_i, y_i\}_{i=1,\ldots,N}$, most learning algorithms learn a classifier $latex \phi$, parametrized by a vector $latex w$, by minimizing a loss function of the form $latex \sum_{i=1}^N l(x_i, y_i, w) + f(w)$, where $latex l(x_i, y_i, w)$ is the loss on example $latex i$ and $latex f(w)$ is some function that penalizes complexity. For example, for logistic regression the loss function looks like $latex \sum_{i=1}^N \log(1 + \exp(-y_i w^T x_i)) + \lambda \|w\|^2$ for some $latex \lambda > 0$. If, in addition, the examples come with costs $latex c_i$ (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost sensitive learning by over/under-sampling the training data or by minimizing a cost-weighted loss function (see this paper by Zadrozny et al.). We further constrain $latex \sum_i^N c_i...
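
To make the cost-weighted objective concrete, here is a minimal sketch in Python with NumPy; the function name cost_weighted_logistic_loss and the uniform-cost example are my own illustration, not code from the post.

```python
import numpy as np

def cost_weighted_logistic_loss(w, X, y, c, lam):
    """Cost-weighted, L2-regularized logistic loss (illustrative sketch).

    X: (N, d) features, y: (N,) labels in {-1, +1},
    c: (N,) per-example costs, lam: regularization strength.
    """
    margins = y * (X @ w)                # y_i * w^T x_i
    losses = np.log1p(np.exp(-margins))  # per-example logistic loss
    return np.dot(c, losses) + lam * np.dot(w, w)

# With uniform costs c_i = 1/N this reduces to the usual regularized objective.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(rng.normal(size=100) > 0, 1.0, -1.0)
w = np.zeros(5)
print(cost_weighted_logistic_loss(w, X, y, np.ones(100) / 100, lam=0.1))
```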

BWT for NLP (2)

I show how the Burrows-Wheeler Transform can be used to compute the similarity between two strings. We submitted results from this method (along with results from the Context-Chain metric developed by my colleagues Frank Schilder and Ravi Kondadadi) for the Automatically Evaluating the Summaries of Peers (AESOP) task of the TAC 2009 conference. The task was to produce an automatic metric for evaluating machine-generated summaries (i.e., system summaries) against human-generated summaries for the TAC '09 Update Summarization Task. The automatic metric is just some function that produces a similarity score between the system summary and the human-generated (so-called model) summary. The proposed metrics were evaluated by comparing their rankings of the system summaries from different peers to the ranking produced by human judges.

Similarity Metric

We use an estimate of the conditional "compressibility" of the model summary given the system summary as the...
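
As a rough illustration of the compression-based idea (not the exact BWT-based estimator we submitted), here is a sketch that approximates the conditional compressibility of the model summary given the system summary using Python's bz2 module, which itself uses the BWT; the function names and the normalization are my own assumptions.

```python
import bz2

def conditional_compressed_size(model_summary: str, system_summary: str) -> int:
    """Rough estimate of the extra compressed bytes needed for the model summary
    once the system summary has been seen:
    C(model | system) ~ |compress(system + model)| - |compress(system)|."""
    joint = len(bz2.compress((system_summary + model_summary).encode("utf-8")))
    alone = len(bz2.compress(system_summary.encode("utf-8")))
    return joint - alone

def similarity(model_summary: str, system_summary: str) -> float:
    """Smaller conditional size means the system summary 'explains' the model
    summary better; crudely normalize by the model summary's own compressed size."""
    model_alone = len(bz2.compress(model_summary.encode("utf-8")))
    cond = conditional_compressed_size(model_summary, system_summary)
    return 1.0 - cond / max(model_alone, 1)

print(similarity("the cat sat on the mat", "a cat was sitting on a mat"))
```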

BWT for NLP (1)

The Burrows-Wheeler transform (BWT), which is the main step in the bzip2 compression algorithm, is a permutation transform on a string over an ordered alphabet. It is a clever idea that can be useful for string processing in natural language processing, and I will present one such use. BWT massages the original string into a form more amenable to compression. Of course, the transform doesn't alter the compressibility (entropy rate) of the original string; all it does is make the string more compressible by the algorithms we know. What makes permuting the string with BWT useful (as opposed to, say, sorting it, which would make it extremely compressible) is that the reverse transform (undoing the permutation) requires very little additional information. Mark Nelson wrote a nice introduction to the transform. Moreover, the BWT essentially amounts to constructing the suffix array of the string, and can therefore be done in time and space linear in the length of the string. ...
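
For readers who want to see the transform in action, here is a minimal, quadratic-time Python sketch of the forward and inverse BWT using explicit rotations and an end-of-string sentinel; a real implementation would use a suffix array, as noted above.

```python
def bwt(s: str, eos: str = "\x03") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations of s + eos and
    take the last column. O(n^2 log n); real implementations use suffix arrays."""
    s = s + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

def ibwt(last_column: str, eos: str = "\x03") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting,
    which rebuilds the rotation table one column at a time."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(eos))
    return original[:-len(eos)]

text = "banana"
transformed = bwt(text)
assert ibwt(transformed) == text
print(transformed)  # 'annb\x03aa' for 'banana'
```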