Posts

The Cult of Universality in Statistical Learning Theory

The question is frequently raised as to why the theory and practice of machine learning are so divergent. Glance at any article about classification and chances are you will find symbol upon lemma and equation upon inequality, making claims about bounds on error rates that should putatively guide the engineer in the solution of her problem. In practice, however, the engineer, having been forewarned by her pragmatic colleagues (or having checked a few herself) that these bounds are vacuous for most realistic problems, circumvents them altogether in her search for any useful nuggets in the article. So why do these oft-ignored analyses persist in a field that is largely composed of engineers? From my brief survey of the literature, it seems that one (but by no means the only) reason is the needless preponderance of worst-case thinking. (Being a Panglossian believer in the purity of science and the intentions of its workers, I

Random Fourier Features for Kernel Density Estimation

The NIPS paper Random Fourier Features for Large-scale Kernel Machines, by Rahimi and Recht, presents a method for randomized feature mapping in which dot products in the transformed feature space approximate (a certain class of) positive definite (p.d.) kernels in the original space. We know that for any p.d. kernel there exists a deterministic map with this property, but it may be infinite-dimensional. The paper presents results indicating that with the randomized map we can get away with only a "small" number of features (at least in a classification setting). Before applying the method to density estimation, let us briefly review the relevant section of the paper.

Bochner's Theorem and Random Fourier Features

Assume that we have data in $R^d$ and a continuous p.d. kernel $K(x,y)$ defined for every pair of points $x,y \in R^d$. Assume further that the kernel is shift-invariant, i.e., $K(x,y) = K(x-y) \triangleq K(\delta)$, and that the kernel is
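To make the construction concrete, here is a minimal NumPy sketch (my own, not code from the paper or the post) of a random Fourier feature map for the Gaussian kernel $K(x,y) = \exp(-\gamma \|x-y\|^2)$; the number of features, the value of $\gamma$, and the toy check are arbitrary illustrative choices.

```python
import numpy as np

def random_fourier_features(X, n_features=500, gamma=1.0, rng=None):
    """Map X (n x d) to a randomized feature space whose dot products
    approximate the Gaussian kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # For the Gaussian kernel, Bochner's theorem gives frequencies drawn
    # from a Gaussian with covariance 2 * gamma * I.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Toy check: compare the approximate kernel matrix to the exact one.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, n_features=2000, gamma=0.5, rng=1)
K_approx = Z @ Z.T
K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(np.abs(K_approx - K_exact).max())  # should be small
```

For density estimation, the same map lets a kernel density estimate $\frac{1}{N}\sum_i K(x, x_i)$ be approximated by the dot product of the query's feature vector with the average feature vector of the data, which is presumably the direction the post takes.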

Regularized Minimax on Synthetic Data

First, I would like to mention that, since my last post, I came across the 2005 paper on Robust Supervised Learning by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately rather than by type, but the details are essentially identical.

Experiments on Synthetic Data

I tried the algorithm on some synthetic data with a linear logistic regression model. The results are shown in the figures below. In both experiments there are examples from two classes (red and blue). Each class is drawn from a mixture of two normal distributions (i.e., there are two types per class). The types are shown as red squares and red circles, and blue diamonds and blue triangles. Class-conditionally, the types have a skewed distribution: there are 9 times as many red squares as red circles, and 9 times as many blue diamonds as blue triangles. We would expect a plain logistic
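For readers who want to reproduce a setup like this, here is a rough sketch of how such skewed mixture data could be generated; the means, spreads, and sample sizes below are my own placeholder choices, not the values used in the post.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(n, mean_major, mean_minor, major_frac=0.9, scale=0.5):
    """Draw n points from a two-component normal mixture with a 9:1 skew
    between the majority and minority types."""
    n_major = int(round(major_frac * n))
    n_minor = n - n_major
    major = rng.normal(loc=mean_major, scale=scale, size=(n_major, 2))
    minor = rng.normal(loc=mean_minor, scale=scale, size=(n_minor, 2))
    return np.vstack([major, minor])

# Red class: squares (majority) and circles (minority); blue likewise.
X_red = sample_class(1000, mean_major=[-1.0, 0.0], mean_minor=[-1.0, 3.0])
X_blue = sample_class(1000, mean_major=[1.0, 0.0], mean_minor=[1.0, -3.0])
X = np.vstack([X_red, X_blue])
y = np.concatenate([np.zeros(len(X_red)), np.ones(len(X_blue))])
```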

Regularized Minimax for Robust Learning

This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e., learning that is robust to data drift.

Cost-Sensitive Loss Functions

Given a training data set $latex D = \{x_i, y_i\}_{i=1,\ldots,N}$, most learning algorithms learn a classifier $latex \phi$, parametrized by a vector $latex w$, by minimizing a loss function of the form $latex \sum_{i=1}^N l(x_i, y_i, w) + f(w)$, where $latex l(x_i, y_i, w)$ is the loss on example $latex i$ and $latex f(w)$ is some function that penalizes complexity. For example, for logistic regression the loss function looks like $latex \sum_{i=1}^N \log\left(1 + e^{-y_i w^T x_i}\right) + \lambda \|w\|^2$ for some $latex \lambda > 0$. If, in addition, the examples come with costs $latex c_i$ (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost-sensitive learning by over/under-sampling the training data or by minimizing a cost-weighted loss function (see this paper by Zadrozny et al.). We further constrain $latex \sum_{i=1}^N c_i
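As a concrete illustration of a cost-weighted, L2-regularized logistic loss, here is a minimal sketch (my own, not the post's code) with labels in $\{-1, +1\}$ and uniform costs as the default; the learning rate, iteration count, and toy data are arbitrary.

```python
import numpy as np

def cost_weighted_logistic_loss(w, X, y, c, lam):
    """Cost-weighted regularized logistic loss.
    X: (N, d) features, y: labels in {-1, +1}, c: per-example costs,
    lam: L2 regularization strength."""
    margins = y * (X @ w)
    per_example = np.log1p(np.exp(-margins))   # l(x_i, y_i, w)
    return np.dot(c, per_example) + lam * np.dot(w, w)

def grad(w, X, y, c, lam):
    margins = y * (X @ w)
    # d/dw log(1 + exp(-m_i)) = -y_i * x_i * sigmoid(-m_i)
    s = 1.0 / (1.0 + np.exp(margins))
    return -(X * (c * y * s)[:, None]).sum(axis=0) + 2.0 * lam * w

# Simple gradient descent on a toy problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200))
c = np.ones(200) / 200      # uniform costs recover plain logistic regression
w = np.zeros(3)
for _ in range(500):
    w -= 0.5 * grad(w, X, y, c, lam=1e-3)
```

The minimax idea of the post then amounts to an adversary choosing the costs $c_i$ (subject to the constraint above) to make this weighted loss as large as possible, while the learner minimizes it.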

Sparse online kernel logistic regression

In a previous post, I talked about an idea for sparsifying kernel logistic regression by using random prototypes. I also showed how the prototypes themselves (as well as the kernel parameters) can be updated. (Update Apr 2010: slides for a tutorial on this stuff.) (As a brief aside, I note that an essentially identical approach was used to sparsify Gaussian Process Regression by Snelson and Ghahramani. For GPR they use gradient ascent on the log-likelihood to learn the prototypes and labels, which is akin to learning the prototypes and betas for logistic regression. The set of prototypes and labels generated by their algorithm can be thought of as a pseudo training set.) I recently (with the help of my super-competent Java developer colleague Hiroko Bretz) implemented the sparse kernel logistic regression algorithm. The learning is done in an online fashion (i.e., using stochastic gradient descent). It seems to perform reasonably well on large datasets. Below I'll show its behav
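Here is a rough sketch of what a prototype-based (sparse) kernel logistic regression learner with SGD updates might look like; the class name, Gaussian kernel choice, and hyperparameters are my own assumptions, and the prototype and kernel-parameter updates mentioned in the post are omitted for brevity.

```python
import numpy as np

class SparseKernelLogisticRegression:
    """Kernel logistic regression sparsified with a small set of prototypes,
    trained online with stochastic gradient descent. A sketch, not the
    post's actual implementation."""

    def __init__(self, prototypes, gamma=1.0, lr=0.1, lam=1e-4):
        self.P = prototypes.copy()            # (m, d) prototype locations
        self.beta = np.zeros(len(prototypes)) # one weight per prototype
        self.gamma, self.lr, self.lam = gamma, lr, lam

    def _features(self, x):
        # Gaussian kernel evaluations of x against the prototypes.
        return np.exp(-self.gamma * ((self.P - x) ** 2).sum(axis=1))

    def predict_proba(self, x):
        return 1.0 / (1.0 + np.exp(-self.beta @ self._features(x)))

    def partial_fit(self, x, y):
        """One SGD step on example (x, y) with y in {0, 1}; updates the betas."""
        k = self._features(x)
        err = self.predict_proba(x) - y
        self.beta -= self.lr * (err * k + self.lam * self.beta)
```

Because each update touches only the m prototype weights rather than one weight per training example, the model stays sparse no matter how large the dataset grows.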

BWT for NLP (2)

I show how the Burrows-Wheeler Transform can be used to compute the similarity between two strings. We submitted results from this method (along with results from the Context-Chain metric developed by my colleagues Frank Schilder and Ravi Kondadadi) for the Automatically Evaluating the Summaries of Peers (AESOP) task of the TAC 2009 conference. The task was to produce an automatic metric for evaluating machine-generated summaries (i.e., system summaries) against human-generated summaries for the TAC '09 Update Summarization Task. Clearly, the automatic metric is just some function that produces a similarity score between the system summary and the human-generated (the so-called model) summary. The proposed metrics were evaluated by comparing their rankings of the system summaries from different peers to the ranking produced by human judges.

Similarity Metric

We use an estimate of the conditional "compressibility" of the model summary given the system summary as the
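To illustrate the flavor of a compression-based conditional similarity (not the exact metric of the post), here is a small sketch using Python's bz2 module, whose compressor is built on the Burrows-Wheeler Transform; the particular normalization is an arbitrary choice for illustration.

```python
import bz2

def conditional_compressibility(model_summary: str, system_summary: str) -> float:
    """How much better does the model summary compress when the system
    summary is prepended as context? Higher means more similar."""
    m = model_summary.encode("utf-8")
    s = system_summary.encode("utf-8")
    c_m = len(bz2.compress(m))        # cost of the model summary alone
    c_s = len(bz2.compress(s))        # cost of the system summary alone
    c_sm = len(bz2.compress(s + m))   # cost of both together
    # Bytes "saved" on the model summary by conditioning on the system
    # summary, normalized by its unconditional cost.
    return (c_m - (c_sm - c_s)) / c_m

print(conditional_compressibility("the cat sat on the mat",
                                  "a cat sat on a mat"))
```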