The Cult of Universality in Statistical Learning Theory
A frequently raised question is why the theory and practice of machine learning diverge so much. Glance at any article about classification and chances are that you will find symbol upon lemma and equation upon inequality, making claims about bounds on the error rates that should putatively guide the engineer in the solution of her problem.
However, the engineer, forewarned by her pragmatic colleagues (or having checked a few bounds herself) that these bounds are vacuous for most realistic problems, skips them altogether in her search for any useful nuggets in the article.
So why do these oft-ignored analyses still persist in a field that is composed largely of engineers? From my brief survey of the literature it seems that one (but, by no means, the only) reason is the needless preponderance of worst-case thinking. (Being a Panglossian believer in the purity of science and in the intentions of its workers, I am immediately dismissing the cynical suggestion that these analyses are appended to an article only to intimidate the insecure reviewer.)
The cult of universality
An inventive engineer designs a learning algorithm for her problem of classifying birds from the recordings of their calls. She suspects that her algorithm is more generally applicable and sits down to analyze it formally. She vaguely recalls various neat generalization error bounds she learned about during her days at the university, and wonders if they are applicable.
The bounds make claims of the kind
"for my classifier whose complexity is $c$, if trained on $m$ examples, then for any distribution that generated the data, it is guaranteed that the
generalization error rate $\leq$ error rate on the training set + some function of $(c, m)$
with high probability".
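To get a feel for the "some function of $(c, m)$" term, here is a minimal sketch that evaluates one commonly quoted form of the VC bound, $\sqrt{(d(\ln(2m/d) + 1) + \ln(4/\delta))/m}$, where $d$ is the VC dimension and the bound holds with probability at least $1 - \delta$. The choice of this particular bound, the name `vc_penalty`, and the example values of $d$ and $m$ are assumptions made for illustration; the argument here does not depend on the exact form.

```python
import math

def vc_penalty(d, m, delta=0.05):
    """Complexity term of one commonly quoted VC bound (illustrative assumption).

    d: VC dimension, m: number of training examples (with d < m),
    delta: the bound holds with probability at least 1 - delta.
    """
    return math.sqrt((d * (math.log(2 * m / d) + 1) + math.log(4 / delta)) / m)

# An error rate can never exceed 1, so a penalty of 1 or more says nothing at all.
for d, m in [(10, 1000), (100, 1000), (500, 1000)]:
    penalty = vc_penalty(d, m)
    print(f"d={d:4d}  m={m}  penalty={penalty:.3f}  vacuous={penalty >= 1}")
```

The penalty shrinks as $m$ grows and swells as $d$ approaches $m$, which is the regime where the guarantee stops being informative.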
Some widely used measures of the complexity of a classifier are its VC dimension and its Rademacher complexity, both of which measure the ability of the classifier (more precisely, of the family of functions it chooses from) to fit arbitrary labelings of a training set. The intuition is that if the classifier can imitate any arbitrary labeling of a set of vectors, it will generalize poorly.
Because of the phrase "for any distribution" in the statement of the bound, the bound is said to be universally applicable. It is this pursuit of universality which is a deplorable manifestation of worst-case thinking. It is tolerable in mathematicians who delight in pathologies, but can be debilitating in engineers.
The extent of pessimism induced by the requirement of universality is not well appreciated. The following example is designed to illustrate this by relaxing the requirement from "any distribution" to "any smooth distribution", which is not much of a relaxation at all.
Assume that I have a small training data set $\{(x_i, y_i)\}$ with $x_i \in \mathbb{R}^d$, drawn from a continuous distribution $p(x, y)$. Assume further that $p(x)$ is reasonably smooth.
I now build a linear classifier under some loss (say an SVM). I then take all the training examples that are misclassified by the linear classifier and memorize them along with their labels.
For a test vector $x$, if $x$ is within $\epsilon$ of a memorized training example, I give it the label of that training example. Otherwise I use the linear classifier to obtain my prediction.
I can make $\epsilon$ very small, and since the training examples will be in general position with probability one, this classification scheme is unambiguous.
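The construction can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: a least-squares linear classifier stands in for the SVM, labels are taken to be $\pm 1$, and the class name `LinearPlusMemorization` and its attributes are hypothetical names chosen for this sketch.

```python
import numpy as np

class LinearPlusMemorization:
    """Linear classifier that also memorizes its own training mistakes."""

    def __init__(self, eps=1e-9):
        self.eps = eps  # radius around each memorized example

    def _linear(self, X):
        # Plain linear decision rule with a bias term.
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return np.where(Xb @ self.w >= 0, 1, -1)

    def fit(self, X, y):
        # Least-squares fit to the +/-1 labels (a stand-in for an SVM).
        Xb = np.hstack([X, np.ones((len(X), 1))])
        self.w, *_ = np.linalg.lstsq(Xb, y.astype(float), rcond=None)
        # Memorize every training example the linear rule gets wrong.
        wrong = self._linear(X) != y
        self.mem_X, self.mem_y = X[wrong], y[wrong]
        return self

    def predict(self, X):
        pred = self._linear(X)
        if len(self.mem_X):
            # Distance from every test point to every memorized point.
            dist = np.linalg.norm(X[:, None, :] - self.mem_X[None, :, :], axis=2)
            nearest = dist.argmin(axis=1)
            close = dist[np.arange(len(X)), nearest] <= self.eps
            # Within eps of a memorized example: return its stored label.
            pred[close] = self.mem_y[nearest[close]]
        return pred
```

On the training set, `predict` reproduces the labels exactly because each misclassified example sits at distance zero from its memorized copy. On fresh draws from a continuous distribution, the memorized points are essentially never within `eps`, so the predictions coincide with those of `_linear`.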
This classifier will have zero error on all training sets and therefore will have high complexity according to the usual complexity measures such as VC dimension and Rademacher complexity. However, if I ignore the contribution of the memorized points (which only play a role for a set of vanishingly small probability), I have a linear classifier.
Therefore, although it is reasonable to expect any analysis to yield very similar bounds on the generalization error for a linear classifier and my linear+memorization classifier, the requirement of universality leads to vacuous bounds for the latter.
Even if I assume nothing more than smoothness, I do not know how to derive reasonable statements with the existing tools. And we almost always know much more about the data distributions!
To reiterate, checking one's learning algorithm against the worst possible distribution is akin to designing a bicycle and checking how well it serves for holding up one's pants.
"The medicine bottle rules"
Our engineer ponders these issues, muses about the "no free lunch" results that imply that for any two classifiers there are distributions for which either one of them is better than the other, and wonders about the philosophical distinction between a priori restricting the function space that the learning algorithm searches, and a priori restricting the distributions that the learning algorithm is applicable to.
After a short nap, she decides on a sensible route for her analysis.
1. State the restrictions on the distribution. She shows that her algorithm will perform very well if her assumptions about the data distribution are satisfied. She further argues that the allowed distributions are still broad enough to cover many other problems.
2. State to what extent the assumptions can be violated. She analyzes how the quality of her algorithm degrades when the assumptions are satisfied only approximately.
3. State which assumptions are necessary. She analyzes the situations where her algorithm will definitely fail.
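Rule 2 in particular lends itself to a quick empirical check. The sketch below is only an illustration under stated assumptions, not the engineer's algorithm: a nearest-class-mean classifier stands in for her learner, the "assumption" is that both classes are spherical Gaussians with equal spread, and the hypothetical parameter `spread_ratio` controls how badly that assumption is violated.

```python
import numpy as np

def sample(n, spread_ratio, rng):
    """Two 2-D Gaussian classes; class +1 has spread_ratio times the std of class -1."""
    X_neg = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(n, 2))
    X_pos = rng.normal(loc=[2.0, 0.0], scale=spread_ratio, size=(n, 2))
    X = np.vstack([X_neg, X_pos])
    y = np.array([-1] * n + [1] * n)
    return X, y

def nearest_mean_accuracy(spread_ratio, n_train=200, n_test=2000, seed=0):
    """Test accuracy of a nearest-class-mean rule at a given level of violation."""
    rng = np.random.default_rng(seed)
    X_tr, y_tr = sample(n_train, spread_ratio, rng)
    X_te, y_te = sample(n_test, spread_ratio, rng)
    # "Training" is just estimating one mean per class.
    mu_neg = X_tr[y_tr == -1].mean(axis=0)
    mu_pos = X_tr[y_tr == 1].mean(axis=0)
    # Predict the class whose mean is closer.
    d_neg = np.linalg.norm(X_te - mu_neg, axis=1)
    d_pos = np.linalg.norm(X_te - mu_pos, axis=1)
    pred = np.where(d_pos < d_neg, 1, -1)
    return float((pred == y_te).mean())

# spread_ratio = 1.0 satisfies the equal-spread assumption; larger values violate it.
for ratio in [1.0, 2.0, 4.0, 8.0]:
    print(f"spread ratio {ratio:>4}: accuracy {nearest_mean_accuracy(ratio):.3f}")
```

A table of this kind, accuracy against degree of violation, is the sort of label a medicine bottle analysis would carry.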
I believe that these are good rules to follow while analyzing classification algorithms. My professor George Nagy calls these the medicine bottle rules because, as on a medicine label, we require information on how to administer the drug, what it is good for, what it is bad for, and perhaps some interesting side effects.
I do not claim to follow this advice unfailingly and I admit to some of the above crimes. I do, however, believe that medicine bottle analysis is vastly more useful than much of what passes for learning theory. I look forward to hearing your thoughts, nimble reader, on the kinds of analyses you would care enough about to read.
Yeah. As a fellow industry worker looking for practical learners, I tend to (nay, pretty much always) ignore the theoretical bounds of my setup. They're just not useful.
I agree in particular most strongly with your Medicine Bottle Rule #2; there can be a big operational difference between two algorithms that claim identical results when their inputs are independent, if one degrades well when dependence creeps in & the other degrades poorly.
For a long time it was a mystery why Naïve Bayes tends to work well even when its assumptions are violated, though the Domingos/Pazzani paper helped people understand why. But I think everyone who uses NB still ignores the new, relaxed conditions - the popular takeaway from their paper seems to have become "don't worry, it'll probably work fine", not some new understanding of the optimal conditions and how to use them in practice.
The plebes' method for checking performance is of course data-based cross-validation, and it works great; though I wish there was more general agreement about when to use its various forms (5-fold, 10-fold, leave-one-out, time-based progressive training, etc.) to answer different questions about your data and/or algorithm.
Nice post. I was going to leave a comment but it turned into an entire post of my own instead: http://mark.reid.name/iem/a-universality-cultist-responds.html
Mark,
I actually started thinking of the linear+memorization classifier after reading Pinker's "Words and Rules" -- so the example wasn't invented with the sole purpose of rocking boats. But that is not relevant to the discussion.
My main point is this. If I build a learning algorithm that exploits some prior knowledge about the data distribution, then the algorithm might do poorly on distributions outside my restricted set. But no matter what the universal bounds say in this situation, my learner would still be better for my problem than another learner that has better universal bounds (stipulating that my prior is defensible).
Since there is always something known (if nothing else, the fact that the distribution is smooth), what is the universal bound saying that cannot be better said by admitting that my learner works well only for a restricted set of distributions?
The universal bounds may be interesting for exploring the philosophy of learning but, for any real problems I can think about, it would be much more instructive to analyze a learner with respect to the restrictions on the data distribution under which it would be expected to do well.
-Harsha