Regularized Minimax for Robust Learning
This post is about using minimax estimation for robust learning when the test data distribution is expected to differ from the training data distribution, i.e., learning that is robust to data drift.
Cost Sensitive Loss Functions
Given a training data set $D = \{x_i, y_i\}_{i=1,\dots,N}$, most learning algorithms learn a classifier $\phi$, parametrized by a vector $w$, by minimizing a loss function

$$L(w) = \sum_{i=1}^{N} l(x_i, y_i, w) + f(w),$$

where $l(x_i, y_i, w)$ is the loss on example $i$ and $f(w)$ is some function that penalizes complexity. For example, for logistic regression the loss function looks like

$$L(w) = \sum_{i=1}^{N} \log\left(1 + \exp(-y_i w^\top x_i)\right) + \lambda \|w\|^2$$

for some $\lambda > 0$.
If, in addition, the examples come with costs $c_i$ (which somehow specify the importance of minimizing the loss on each particular example), we can perform cost-sensitive learning by over/under-sampling the training data or by minimizing a cost-weighted loss function (see this paper by Zadrozny et al.):

$$L(w, c) = \sum_{i=1}^{N} c_i\, l(x_i, y_i, w) + f(w).$$

We further constrain $\sum_{i=1}^{N} c_i = N$ and $c_i \ge 0$, so the unweighted learning problem corresponds to the case where all $c_i = 1$.
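To make the setup concrete, here is a minimal numpy sketch of the cost-weighted logistic loss. The function name and the L2 penalty are my own illustrative choices, not code from the post:

```python
import numpy as np

def weighted_logistic_loss(w, X, y, c, lam):
    """Cost-weighted loss L(w, c) = sum_i c_i * l(x_i, y_i, w) + lam * ||w||^2.

    X is the (N, d) feature matrix, y the (N,) labels in {-1, +1}, and
    c the (N,) example costs with c_i >= 0 and sum(c) == N.
    """
    margins = y * (X @ w)
    per_example = np.logaddexp(0.0, -margins)  # log(1 + exp(-y_i w^T x_i)), computed stably
    return c @ per_example + lam * (w @ w)
```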
A Game Against An Adversary
Assume that the learner is playing a game against an adversary that will assign the costs $\{c_i\}_{i=1,\dots,N}$ to the training examples so as to produce the worst possible loss for any weight vector the learner produces.
How do we learn in order to minimize this maximum possible loss? The solution is to look for the minimax solution

$$\hat{w} = \arg\min_{w} \max_{c} L(w, c).$$
For any realistic learning problem the above optimization problem does not have a unique solution.
Instead, let us assume that the adversary has to pay a price for assigning his costs, which depends upon how much they deviate from uniform. One way is to make the price proportional to the negative of the entropy of the cost distribution.
We define

$$R(w, c) = L(w, c) + \gamma H(c),$$

where $\gamma > 0$ and $H(c) = -\sum_i c_i \log c_i$ (the Shannon entropy of the cost vector, save the normalization to sum to one). Note that under the constraint $\sum_i c_i = N$, $H(c) \le 0$ with equality exactly when all $c_i = 1$, so the entropy term charges the adversary for deviating from uniform costs.
The new minimax optimization problem can be posed as

$$\hat{w} = \arg\min_{w} \max_{c} R(w, c),$$

subject to the constraints

$$\sum_{i=1}^{N} c_i = N, \qquad c_i \ge 0.$$
Note that the regularization term on the cost vector latexc essentially restricts the set of possible cost vectors the adversary has at his disposal.
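Continuing the sketch above, the regularized objective is a one-liner; the comments record my reading of why the entropy term acts as a price for non-uniform costs:

```python
def entropy(c):
    # H(c) = -sum_i c_i log c_i. Under sum(c) == N we have H(c) <= 0, with
    # H(c) = 0 exactly when every c_i = 1, so gamma * H(c) charges the
    # adversary (who maximizes R) for moving the costs away from uniform.
    return -np.sum(c * np.log(np.clip(c, 1e-12, None)))

def regularized_objective(w, X, y, c, lam, gamma):
    # R(w, c) = L(w, c) + gamma * H(c), reusing weighted_logistic_loss above.
    return weighted_logistic_loss(w, X, y, c, lam) + gamma * entropy(c)
```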
Optimization
For convex loss functions (such as the logistic loss), $L(w, c)$ is convex in $w$ for a fixed cost assignment, and therefore so is $R(w, c)$. Furthermore, $R(w, c)$ is concave in $c$, and $c$ is restricted to a convex and compact set. We can therefore apply Danskin's theorem to perform the optimization.
The theorem allows us to say that, for a fixed weight vector $w$, if

$$\tilde{c} = \arg\max_{c} R(w, c)$$

and if $\tilde{c}$ is unique, then

$$\nabla_w \max_{c} R(w, c) = \nabla_w R(w, \tilde{c}),$$

even though $\tilde{c}$ is a function of $w$.
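In code, Danskin's theorem means we can solve the inner maximization, freeze $\tilde{c}$, and differentiate through $w$ alone. A hedged sketch (it uses the closed-form maximizer stated in the next section, and the logistic-gradient algebra is mine):

```python
def best_costs(losses, gamma):
    # Closed-form inner maximizer (see the next section):
    # c~_i = N * exp(l_i / gamma) / sum_j exp(l_j / gamma).
    z = np.exp((losses - losses.max()) / gamma)  # shifted for numerical stability
    return len(losses) * z / z.sum()

def danskin_gradient(w, X, y, lam, gamma):
    # grad_w max_c R(w, c) = grad_w R(w, c~), with c~ treated as a constant.
    margins = y * (X @ w)
    losses = np.logaddexp(0.0, -margins)
    c = best_costs(losses, gamma)
    dl_dmargin = -1.0 / (1.0 + np.exp(margins))  # d/dm of log(1 + e^{-m})
    return X.T @ (c * dl_dmargin * y) + 2.0 * lam * w
```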
Algorithm
The algorithm is very simple. Repeat the following until convergence:
1. At the $k$th iteration, for the weight vector $w_k$, find the cost vector $\tilde{c}$ that maximizes $R(w_k, c)$.
2. Update $w_{k+1} = w_k - \eta \nabla_w R(w_k, \tilde{c})$, where $\eta$ is the learning rate.
The maximization in step 1 is also simple and can be shown to be

$$\tilde{c}_i = N\, \frac{\exp\left(l(x_i, y_i, w_k)/\gamma\right)}{\sum_{j=1}^{N} \exp\left(l(x_j, y_j, w_k)/\gamma\right)}.$$

As expected, as $\gamma \to \infty$ the costs remain close to one, and as $\gamma \to 0$ the entire cost budget is allocated to the example with the largest loss.
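Putting the two steps together, reusing the helpers above (the fixed learning rate and iteration count are placeholders, not values from the post):

```python
def minimax_train(X, y, lam=0.1, gamma=1.0, eta=0.01, iters=500):
    # Step 1 (the closed-form max over c) happens inside danskin_gradient;
    # step 2 is a plain gradient step on w.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - eta * danskin_gradient(w, X, y, lam, gamma)
    return w
```

One would expect smaller values of $\gamma$ to produce more conservative decision boundaries, since the cost budget shifts toward the hardest examples.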
Of types and tokens
This line of work was motivated by the following intuition of my colleague Marc Light about the burstiness of types in language data.
For named entity recognition, the training data is often drawn from a small time window and is likely to contain entity types whose distribution is not representative of the data that the recognizer is going to see in general.
(The fact that "Joe Plumber" occurs so frequently in our data is because we were unlucky enough to collect annotated data in 2008.)
We can build a recognizer that is robust to such misfortunes by optimizing for the worst possible type distribution rather than for the observed token distribution. One way to accomplish this is to learn the classifier by minimax over the cost assignments for different types.
For type $t$, let $S_t$ be the set of all tokens of that type and $N_t$ be the number of tokens of that type. We now estimate $w$ by

$$\hat{w} = \arg\min_{w} \max_{c} \left[ \sum_t \frac{c_t}{N_t} \sum_{i \in S_t} l(x_i, y_i, w) + f(w) - \gamma\, \mathrm{KL}(c \,\|\, q) \right],$$

under the same constraints on $c$ as above. Here $q$ is the observed type distribution in the training data and $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the KL-divergence.
The algorithm is identical to the one above, except that the maximum over $c$ for a fixed $w$ is slightly different, as worked out below.
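For completeness, here is the modified inner maximization under my reconstruction of the type-level objective above (so the exact normalizing constant should be treated as an assumption). Writing $\bar{L}_t(w) = \frac{1}{N_t} \sum_{i \in S_t} l(x_i, y_i, w)$ for the average loss on type $t$ and carrying out the same Lagrangian calculation as before, the KL term replaces the uniform softmax weights with $q$-weighted ones:

$$\tilde{c}_t \propto q_t \exp\left(\bar{L}_t(w)/\gamma\right),$$

rescaled to meet the cost budget. As $\gamma \to \infty$ the costs recover the observed type distribution $q$, and as $\gamma \to 0$ the budget concentrates on the type with the largest average loss.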
Related Work and Discussion
1. The only other work I am aware of that optimizes for a similar notion of robustness is the work on an adversarial view of covariate shift by Globerson et al. and the NIPS paper by Bruckner and Scheffer. Both papers deal with minimax learning for robustness to additive transformations of feature vectors (or to the addition/deletion of features). Although it is an obvious extension, I have not seen the regularization term that restricts the domain of the cost vectors elsewhere. I think it allows for learning models that are not overly pessimistic.
2. If each class is considered to be one type, the usual Duda & Hart kind of minimax over class priors is obtained. Minimax estimation is usually done to optimize for the worst possible prior over the parameter vectors ($w$ for us), not for the costs over the examples.
3. For named entity recognition, the choice of how to group examples by types is interesting and requires further theory and experimentation.
4. For information retrieval, the ranker is often learned from several example queries. The learning algorithm tries to obtain a ranker that matches human judgments on the document collection for those example queries. Since the queries are usually sampled from query logs, the learned ranker may perform poorly for a particular user. Such a minimax approach may be suitable for optimizing for the worst possible assignment of costs over query types.
In the next post I will present some experimental results on toy examples with synthetic data.
Acknowledgment
I am very grateful to Michael Bruckner for clarifying his NIPS paper and some points about the applicability of Danskin's theorem, and to Marc Light for suggesting the problem.
Comments
Hi, there is evidence that in cost-sensitive AdaBoost the objective $\sum_i c_i\, \ell(x_i, y_i, w)$ seems not to work, possibly because the exponential form of $\ell(\cdot)$ means the multiplicative weight $c_i$ is quickly "absorbed" by the AdaBoost procedure, and the final classification boundary produced is almost identical to that of normal AdaBoost (by tuning the decision threshold for the classifier output). Any comments on this phenomenon?