The Contextual Bandits Problem: A Fast, Simple, and Optimal Algorithm
Microsoft Research (NYC)
We study the general problem of how to learn through experience to make intelligent decisions. In this setting, called the contextual bandits problem, the learner must repeatedly decide which action to take in response to an observed context, and is then permitted to observe the received reward, but only for the chosen action. The goal is to learn through experience to behave nearly as well as the best policy (or decision rule) in some possibly very large and rich space of candidate policies. Previous approaches to this problem were all highly inefficient and often extremely complicated. In this work, we present a fast and simple algorithm that learns to behave as well as the best policy at a rate that is (almost) statistically optimal. Our approach assumes access to a kind of “oracle” (or subroutine) for classification learning problems which can be used to select policies; in practice, most off-the-shelf classification algorithms could be used for this purpose. Our algorithm makes very modest use of the oracle, which it calls far less than once per round, on average, a huge improvement over previous methods. These properties suggest this may be the most practical contextual bandits algorithm among all existing approaches that are provably effective for general policy classes.
This is joint work with Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford and Lihong Li.