## Deciding where to go: Explore/Exploit, the Multi-Armed Bandit Problem

In mission work, we face a particular kind of problem all the time: given limited resources (money, personnel, time) and a plethora of places we could go, how do we choose? For example, where are we most likely to find People of Peace, and where are we more likely to encounter hostility? Do we focus on building a local discipling infrastructure, or on forming apostolic teams that will be sent out?

We are not the only ones who face this problem. You’ll see it if you ever play the videogame Civilization, where you have to choose whether your cities build local infrastructure to exploit the territory they have, or build scouts to explore the surrounding area for new land on which to build more cities. The challenge belongs to a class of mathematical problems, the “explore/exploit tradeoff,” and since many others face this issue (web page testing, drug trials, presidential campaigns, and the like), a lot of money is spent on figuring out algorithms and best practices. We can learn from some of these.

The classic math scenario in Explore/Exploit is the “Multi-Armed Bandit Problem.” I won’t go into exactly why it’s named that – you can read more in-depth – but essentially the problem is this:

Imagine walking into a casino and deciding to play the slot machines.

There’s a row of machines, each of which has a different probability of paying a reward when you pull the lever. Some machines pay more – some much more – than the other machines, but you’re not sure which machine has the highest return. If you knew the best machine in advance, you’d just pull that lever all day long, but you don’t have a clue, and no one is going to tell you. The only way to find out is to start pulling levers, pay close attention, keep track of what works and what doesn’t, and do the math.

There’s a tradeoff to be made, however: when you choose to pull a lever you haven’t pulled before, you get new information about that option, and that information is valuable in finding the best overall machine. But pulling the less-tested lever has an opportunity cost: you’re not pulling the lever you currently think will give you the best return. There’s a risk that the lever you pull will return less than what you would’ve brought in pulling the current optimal lever, and that’s a very real cost.
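If it helps to see the casino scenario in motion, here is a minimal sketch in Python. It uses one simple, well-known strategy called “epsilon-greedy”: most of the time pull the lever with the best record so far, and a small fraction of the time pull a random lever to keep learning. The function name, the payout probabilities, and the 10% exploration rate are all made up for illustration; this is one possible approach, not the definitive answer to the problem.

```python
import random

def run_epsilon_greedy(payout_probs, pulls=1000, epsilon=0.1, seed=0):
    """Play a row of slot machines, mostly exploiting, sometimes exploring.

    payout_probs: hypothetical true win probability of each machine
                  (unknown to the player; used only to simulate payouts).
    epsilon:      assumed fraction of pulls spent on a random lever.
    """
    rng = random.Random(seed)
    n = len(payout_probs)
    counts = [0] * n   # how many times each lever was pulled
    wins = [0] * n     # how many times each lever paid out
    total_reward = 0

    for _ in range(pulls):
        untried = [i for i in range(n) if counts[i] == 0]
        if untried:
            arm = untried[0]            # pull every lever once to start
        elif rng.random() < epsilon:
            arm = rng.randrange(n)      # explore: any lever at random
        else:
            # exploit: the lever with the best observed win rate so far
            arm = max(range(n), key=lambda i: wins[i] / counts[i])

        reward = 1 if rng.random() < payout_probs[arm] else 0
        counts[arm] += 1
        wins[arm] += reward
        total_reward += reward

    return counts, wins, total_reward

# Three imaginary machines; the player does not know these numbers.
counts, wins, total = run_epsilon_greedy([0.2, 0.5, 0.6])
print("pulls per machine:", counts)
print("total payouts:", total)
```

Every pull spent on a random lever is the opportunity cost described above, and every pull spent on the current favorite is a pull that teaches you nothing about the others.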

You can see how this applies both in mission and in life. Which skill set is going to pay off the most? Which job is going to have the most opportunities? Which relationship will lead to marriage and happiness?

I’ve been reading more about this particular class of problems in a book, “Algorithms to Live By.” The book goes into the math in depth (both the explore/exploit tradeoff and the math behind the idea of minimizing regret), but the blog post cited above will get you started. (Also, Wikipedia has a pretty detailed entry on a variety of Multi-Armed Bandit scenarios.) The math boils down to this:

- Given two choices (A & B),
- and given that you know a bit about choice A (e.g. it has a 60/40 “win” rate after a few tries) and little about B (e.g. it has a 50/50 win rate with 2 tries)
- you should probably give choice B a go. Explore it more. It could very well have a higher win rate than choice A.
- Typically, about 4 to 6 “tries” seems to be enough to get a comparable pattern between two choices, and then you settle into “exploiting” (which is just a math term for mining-the-riches-of) the more effective choice.
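To put a rough number on “it could very well have a higher win rate,” here is a small sketch. It assumes choice A has won 6 of 10 tries and choice B has won 1 of 2 (my own numbers, picked to match the 60/40 and 50/50 rates above), and it uses a standard statistical device, the Beta distribution, to represent how uncertain each record leaves us. That technique is my illustration, not necessarily what the book uses, and the function name is hypothetical.

```python
import random

def chance_b_beats_a(a_wins, a_losses, b_wins, b_losses,
                     samples=20000, seed=0):
    """Estimate how likely it is that B's true win rate exceeds A's.

    Assumes we knew nothing about either choice beforehand (a uniform
    prior), so each record becomes a Beta(wins + 1, losses + 1) curve
    of plausible true win rates.
    """
    rng = random.Random(seed)
    b_better = 0
    for _ in range(samples):
        # Draw one plausible "true" win rate for each choice.
        rate_a = rng.betavariate(a_wins + 1, a_losses + 1)
        rate_b = rng.betavariate(b_wins + 1, b_losses + 1)
        if rate_b > rate_a:
            b_better += 1
    return b_better / samples

# A: 6 wins, 4 losses. B: 1 win, 1 loss (only two tries).
p = chance_b_beats_a(6, 4, 1, 1)
print(f"Estimated chance that B is really the better choice: {p:.0%}")
```

With only two tries on B, the estimate is nowhere near zero. The thin record on B leaves real room for it to be the better choice, which is why it deserves a few more tries before you settle.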

The Explore/Exploit problem is really about “when to settle.” It’s related to the “optimal stopping” problem. Unfortunately, we often settle a bit too early. The balance between exploring and exploiting needs to change over time: in the early stages of entering a particular time, job, place, or group of relationships, one needs to explore more. But once the options are well known, it’s time to “settle in” to one particular option.
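One simple way to picture that shifting balance is an exploration rate that starts high and shrinks as experience accumulates. The sketch below is purely illustrative: the function name, the formula, and the `half_life` parameter are assumptions of mine, not a rule taken from the book.

```python
def exploration_rate(step, half_life=50):
    """Fraction of effort to spend exploring after `step` tries.

    Starts at 1.0 (all exploring) and falls to 0.5 once `step`
    reaches `half_life`, then keeps shrinking toward zero.
    """
    return 1 / (1 + step / half_life)

for step in (0, 10, 50, 200, 1000):
    print(step, round(exploration_rate(step), 2))
```

A larger `half_life` means staying in the exploring phase longer, which is one guard against settling too early.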

(As a final note, “exploit” sounds bad. In the context of math, it just means “use the information” that you gathered during the exploration phase. “Finding a person to have a family with” would be the explore phase; “happy family meals” would be the exploit phase. One shouldn’t attribute a negative connotation to this particular usage of “exploit.”)
