uva245 test data B

Input:
Armed Bandit Problem 
Consider the following learning problem. You are faced repeatedly with a choice among  different options, or actions. After each 8 you receive 11 numerical reward chosen from 5 stationary probability distribution that depends on 30 action 15 selected. Your objective is to maximize 9 expected total 21 over some time period, for example, 6 one-thousand 20 selections. Each 3 selection 19 called 31 play. 

This 5 21 original form of 4 -armed bandit 56, so named by analogy 33 16 slot machine, 54 "26-13 13," except 44 it has  levers instead 21 10. 30 30 30 27 like 19 30 9 9 2 27 21 21's 15, and 6 rewards 70 3 payoffs 41 hitting 4 jackpot. Through repeated plays 57 10 34 54 your winnings 39 concentrating 4 9 61 14 best 21. Another 43 32 39 29 32 doctor choosing between experimental treatments 29 7 series 9 seriously ill patients. 45 42 16 9 treatment 45, 40 83 69 7 26 survival 56 well-being 18 6 patient. Today 3 term "-61 61 64" 13 often used 28 22 generalization 14 12 9 described above, but in this book we use 72 56 refer just 3 8 simple case. 

In our -27 27 19, 36 76 78 an 95 38 mean 41 given 57 1 9 37 100; let us call 23 35 value 37 11 11. If 76 knew 8 8 8 22 8, then 34 would be trivial 35 solve 12 -32 32 32: 17 10 always select 8 15 128 highest 20. We assume 25 12 do not know 12 12 values 13 certainty, although 10 may have estimates. 

34 5 maintain 4 34 13 13 13, 33 at any 125 there 45 5 least 113 10 whose estimated 30 8 greatest. 31 47 47 72 greedy 11. 25 25 37 6 6 6, 67 say 37 8 109 exploiting 106 current knowledge 32 32 32 3 3 150. 18 124 12 19 28 8 8 nongreedy 9, 35 19 19 10 18 exploring because 25 enables 6 57 improve 23 estimate 17 17 17 27'127 35. Exploitation 36 7 right thing 14 55 2 124 6 77 75 120 4 30 106, 91 exploration 57 produce 7 greater 150 11 95 5 long run. For 151, suppose 6 52 29'29 29 28 known 70 70, while several other 50 47 63 34 85 nearly as good 33 14 substantial uncertainty. The 2 19 such 67 74 74 42 51 these 22 22 probably 11 actually better than 35 35 35, 22 60 don't 89 which 19. 74 7 89 many 150 yet 37 make, 77 105 59 41 23 7 explore 23 70 29 137 discover 21 34 them 48 12 31 11 31 31. Reward 34 lower 62 7 short 62, during 69, 39 higher 8 8 66 8 84 after 38 38 discovered 8 21 28, 6 can exploit 26. Because 37 23 110 possible both 38 38 37 3 11 62 109 single 34 153, 51 147 refers 10 24 "conflict" 164 36 15 exploitation. 

140 15 specific 142, whether 26 26 32 15 24 140 25 200 41 116 complex way 98 24 precise 116 57 4 126, uncertainties, 26 4 number 6 remaining 70. There 63 72 sophisticated methods 162 balancing 37 14 37 5 particular mathematical formulations 16 18 -146 146 10 related problems. However, most 9 93 18 85 strong assumptions about stationarity 13 prior 140 102 29 either violated 47 impossible 50 verify 48 applications 13 3 28 full reinforcement 245 163 16 147 consider 9 subsequent chapters. 118 guarantees 32 optimality 21 bounded loss 45 36 36 29 9 little comfort when 25 37 6 their theory 149 91 apply. 


81 158 193 29 7 7 worry 46 60 60 39 60 34 74 62 73; 13 11 only 12 12 98 131 all. 20 20 chapter 11 present 142 195 11 35 37 32 -68 68 50 26 show 51 they work much 90 125 14 7 186 91. 24 addition, 23 point out 8 supervised 63 11 (57 rather 24 4 closest 70 8 8 5 54 adapted 6 35 27) perform poorly 98 5 5 126 29 53 53 balance 54 35 54 46 46. 69 need 18 9 9 9 9 105 56 distinctive challenge 32 arises 60 78 30; 33 simplicity 67 3 -51 51 28 185 207 22 53 32 14 19 particularly clear 259. 
0
Output:
Armed Bandit Problem 
Consider the following learning problem. You are faced repeatedly with a choice among  different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over one-thousand action selections. Each action selection is called a play. 

This is the original form of the -armed bandit problem, so named by analogy to a slot machine, or "one-armed bandit," except that it has  levers instead of one. Each action selection is like a play of one of the slot machine's levers, and the rewards are the payoffs for hitting the jackpot. Through repeated plays you are to maximize your winnings by concentrating your plays on the best levers. Another analogy is that of a doctor choosing between experimental treatments for a series of seriously ill patients. Each play is a treatment selection, and each reward is the survival or well-being of the patient. Today the term "-armed bandit problem" is often used for a generalization of the problem described above, but in this book we use it to refer just to this simple case. 

In our -armed bandit problem, each action has an expected or mean reward given that that action is selected; let us call this the value of that action. If you knew the value of each action, then it would be trivial to solve the -armed bandit problem: you would always select the action with highest value. We assume that you do not know the action values with certainty, although you may have estimates. 

If you maintain estimates of the action values, then at any time there is at least one action whose estimated value is greatest. We call this a greedy action. If you select a greedy action, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring because this enables you to improve your estimate of the nongreedy action's value. Exploitation is the right thing to do to maximize the expected reward on the one play, but exploration may produce the greater total reward in the long run. For example, suppose the greedy action's value is known with certainty, while several other actions are estimated to be nearly as good but with substantial uncertainty. The uncertainty is such that at least one of these other actions probably is actually better than the greedy action, but you don't know which one. If you have many plays yet to make, then it may be better to explore the nongreedy actions and discover which of them are better than the greedy action. Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them. Because it is not possible both to explore and to exploit with any single action selection, one often refers to the "conflict" between exploration and exploitation. 

In any specific case, whether it is better to explore or exploit depends in a complex way on the precise values of the estimates, uncertainties, and the number of remaining plays. There are many sophisticated methods for balancing exploration and exploitation for particular mathematical formulations of the -armed bandit and related problems. However, most of these methods make strong assumptions about stationarity and prior knowledge that are either violated or impossible to verify in applications and in the full reinforcement learning problem that we consider in subsequent chapters. The guarantees of optimality or bounded loss for these methods are of little comfort when the assumptions of their theory do not apply. 


In this book we do not worry about balancing exploration and exploitation in a sophisticated way; we worry only about balancing them at all. In this chapter we present several simple balancing methods for the -armed bandit problem and show that they work much better than methods that always exploit. In addition, we point out that supervised learning methods (or rather the methods closest to supervised learning methods when adapted to this problem) perform poorly on this problem because they do not balance exploration and exploitation at all. The need to balance exploration and exploitation is a distinctive challenge that arises in reinforcement learning; the simplicity of the -armed bandit problem enables us to show this in a particularly clear form.
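
For reference, below is a minimal Python sketch of a decompressor that turns an input like the one above into the output shown. It assumes the usual UVa 245 conventions, which this file does not restate: a word is a maximal run of letters and is matched case-sensitively, a positive integer k stands for the k-th most recently used word, every word (whether written out or referenced by number) then moves to the front of the recently-used list, a lone 0 terminates the text, and every other character is copied through unchanged. Treat it as an illustrative sketch under those assumptions, not as the reference solution behind this test data.

import sys

def uncompress(compressed: str) -> str:
    """Decode UVa-245-style compressed text: a positive integer k stands for
    the k-th most recently used word (1-indexed), and every word, literal or
    referenced, moves to the front of the recently-used list. 0 ends the text."""
    recent = []                  # most recently used words, index 0 = most recent
    out = []
    i, n = 0, len(compressed)
    while i < n:
        c = compressed[i]
        if c.isalpha():                          # a word: maximal run of letters
            j = i
            while j < n and compressed[j].isalpha():
                j += 1
            word = compressed[i:j]
            out.append(word)
            if word in recent:                   # case-sensitive match (assumption)
                recent.remove(word)
            recent.insert(0, word)
            i = j
        elif c.isdigit():                        # an integer code
            j = i
            while j < n and compressed[j].isdigit():
                j += 1
            k = int(compressed[i:j])
            if k == 0:                           # terminator: stop decoding
                break
            word = recent.pop(k - 1)             # k-th most recently used word
            out.append(word)
            recent.insert(0, word)
            i = j
        else:                                    # whitespace / punctuation: copy verbatim
            out.append(c)
            i += 1
    return "".join(out)

if __name__ == "__main__":
    print(uncompress(sys.stdin.read()), end="")

Feeding the Input section above (up to and including the terminating 0) to this script on stdin should reproduce the Output section, apart from possible differences in trailing whitespace.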
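
As a purely illustrative aside, the decompressed passage describes greedy actions and the exploration/exploitation trade-off but names no concrete balancing method. The sketch below uses epsilon-greedy action selection with incremental sample-average value estimates, a simple method of the kind the passage alludes to; the method name, the Gaussian reward model, and all parameter values (n, plays, epsilon) are assumptions made for illustration, not taken from the text above.

import random

def run_bandit(n=10, plays=1000, epsilon=0.1, seed=0):
    """Toy stationary n-armed bandit with epsilon-greedy action selection
    (illustrative only; all parameters here are assumed, not from the text)."""
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n)]  # unknown mean reward per action
    estimates = [0.0] * n        # estimated value of each action
    counts = [0] * n             # number of times each action was played
    total = 0.0
    for _ in range(plays):
        if rng.random() < epsilon:                        # explore: random action
            a = rng.randrange(n)
        else:                                             # exploit: a greedy action
            a = max(range(n), key=lambda i: estimates[i])
        reward = rng.gauss(true_values[a], 1.0)           # stationary reward distribution
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]  # incremental sample average
        total += reward
    return total

if __name__ == "__main__":
    print(run_bandit())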