1. Introduction to Q-learning
Reinforcement learning is a machine learning paradigm, and Q-learning is one of its best-known algorithms. The problem it solves is the following:
An agent that can perceive its environment (rewards and penalties), without supervised labels, learns how to choose actions so as to maximize its reward.
In other words, by learning an action-value function, Q-learning can find an action-selection policy in a finite Markov decision process (MDP). Mathematically, the Q-learning update is (a minimal code sketch follows the parameter list below):
$Q(s_{t},a_{t}) \leftarrow \underbrace{Q(s_{t},a_{t})}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \Big( \overbrace{\underbrace{r_{t+1}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a} Q(s_{t+1},a)}_{\text{estimate of optimal future value}}}^{\text{learned value}} - \underbrace{Q(s_{t},a_{t})}_{\text{old value}} \Big)$
- Learning rate: how much weight new information carries; same meaning as the learning rate in ordinary machine learning.
- Discount factor: how important future rewards are.
  - Near 0: short-sighted, cares only about the immediate reward.
  - Near 1: far-sighted; note that values that reach or exceed 1 easily cause the learning to diverge.
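To make the update rule concrete, below is a minimal sketch of tabular Q-learning with an epsilon-greedy policy. The environment interface assumed here (reset() returning a state index, step() returning (next_state, reward, done)) and all hyperparameter values are illustrative assumptions, not part of the cart-pole program later in this post.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    # Q-table: one row per state, one column per action
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                     # assumed to return an integer state index
        done = False
        while not done:
            # epsilon-greedy exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)   # assumed interface: (next_state, reward, done)
            # the update rule above: old value + learning rate * (learned value - old value)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q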
Here is a simple Q-learning tutorial; the example is very clear and easy to follow!
A Painless Q-learning Tutorial (一個 Q-learning 算法的簡明教程)
http://blog.csdn.net/itplus/article/details/9361915
2. Solving the Cart-Pole Balancing Problem with Q-learning
The cart-pole balancing problem is a classic reinforcement-learning example: given the pole angle, the cart position, and their rates of change, decide whether to push the cart left or right so that the pole stays balanced.
In the paper Neuron-like adaptive elements that can solve difficult learning control problems [1], two neural-network-like elements, combined with reinforcement-learning updates, solve this difficult cart-pole balancing problem. The overall system structure is as follows:
- Box system: converts the 4-dimensional input state vector (x, x_dot, theta, theta_dot) into a 162-dimensional state vector. The cut points used for x and theta (e.g. x: ±0.8, ±2.4 m) are based on prior theory and experience (see the discretization sketch after this list).
- Adaptive critic element (ACE): produces an improved reward signal, using the temporal-difference (TD) method.
- Associative search element (ASE): takes the state vector produced by the box system and the improved reward, and produces an action (push left / push right).
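Where does 162 come from? 3 position bins × 3 velocity bins × 6 angle bins × 3 angular-velocity bins = 162. The sketch below re-implements that discretization with np.digitize, using the same cut points as the Box() function in Section 4; the index ordering and the names box_index / box_state are my own illustrative choices, not the original code.

import numpy as np

deg = np.pi / 180.0                                   # degrees -> radians
X_BINS         = [-0.8, 0.8]                          # 3 bins for cart position x
X_DOT_BINS     = [-0.5, 0.5]                          # 3 bins for cart velocity
THETA_BINS     = [-6*deg, -1*deg, 0.0, 1*deg, 6*deg]  # 6 bins for pole angle
THETA_DOT_BINS = [-50*deg, 50*deg]                    # 3 bins for pole angular velocity

def box_index(x, x_dot, theta, theta_dot):
    # Combine the four bin indices into one index in [0, 161] (3*3*6*3 = 162 boxes)
    i = np.digitize(x, X_BINS)
    j = np.digitize(x_dot, X_DOT_BINS)
    k = np.digitize(theta, THETA_BINS)
    l = np.digitize(theta_dot, THETA_DOT_BINS)
    return ((i * 3 + j) * 6 + k) * 3 + l

def box_state(ob):
    # 162-dimensional one-hot state vector, the form consumed by ACE/ASE
    state = np.zeros(162)
    state[box_index(*ob)] = 1
    return state

print(box_state([0.0, 0.0, 0.0, 0.0]).argmax())       # which box the upright, centered state falls into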
Implementation details of the adaptive critic element (ACE)
def ACE(learn, decay, reward, gamma, p_before):
    # ACE : generate [improved reinforcement signal (reward_hat)]
    global v, x_vec, x_bar
    if reward == -1:                 # failure: the prediction for the terminal state is defined as 0
        p = 0
    else:
        p = v.dot(x_vec)             # predicted value of the current state
    reward_hat = reward + gamma*p - p_before   # TD-style internal reinforcement: r + gamma*p(t) - p(t-1)
    v += learn * reward_hat * x_bar            # update critic weights along the critic eligibility trace
    x_bar = decay*x_bar + (1-decay)*x_vec      # decay the trace and add the current state
    return reward_hat, p
Implementation details of the associative search element (ASE)
def ASE(learn, decay, reward):
    # ASE : generate [action]
    global w, x_vec, e
    sigma = 0.01
    noise = sigma*np.random.randn()            # Gaussian noise for exploration
    y = Activation(w.dot(x_vec) + noise)       # action: 0 (push left) or 1 (push right)
    w += learn * reward * e                    # update action weights with the improved reward signal
    e = decay*e + (1-decay)*(y*2-1)*x_vec      # eligibility trace: which action was taken in which state
    return y
[Food for thought] Both parts are indispensable
Both the box system and the ACE can be removed and the system still runs, but in my tests the results were poor in both cases (the pole could not survive beyond 500 steps). So using domain knowledge to expand the state-vector dimension and dynamically adjusting the reward signal are both keys to success.
3. Playing with OpenAI Gym
OpenAI Gym is a reinforcement-learning toolkit that provides a variety of classic learning environments, so anyone can easily simulate and share their own reinforcement-learning algorithms. Below is my simulation result: the cart wiggles back and forth but the pole never falls. Quite fun XD
Many other people's solutions can be viewed on the OpenAI Gym platform, along with many other interesting problems!
https://gym.openai.com/envs/CartPole-v0
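If you just want to get a feel for the Gym API before reading the ASE/ACE program in Section 4, a minimal random-agent loop looks like the sketch below. It uses the same classic reset/step interface as the code later in this post; it is only a warm-up example, not part of the solution.

import gym

env = gym.make('CartPole-v0')
for episode in range(5):
    ob = env.reset()
    steps, done = 0, False
    while not done:
        action = env.action_space.sample()        # pick a random action (0 = left, 1 = right)
        ob, reward, done, info = env.step(action)
        steps += 1
    print("episode {} lasted {} steps".format(episode, steps))
env.close()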
4. Code and Execution Results
My Python version (including the OpenAI Gym simulation):
pole.py
import numpy as np
import gym
from gym.wrappers import Monitor

BOX_DIM = 162
MAX_STEPS = 100000
TRAILS = 200

x_vec = np.zeros(BOX_DIM)   # state vector
w = np.zeros(BOX_DIM)       # action weights
v = np.zeros(BOX_DIM)       # critic weights
e = np.zeros(BOX_DIM)       # action weight eligibilities
x_bar = np.zeros(BOX_DIM)   # critic weight eligibilities

def Activation(x):
    # Activation function : [step function]
    if x >= 0:
        return 1
    else:
        return 0

def ACE(learn, decay, reward, gamma, p_before):
    # ACE : generate [improved reinforcement signal (reward_hat)]
    global v, x_vec, x_bar
    if reward == -1:
        p = 0
    else:
        p = v.dot(x_vec)
    reward_hat = reward + gamma*p - p_before
    v += learn * reward_hat * x_bar
    x_bar = decay*x_bar + (1-decay)*x_vec
    return reward_hat, p

def ASE(learn, decay, reward):
    # ASE : generate [action]
    global w, x_vec, e
    sigma = 0.01
    noise = sigma*np.random.randn()
    y = Activation(w.dot(x_vec) + noise)
    w += learn * reward * e
    e = decay*e + (1-decay)*(y*2-1)*x_vec
    return y

def Box(ob):
    # box system : [4-dim state] to [162-dim state]
    x, x_dot, theta, theta_dot = ob
    box = 0
    one_degree = 0.0174532
    six_degrees = 0.1047192
    twelve_degrees = 0.2094384
    fifty_degrees = 0.87266
    if x < -2.4 or x > 2.4 or theta < -1*twelve_degrees or theta > twelve_degrees:
        return Box([0, 0, 0, 0])
    if x < -0.8:
        box = 0
    elif x < 0.8:
        box = 1
    else:
        box = 2
    if x_dot < -0.5:
        box = box
    elif x_dot < 0.5:
        box += 3
    else:
        box += 6
    if theta < -1*six_degrees:
        box = box
    elif theta < -1*one_degree:
        box += 9
    elif theta < 0:
        box += 18
    elif theta < one_degree:
        box += 27
    elif theta < six_degrees:
        box += 36
    else:
        box += 45
    if theta_dot < -fifty_degrees:
        box = box
    elif theta_dot < fifty_degrees:
        box += 54
    else:
        box += 108
    state = np.zeros(BOX_DIM)
    state[box] = 1
    return state

### Simulation by using OpenAI Gym
### https://gym.openai.com/docs
env = gym.make('CartPole-v0')
# env = Monitor(env, '/home/jack/Desktop/cart-pole', force=True)
for i in range(0, TRAILS):
    ob = env.reset()
    p_before = 0
    for j in range(0, MAX_STEPS):
        x_vec = Box(ob)
        reward_hat, p_before = ACE(learn=0.5, decay=0.8, reward=0, gamma=0.95, p_before=p_before)
        action = ASE(learn=1000, decay=0.9, reward=reward_hat)
        if j > 30000:
            env.render()
        ob, _, done, _ = env.step(action)
        if done:
            x_vec = Box(ob)
            reward_hat, p_before = ACE(learn=0.5, decay=0.8, reward=-1, gamma=0.95, p_before=p_before)
            ASE(learn=1000, decay=0.9, reward=reward_hat)
            break
    if i % 10 == 0:
        print("Trial {0:3} was {1:5} steps".format(i, j))
    if j == MAX_STEPS-1:
        print("Pole balanced successfully for at least {} steps at Trail {}".format(MAX_STEPS, i))
        break
# env.close()
Execution results
[2016-12-31 16:18:56,822] Making new env: CartPole-v0
Trial   0 was    13 steps
Trial  10 was    17 steps
Trial  20 was   594 steps
Trial  30 was  4656 steps
Trial  40 was  2940 steps
Trial  50 was  4715 steps
Pole balanced successfully for at least 100000 steps at Trail 58
[Addendum] The 200-step limit of the gym env (2017.12.9)
The gym simulator now caps each episode at 200 steps by default, so to run the tens of thousands of steps shown above you have to change the environment settings yourself (https://github.com/openai/gym/issues/463). One workable approach is shown below:
gym.envs.register(
    id='CartPole-v3',
    entry_point='gym.envs.classic_control:CartPoleEnv',
    tags={'wrapper_config.TimeLimit.max_episode_steps': 50000},
    reward_threshold=195.0,
)
env = gym.make('CartPole-v3')
P.S. Thanks to commenter Huang for pointing this out :)
References
[1] Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13, no. 5, pp. 834-846.
http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf
[2] Wikipedia: Q-learning
https://en.wikipedia.org/wiki/Q-learning
[3] A Painless Q-learning Tutorial (一個 Q-learning 算法的簡明教程)
http://blog.csdn.net/itplus/article/details/9361915
[4] OpenAI Gym
https://gym.openai.com/