DeepReinforceLearning
These notes mainly cover DQN, policy gradient, and AlphaGo / AlphaGo Zero in deep reinforcement learning.
DQN:
Use a neural network 𝑄(𝑠,𝑎;𝐰) to approximate 𝑄⋆(𝑠,𝑎), then select the best (most valuable) action in the current state.
The Q-function describes the value of taking each action in the real environment.
Temporal Difference (TD) learning makes the action-value function Q more accurate,
because 𝑈t = 𝑅t + 𝛾⋅𝑈t+1.
prediction: 𝑄(𝑠t,𝑎t;𝐰t)
TD target: yt = 𝑟t + 𝛾⋅max𝑎 𝑄(𝑠t+1,𝑎;𝐰t)
loss: Lt = 1/2⋅[𝑄(𝑠t,𝑎t;𝐰t) − yt]^2
Gradient descent: 𝐰t+1 = 𝐰t − 𝛼⋅(∂Lt/∂𝐰)|𝐰=𝐰t
DQN is off-policy: like the idea behind "behavior cloning", it can learn from actions chosen by a different (behavior) policy.
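As a concrete illustration, here is a minimal sketch of one DQN TD step, assuming PyTorch; the network size, 𝛾, learning rate, and the dummy transition are placeholders, not values from these notes.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma, lr = 4, 2, 0.99, 1e-3   # illustrative values

# Q(s, a; w): one network that outputs a value for every action in state s.
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=lr)

def td_step(s, a, r, s_next, done):
    """One gradient-descent step on Lt = 1/2 * (Q(st, at; w) - yt)^2."""
    q_sa = q_net(s)[a]                                   # prediction Q(st, at; wt)
    with torch.no_grad():                                # TD target is treated as a constant
        y = r + gamma * (1.0 - done) * q_net(s_next).max()
    loss = 0.5 * (q_sa - y) ** 2
    optimizer.zero_grad()
    loss.backward()                                      # dLt / dw
    optimizer.step()                                     # w_{t+1} = w_t - alpha * dLt/dw
    return loss.item()

# Off-policy usage: the transition (s, a, r, s') can come from any behavior policy;
# here it is just a dummy transition.
s, s_next = torch.randn(state_dim), torch.randn(state_dim)
td_step(s, a=0, r=1.0, s_next=s_next, done=0.0)
```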
Policy Gradient:
Neural networks are used to approximate both the value and the policy, making the value estimate more accurate and the policy perform better.
𝑉𝜋(𝑠) = ∑𝑎 𝜋(𝑎|𝑠)⋅𝑄𝜋(𝑠,𝑎). This contains two parts: an actor (the policy) and a critic (the value).
𝜋(𝑎|𝑠;𝛉) approximates 𝜋(𝑎|𝑠); 𝑞(𝑠,𝑎;𝐰) approximates 𝑄𝜋(𝑠,𝑎)
𝑉(𝑠;𝛉,𝐰) = ∑𝑎 𝜋(𝑎|𝑠;𝛉)⋅𝑞(𝑠,𝑎;𝐰)
𝐠(𝑎,𝛉) = [∂log𝜋(𝑎|𝑠;𝛉)/∂𝛉]⋅𝑞(𝑠,𝑎;𝐰)
∂𝑉(𝑠;𝛉,𝐰)/∂𝛉 = 𝔼𝐴[𝐠(𝐴,𝛉)], because 𝜋(⋅|𝑠;𝛉) is a probability distribution over actions and 𝐴 ~ 𝜋(⋅|𝑠;𝛉).
The actor updates 𝛉 by stochastic gradient ascent along 𝐠(𝑎,𝛉); the critic updates 𝐰 with TD learning, as in DQN.
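Below is a minimal actor-critic sketch of these two updates, again assuming PyTorch and a discrete action space; the network shapes, hyper-parameters, and the SARSA-style critic target are illustrative assumptions.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 3, 0.99                # illustrative values

# Actor pi(a|s; theta) and critic q(s, a; w).
policy_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_actions), nn.Softmax(dim=-1))
value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, n_actions))
actor_opt = torch.optim.SGD(policy_net.parameters(), lr=1e-3)
critic_opt = torch.optim.SGD(value_net.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s_next, a_next):
    # Critic: TD update of w (same idea as in DQN; here a SARSA-style target).
    q_sa = value_net(s)[a]
    with torch.no_grad():
        y = r + gamma * value_net(s_next)[a_next]
    critic_loss = 0.5 * (q_sa - y) ** 2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: stochastic policy gradient g(a, theta) = [dlog pi(a|s;theta)/dtheta] * q(s, a; w).
    # Gradient ascent on V, so we minimize the negative.
    log_pi = torch.log(policy_net(s)[a])
    actor_loss = -log_pi * q_sa.detach()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Usage with a dummy on-policy transition (s, a, r, s', a').
s, s_next = torch.randn(state_dim), torch.randn(state_dim)
actor_critic_step(s, a=1, r=0.5, s_next=s_next, a_next=2)
```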
AlphaGo (MCTS)
AlphaGo and AlphaGo Zero choose moves with Monte Carlo Tree Search (MCTS); each search iteration has 4 steps (see the sketch after this list):
1. Selection: The player picks an action 𝑎 with a high action-value plus exploration bonus. (Imaginary action; not an actual move.)
2. Expansion: The opponent makes an action and the state updates. (Also an imaginary action, sampled from the policy network.)
3. Evaluation: Evaluate the new state with the value network to get score 𝑣, and play the game out to the end to receive reward 𝑟; assign the score (𝑣+𝑟)/2.
4. Backup: Propagate the score (𝑣+𝑟)/2 back to action 𝑎 to update its action-value.
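The sketch below walks through these four steps in code. It is a heavily simplified illustration: policy_fn, value_fn, Node, puct_score, and the toy one-dimensional game are all assumptions for demonstration, not AlphaGo's actual networks, game, or selection formula.

```python
import math
import random

def policy_fn(state, actions):            # stand-in for the policy network pi(a|s)
    return {a: 1.0 / len(actions) for a in actions}

def value_fn(state):                       # stand-in for the value network, score v
    return random.uniform(-1, 1)

class Node:
    def __init__(self, state, prior=1.0):
        self.state, self.prior = state, prior
        self.children = {}                 # action -> child Node
        self.N, self.W = 0, 0.0            # visit count, total backed-up score

    def Q(self):                           # mean action-value of this node
        return self.W / self.N if self.N else 0.0

def puct_score(parent, child, c=1.0):
    # Selection rule: exploit Q plus an exploration bonus from the prior and visit counts.
    return child.Q() + c * child.prior * math.sqrt(parent.N + 1) / (1 + child.N)

def mcts_iteration(root, legal_actions, step_fn, rollout_fn):
    """One search iteration: selection -> expansion -> evaluation -> backup."""
    # 1. Selection: walk down the tree picking the highest-scoring action (imaginary moves).
    path, node = [root], root
    while node.children:
        _, node = max(node.children.items(), key=lambda kv: puct_score(path[-1], kv[1]))
        path.append(node)

    # 2. Expansion: add child states, with priors from the policy network.
    actions = legal_actions(node.state)
    if actions:
        priors = policy_fn(node.state, actions)
        for a in actions:
            node.children[a] = Node(step_fn(node.state, a), prior=priors[a])

    # 3. Evaluation: value-network score v plus play-out reward r, averaged.
    v = value_fn(node.state)
    r = rollout_fn(node.state)
    score = (v + r) / 2

    # 4. Backup: propagate (v + r)/2 along the path to update visit counts and values.
    # (In a real two-player game the sign alternates between player and opponent nodes.)
    for n in path:
        n.N += 1
        n.W += score

# Toy usage: states are integers, actions move to a neighbouring integer.
legal = lambda s: [s - 1, s + 1] if abs(s) < 3 else []
step = lambda s, a: a
rollout = lambda s: random.choice([-1.0, 1.0])          # "play to the end" -> reward r
root = Node(state=0)
for _ in range(100):
    mcts_iteration(root, legal, step, rollout)
best_action = max(root.children, key=lambda a: root.children[a].N)   # most-visited action
```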