
ReinforceLearning_Note_V1

DeepReinforceLearning

This note mainly covers DQN, policy gradients, and AlphaGo / AlphaGo Zero in deep reinforcement learning.

DQN:

Use a neural network 𝑄(𝑠,𝑎;𝐰) to approximate 𝑄⋆(𝑠,𝑎), and use it to select the best (most valuable) action in the current state.

The Q-function describes the value of taking an action in the real environment.

Temporal Difference (TD) learning makes the action-value function 𝑄 more accurate over time.

    return (recursive form):   𝑈t = 𝑅t + 𝛾⋅𝑈t+1
    prediction:                𝑄(𝑠t,𝑎t;𝐰t)
    TD target:                 𝑦t = 𝑟t + 𝛾⋅max𝑎 𝑄(𝑠t+1,𝑎;𝐰t)
    loss:                      𝐿t = 1/2⋅[𝑄(𝑠t,𝑎t;𝐰t) − 𝑦t]^2
    gradient descent:          𝐰t+1 = 𝐰t − 𝛼⋅(∂𝐿t/∂𝐰)|𝐰=𝐰t

    DQN is off-policy: it can learn from transitions collected by a different behavior policy (a spirit similar to "behavior cloning"). A minimal update sketch follows.
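
Below is a minimal sketch of one DQN TD update, assuming PyTorch; the tiny network, the learning rate, and the single hand-made transition are illustrative stand-ins, not part of the original note.

    import torch
    import torch.nn as nn

    # Q(s, ·; w): a tiny stand-in Q-network (4-dim state, 2 actions).
    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)   # step size alpha
    gamma = 0.99

    def td_update(s, a, r, s_next, done):
        """One gradient-descent step on L_t = 1/2 * [Q(s_t,a_t;w) - y_t]^2."""
        q_sa = q_net(s)[a]                                   # prediction Q(s_t, a_t; w_t)
        with torch.no_grad():                                # TD target treated as a constant
            y = r + gamma * (1.0 - done) * q_net(s_next).max()
        loss = 0.5 * (q_sa - y) ** 2
        optimizer.zero_grad()
        loss.backward()                                      # dL_t / dw at w = w_t
        optimizer.step()                                     # w_{t+1} = w_t - alpha * gradient
        return loss.item()

    # Off-policy: the transition below could just as well come from a replay
    # buffer filled by an older behavior policy.
    td_update(torch.randn(4), a=0, r=1.0, s_next=torch.randn(4), done=0.0)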

Policy Gradient:

Use neural networks to approximate both the value function and the policy, so that the value estimate becomes more accurate and the policy performs better.

    𝑉𝜋(𝑠) = ∑𝑎 𝜋(𝑎|𝑠)⋅𝑄𝜋(𝑠,𝑎).  Two parts: the actor (policy) and the critic (value).
    
    policy network: 𝜋(𝑎|𝑠;𝛉) ≈ 𝜋(𝑎|𝑠);   value network: 𝑞(𝑠,𝑎;𝐰) ≈ 𝑄𝜋(𝑠,𝑎)
    
    𝑉(𝑠;𝛉,𝐰) = ∑𝑎 𝜋(𝑎|𝑠;𝛉)⋅𝑞(𝑠,𝑎;𝐰)

    𝐠(𝑎,𝛉) = [∂log𝜋(𝑎|𝑠;𝛉)/∂𝛉]⋅𝑞(𝑠,𝑎;𝐰)

    ∂𝑉(𝑠;𝛉,𝐰)/∂𝛉 = ∑𝑎 [∂𝜋(𝑎|𝑠;𝛉)/∂𝛉]⋅𝑞(𝑠,𝑎;𝐰) = ∑𝑎 𝜋(𝑎|𝑠;𝛉)⋅𝐠(𝑎,𝛉) = 𝔼𝐴~𝜋[𝐠(𝐴,𝛉)]   {because 𝜋 is a probability distribution over actions and ∂𝜋/∂𝛉 = 𝜋⋅∂log𝜋/∂𝛉}

    As for the critic, use TD learning (as in the DQN section) to update 𝐰; a combined actor-critic sketch follows below.
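
A minimal one-step actor-critic sketch of the above, assuming PyTorch; the network sizes, the one-hot action encoding for the critic input, and the toy transition are illustrative assumptions, not details from the note.

    import torch
    import torch.nn as nn

    n_states, n_actions = 4, 2
    actor = nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, n_actions))       # pi(a|s; theta)
    critic = nn.Sequential(nn.Linear(n_states + n_actions, 32), nn.ReLU(), nn.Linear(32, 1))  # q(s,a; w)
    opt_actor = torch.optim.SGD(actor.parameters(), lr=1e-3)
    opt_critic = torch.optim.SGD(critic.parameters(), lr=1e-3)
    gamma = 0.99

    def q_value(s, a):
        # Critic takes the state concatenated with a one-hot action encoding.
        return critic(torch.cat([s, torch.eye(n_actions)[a]]))[0]

    def actor_critic_step(s, r, s_next):
        probs = torch.softmax(actor(s), dim=-1)              # pi(.|s; theta)
        a = torch.multinomial(probs, 1).item()               # sample A ~ pi
        # --- actor: stochastic gradient ascent on E_A[g(A, theta)] = E_A[d log pi * q] ---
        q_sa = q_value(s, a).detach()                        # critic's estimate, treated as constant
        actor_loss = -torch.log(probs[a]) * q_sa             # minimizing this ascends V(s; theta, w)
        opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
        # --- critic: TD update of w ---
        probs_next = torch.softmax(actor(s_next), dim=-1)
        a_next = torch.multinomial(probs_next, 1).item()
        with torch.no_grad():
            y = r + gamma * q_value(s_next, a_next)          # TD target
        critic_loss = 0.5 * (q_value(s, a) - y) ** 2
        opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    actor_critic_step(torch.randn(n_states), r=1.0, s_next=torch.randn(n_states))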

AlphaGo (MCTS):

The Monte Carlo Tree Search (MCTS) used by AlphaGo and AlphaGo Zero has 4 steps (a toy sketch follows the list):
    1. Selection: The player chooses the action 𝑎 with the best score. (An imaginary action, not an actual move.)

    2. Expansion: The opponent makes an action and the state updates. (Also an imaginary action, sampled from the policy network.)

    3. Evaluation: Evaluate the new state with the value network to get score 𝑣, and play the game to the end to receive reward 𝑟. The leaf's score is (𝑣+𝑟)/2.

    4. Backup: Propagate the score (𝑣+𝑟)/2 back to action 𝑎 to update its action-value (the mean of the scores recorded for 𝑎).
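
A toy, self-contained sketch of one MCTS simulation following the 4 steps above. The "game", policy_net, value_net, and rollout_to_end below are trivial stand-ins assumed for illustration, not AlphaGo's real networks or rules.

    import math, random

    ACTIONS = [0, 1, 2]

    def policy_net(state):            # stand-in: uniform prior over actions
        return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

    def value_net(state):             # stand-in: state-value v in [-1, 1]
        return random.uniform(-1, 1)

    def rollout_to_end(state):        # stand-in: play to the end, reward r in {-1, +1}
        return random.choice([-1, 1])

    def next_state(state, action):    # stand-in state transition
        return state + (action,)

    class Node:
        def __init__(self):
            self.N = {a: 0 for a in ACTIONS}      # visit counts
            self.Q = {a: 0.0 for a in ACTIONS}    # mean action-values

    def score(node, a, prior, c=1.0):
        # Selection score: exploitation (Q) plus an exploration bonus from the prior.
        total = sum(node.N.values())
        return node.Q[a] + c * prior[a] * math.sqrt(total + 1) / (1 + node.N[a])

    def one_simulation(node, state):
        prior = policy_net(state)
        # 1. Selection: the player picks the imaginary action a with the best score.
        a = max(ACTIONS, key=lambda act: score(node, act, prior))
        s_after_a = next_state(state, a)
        # 2. Expansion: the opponent's imaginary reply is sampled from the policy network.
        opp_prior = policy_net(s_after_a)
        opp_a = random.choices(ACTIONS, weights=[opp_prior[x] for x in ACTIONS])[0]
        s_new = next_state(s_after_a, opp_a)
        # 3. Evaluation: value-network score v and rollout reward r, averaged.
        v = value_net(s_new)
        r = rollout_to_end(s_new)
        leaf_score = (v + r) / 2
        # 4. Backup: update the running mean action-value of a with the new score.
        node.N[a] += 1
        node.Q[a] += (leaf_score - node.Q[a]) / node.N[a]

    root = Node()
    for _ in range(100):
        one_simulation(root, state=())
    best_action = max(ACTIONS, key=lambda a: root.N[a])   # actual move: the most-visited action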