SARSA

Created: 2022-06-16T20:58:24-05:00

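for every episode:
    initialize state S
    sample an action A from the policy (e.g. epsilon-greedy on Q)
    for each step in the episode until the end of the episode:
        take action A and observe reward R and new state S'
        sample the next action A' from the same policy at S'
        update the expected reward for the state->action pair:
            Q(S, A) <- Q(S, A) + alpha * [R + k * Q(S', A') - Q(S, A)]
        set S <- S' and A <- A'
where alpha (the learning rate) and k (the discount factor, usually written gamma) are tuning variables between 0 and 1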
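
A minimal sketch of the loop above in Python, to make the update concrete. Everything besides the update rule is an assumption for illustration: the 5-state chain environment (chain_step), the epsilon-greedy policy with random tie-breaking, and the default values for alpha, k, and epsilon are hypothetical and do not come from the note.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [Q[(state, a)] for a in range(n_actions)]
    best = max(values)
    # Break ties randomly so an all-zero table does not always pick action 0.
    return rng.choice([a for a, v in enumerate(values) if v == best])


def chain_step(state, action, n_states=5):
    """Hypothetical toy environment: action 1 moves right, action 0 moves left.

    Reaching the last state ends the episode with reward 1; all else gives 0.
    """
    if action == 1:
        next_state = min(state + 1, n_states - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done


def sarsa(episodes=500, alpha=0.1, k=0.9, epsilon=0.1, max_steps=100, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0
    for _ in range(episodes):
        state = 0                                          # initialize state
        action = epsilon_greedy(Q, state, 2, epsilon, rng)  # sample an action
        for _ in range(max_steps):
            next_state, reward, done = chain_step(state, action)
            next_action = epsilon_greedy(Q, next_state, 2, epsilon, rng)
            # A terminal state has no future value, so drop the k * Q(S', A') term.
            target = reward if done else reward + k * Q[(next_state, next_action)]
            # Q(S, A) <- Q(S, A) + alpha * [target - Q(S, A)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
            if done:
                break
    return Q


if __name__ == "__main__":
    Q = sarsa()
    for s in range(4):
        print(s, Q[(s, 0)], Q[(s, 1)])
```

The update uses A', the action actually sampled at S', rather than the best action at S'. That is what makes this on-policy; replacing Q(S', A') with the maximum over actions at S' would give Q-learning instead.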

Medium.com article