Created: 2022-06-16T20:58:24-05:00
for every episode:
    initialize state S
    sample an action A
    for each step in the episode, until the end of the episode:
        take action A and observe reward R and new state S'
        sample an action A' from S'
        update expected reward from state->action to:
            Q(S, A) <- Q(S, A) + alpha[R + k Q(S', A') - Q(S, A)]
        S <- S', A <- A'
where alpha (the learning rate) and k (the discount factor) are tuning variables
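
The loop above (the on-policy update usually called SARSA) can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation. Everything beyond the update rule is an assumption made for the example: the five-state chain environment `chain_step`, the epsilon-greedy sampler `epsilon_greedy`, the `epsilon` and `max_steps` parameters, and all default values. The names `alpha` and `k` follow the note.

```python
import random


def epsilon_greedy(q, state, n_actions, epsilon, rng):
    """Sample an action: random with probability epsilon, else greedy (ties broken at random)."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [q[(state, a)] for a in range(n_actions)]
    best = max(values)
    return rng.choice([a for a, v in enumerate(values) if v == best])


def chain_step(state, action, n_states):
    """Hypothetical toy environment: action 1 moves right, action 0 moves left.

    Reaching the last state gives reward 1 and ends the episode.
    """
    if action == 1:
        next_state = min(state + 1, n_states - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == n_states - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def sarsa(n_episodes=500, n_states=5, n_actions=2,
          alpha=0.1, k=0.9, epsilon=0.1, max_steps=100, seed=0):
    rng = random.Random(seed)
    # Q(S, A) table, initialized to zero for every state-action pair.
    q = {(s, a): 0.0 for s in range(n_states) for a in range(n_actions)}

    for _ in range(n_episodes):
        state = 0                                                    # initialize state
        action = epsilon_greedy(q, state, n_actions, epsilon, rng)   # sample an action

        for _ in range(max_steps):
            # Take action, observe reward and new state.
            next_state, reward, done = chain_step(state, action, n_states)
            # Sample the next action from the new state.
            next_action = epsilon_greedy(q, next_state, n_actions, epsilon, rng)

            # Q(S,A) <- Q(S,A) + alpha[R + k Q(S',A') - Q(S,A)]
            # At a terminal state there is no future value, so the target is just R.
            target = reward if done else reward + k * q[(next_state, next_action)]
            q[(state, action)] += alpha * (target - q[(state, action)])

            state, action = next_state, next_action
            if done:
                break

    return q


if __name__ == "__main__":
    q = sarsa()
    for s in range(5):
        print(s, [round(q[(s, a)], 3) for a in range(2)])
```

Because the update uses the action A' that was actually sampled, rather than the best available action in S', the learned values reflect the exploring policy itself. In this toy chain, the value of moving right from the state next to the goal should approach 1 as episodes accumulate.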