Consider an unknown MDP with three states (A, B and C) and two actions (← and →). Suppose the agent chooses actions acc

Post by **answerhappygod** » Fri May 20, 2022 4:49 pm

Consider an unknown MDP with three states (A, B and C) and two actions (← and →). Suppose the agent chooses actions according to some policy in the unknown MDP, collecting a dataset consisting of samples $(s, a, s', r)$ representing taking action $a$ in state $s$ resulting in a transition to state $s'$ and a reward of $r$.

[Sample table with columns $s$, $a$, $s'$, $r$. The rows were flattened in extraction: the surviving state entries read "B A C B A B C B" and the surviving rewards read "4 4 -4 6". The action arrows are not legible here or in the items below, where they are shown as "?".]

You may assume a discount factor of $\gamma = 1$.

1. Recall the update function of Q-learning is:

$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a') \right)$$

Assume that all Q-values are initialized to 0, and use a learning rate of $\alpha$ = [value missing in the source].

(a) Run Q-learning on the above episode table and fill in the following Q-values:

Q(A, ?) =
Q(B, ?) =

(b) After running Q-learning and producing the above Q-values, you construct a policy $\pi_Q$ that maximizes the Q-value in a given state: $\pi_Q(s) = \arg\max_a Q(s, a)$. What are the actions chosen by the policy in states A and B?

$\pi_Q(A)$ is equal to:
○ $\pi_Q(A)$ = ←
○ $\pi_Q(A)$ = →
○ $\pi_Q(A)$ = Undefined

$\pi_Q(B)$ is equal to:
○ $\pi_Q(B)$ = ←
○ $\pi_Q(B)$ = →
○ $\pi_Q(B)$ = Undefined

2. Use the empirical frequency count model-based reinforcement learning method described in lectures to estimate the transition function $\hat{T}(s, a, s')$ and reward function $\hat{R}(s, a, s')$. (Do not use pseudocounts; if a transition is not observed, it has a count of 0.) Write down the following quantities. You may write N/A for undefined quantities.

$\hat{T}$(A, ?, B) =        $\hat{R}$(A, ?, B) =
$\hat{T}$(C, ?, B) =        $\hat{R}$(C, ?, B) =
$\hat{T}$(B, ?, A) =        $\hat{R}$(B, ?, A) =
$\hat{T}$(B, ?, A) =        $\hat{R}$(B, ?, A) =
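
Both parts of the question can be worked mechanically, so here is a minimal Python sketch of the two procedures. It is only an illustration under stated assumptions: the `SAMPLES` list is hypothetical (it is not the table from the question), the action labels `"L"` and `"R"` stand in for the two arrows, and the learning rate of 0.5 is an assumed value. The model estimate follows one common reading of the empirical frequency count method, namely $\hat{T}(s,a,s') = N(s,a,s') / N(s,a)$ and $\hat{R}(s,a,s')$ as the average observed reward for that transition; the lectures referred to in the question may define it differently.

```python
from collections import defaultdict

# Hypothetical samples (s, a, s_next, r). These are NOT the rows of the
# table in the question; "L" and "R" stand in for the two arrow actions.
SAMPLES = [
    ("A", "R", "B", 2.0),
    ("C", "L", "B", 2.0),
    ("B", "R", "C", -2.0),
    ("A", "R", "B", 4.0),
]
ACTIONS = ("L", "R")


def q_learning(samples, alpha, gamma=1.0):
    """Apply one tabular Q-learning update per sample, in the given order."""
    q = defaultdict(float)  # every Q-value starts at 0
    for s, a, s_next, r in samples:
        target = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
    return q


def greedy_action(q, state):
    """Return the arg-max action, or None when the maximum is tied."""
    values = [q[(state, a)] for a in ACTIONS]
    best = max(values)
    if values.count(best) > 1:
        return None  # tie: the greedy policy is undefined here
    return ACTIONS[values.index(best)]


def estimate_model(samples):
    """Build count-based estimates; returns lookup functions (t_hat, r_hat)."""
    n_sa = defaultdict(int)     # N(s, a)
    n_sas = defaultdict(int)    # N(s, a, s')
    r_sum = defaultdict(float)  # total reward seen on (s, a, s')
    for s, a, s_next, r in samples:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s_next)] += 1
        r_sum[(s, a, s_next)] += r

    def t_hat(s, a, s_next):
        # Undefined (None) if action a was never tried in state s.
        if n_sa.get((s, a), 0) == 0:
            return None
        return n_sas.get((s, a, s_next), 0) / n_sa[(s, a)]

    def r_hat(s, a, s_next):
        # Undefined (None) if this exact transition was never observed.
        count = n_sas.get((s, a, s_next), 0)
        if count == 0:
            return None
        return r_sum[(s, a, s_next)] / count

    return t_hat, r_hat


if __name__ == "__main__":
    q = q_learning(SAMPLES, alpha=0.5)  # alpha = 0.5 is an assumed value
    print(q[("A", "R")], q[("B", "R")])
    print(greedy_action(q, "A"), greedy_action(q, "B"))
    t_hat, r_hat = estimate_model(SAMPLES)
    print(t_hat("A", "R", "B"), r_hat("A", "R", "B"))
```

Returning `None` separates the two "undefined" cases from a genuine estimate of 0: a transition estimate is undefined when the state-action pair never appears in the data, while a reward estimate is undefined whenever the specific transition has a count of 0. To apply the sketch to the actual question, replace `SAMPLES` and `alpha` with the values from the original table.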