Consider an unknown MDP with three states (A, B and C) and two actions (← and →). Suppose the agent chooses actions acc



Post by answerhappygod »

Consider an unknown MDP with three states (A, B and C) and two actions (← and →). Suppose the agent chooses actions according to some policy in the unknown MDP, collecting a dataset consisting of samples (s, a, s', r) representing taking action a in state s resulting in a transition to state s' and a reward of r.

[Episode table with columns s, a, s', r. The rows did not survive extraction from the attached image; the surviving fragment reads: - B A C B A B C B 4 4 -4 6]

You may assume a discount factor of γ = 1.

1. Recall the update function of Q-learning is:

   Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a') \right)

   Assume that all Q-values are initialized to 0, and use a learning rate of α = [value missing in the source].

   (a) Run Q-learning on the above episode table and fill in the following Q-values:

       Q(A, [?]) =        Q(B, [?]) =

   (b) After running Q-learning and producing the above Q-values, you construct a policy π_Q that maximizes the Q-value in a given state: π_Q(s) = arg max_a Q(s, a). What are the actions chosen by the policy in states A and B?

       π_Q(A) is equal to:   ○ π_Q(A) = ←   ○ π_Q(A) = →   ○ π_Q(A) = Undefined
       π_Q(B) is equal to:   ○ π_Q(B) = ←   ○ π_Q(B) = →   ○ π_Q(B) = Undefined

2. Use the empirical frequency count model-based reinforcement learning method described in lectures to estimate the transition function T̂(s, a, s') and reward function R̂(s, a, s'). (Do not use pseudocounts; if a transition is not observed, it has a count of 0.) Write down the following quantities. You may write N/A for undefined quantities.

       T̂(A, [?], B) =        R̂(A, [?], B) =
       T̂(C, [?], B) =        R̂(C, [?], B) =
       T̂(B, [?], A) =        R̂(B, [?], A) =
       T̂(B, [?], A) =        R̂(B, [?], A) =

(Action arrows shown as [?] are illegible in the source text.)
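For anyone working through this, here is a minimal Python sketch of the two procedures the question asks about: the tabular Q-learning update applied to samples in order, and the empirical frequency count estimates of the transition and reward functions. It is not the solution to the question. The sample list, the learning rate of 0.5, the action names "left"/"right", and all function names are assumptions made for illustration, because the episode table and learning rate are not legible in the post. Treating a tie between actions as "Undefined" is also an assumption; check it against the convention used in your lectures.

```python
from collections import defaultdict

ACTIONS = ("left", "right")  # stand-ins for the two arrow actions

def q_learning(samples, alpha, gamma=1.0):
    """Apply the Q-learning update once per sample, in order. Q starts at 0."""
    q = defaultdict(float)
    for s, a, s_next, r in samples:
        target = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
    return q

def greedy_action(q, state):
    """Return the arg-max action, or None on a tie (assumed to mean 'Undefined')."""
    best = max(q[(state, a)] for a in ACTIONS)
    winners = [a for a in ACTIONS if q[(state, a)] == best]
    return winners[0] if len(winners) == 1 else None

def estimate_model(samples):
    """Empirical frequency counts, no pseudocounts. None stands for N/A."""
    n_sas = defaultdict(int)      # count of (s, a, s')
    n_sa = defaultdict(int)       # count of (s, a)
    r_sum = defaultdict(float)    # summed reward for (s, a, s')
    for s, a, s_next, r in samples:
        n_sas[(s, a, s_next)] += 1
        n_sa[(s, a)] += 1
        r_sum[(s, a, s_next)] += r

    def t_hat(s, a, s_next):
        total = n_sa.get((s, a), 0)
        return n_sas.get((s, a, s_next), 0) / total if total else None

    def r_hat(s, a, s_next):
        count = n_sas.get((s, a, s_next), 0)
        return r_sum[(s, a, s_next)] / count if count else None

    return t_hat, r_hat

# Hypothetical samples (s, a, s', r); NOT the table from the question.
SAMPLES = [
    ("A", "right", "B", 1.0),
    ("B", "right", "C", -1.0),
    ("A", "right", "B", 3.0),
]

q = q_learning(SAMPLES, alpha=0.5)  # alpha = 0.5 is an assumed value
t_hat, r_hat = estimate_model(SAMPLES)

print(q[("A", "right")], q[("B", "right")])
print(greedy_action(q, "A"), greedy_action(q, "B"), greedy_action(q, "C"))
print(t_hat("A", "right", "B"), r_hat("A", "right", "B"))
print(t_hat("B", "left", "A"), r_hat("B", "right", "A"))
```

Note that in this sketch an action that was never tried keeps its initial value of 0, so it can beat a tried action whose Q-value went negative. An (s, a) pair that was never observed gives N/A for the transition estimate, while an observed (s, a) pair with an unobserved s' gives a transition estimate of 0 and an N/A reward estimate.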