Question 7 (Reinforcement Learning (18 pts)) Consider the MDP shown below. There are five states {A, B, C, D, T} in MDP,

Post by **answerhappygod** » Wed May 11, 2022 7:17 pm


Question 7 (Reinforcement Learning (18 pts)) Consider the MDP shown below. There are five states {A, B, C, D, T} in the MDP, where T is the target state, and two actions {a1, a2}. The two numbers on each transition give the probability of moving to the next state and the reward of that transition, respectively. For example, if the agent takes action a1 at state A, it ends up at state B with probability 0.8 and is rewarded -10, and with probability 0.2 it moves to state C and is rewarded -10.

[Figure: MDP transition diagram over states A, B, C, D, T with actions a1 and a2; each edge is labeled (probability, reward). Edge labels recovered from the extraction, in reading order and without their edges: (0.8, -10), (0.1, -10), (0.4, -6), (0.2, -10), (0.9, -10), (1, -10), (1, -10), (0.6, -8), (1, 100), (1, -12).]
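One way to read the (probability, reward) edge labels is as a lookup table keyed by (state, action). Below is a minimal sketch in Python that encodes only the single transition spelled out in the example above. The dictionary layout and the names `transitions`, `expected_reward`, and `total_probability` are illustrative assumptions, not part of the question.

```python
# Hypothetical encoding: (state, action) -> list of (next_state, probability, reward).
# Only the (A, a1) entry is stated in the question text; all other edges are omitted.
transitions = {
    ("A", "a1"): [("B", 0.8, -10.0), ("C", 0.2, -10.0)],
}


def expected_reward(state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for _, p, r in transitions[(state, action)])


def total_probability(state, action):
    """Sum of outgoing probabilities for a (state, action) pair; should be 1."""
    return sum(p for _, p, _ in transitions[(state, action)])


print(expected_reward("A", "a1"))
print(total_probability("A", "a1"))
```

Because both outcomes of (A, a1) carry the same reward, the expected immediate reward equals that reward regardless of the 0.8 / 0.2 split.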

a) For a policy π which always takes action a1 at every state, write down the Bellman recursive value function for each state, i.e., V^π(A), V^π(B), V^π(C), V^π(D), V^π(T), and compute the final state values when γ = 1.
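For reference, here is the Bellman expectation equation for a deterministic policy in its standard textbook form, followed by the instance for state A built from the one transition the question states explicitly. The symbols P (transition probability) and R (transition reward) are notation introduced here, and treating T as a terminal state with value zero is the usual convention, assumed rather than stated in the question.

```latex
\begin{align*}
V^{\pi}(s) &= \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)
              \Bigl[ R\bigl(s, \pi(s), s'\bigr) + \gamma V^{\pi}(s') \Bigr], \\
V^{\pi}(A) &= 0.8\,\bigl[-10 + \gamma V^{\pi}(B)\bigr]
            + 0.2\,\bigl[-10 + \gamma V^{\pi}(C)\bigr], \\
V^{\pi}(T) &= 0 \quad \text{(assumed: terminal state)}.
\end{align*}
```

The equations for B, C, and D follow the same pattern, using the a1 edges leaving each state in the diagram.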

b) Consider a random policy which uniformly selects actions at each state (the probability of taking each of the two actions under this policy is 1/2). Apply one iteration of the Value Iteration algorithm (one-step policy evaluation followed by policy greedification) on this MDP with γ = 1 and show the new improved policy.
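The two steps named in part b) can be sketched in Python as one backup under the uniform policy followed by a greedy action choice. Since the exam's diagram is not recoverable from the text, the sketch runs on a small toy MDP that is entirely hypothetical (states `s0`, `s1`, `end`). The zero initial values and all function names are assumptions as well.

```python
# Toy MDP (hypothetical, NOT the exam's diagram):
# (state, action) -> list of (next_state, probability, reward)
P = {
    ("s0", "a1"): [("s1", 1.0, -1.0)],
    ("s0", "a2"): [("end", 1.0, -5.0)],
    ("s1", "a1"): [("end", 1.0, 10.0)],
    ("s1", "a2"): [("s0", 1.0, -1.0)],
}
states = ["s0", "s1", "end"]
actions = ["a1", "a2"]
terminal = {"end"}
gamma = 1.0


def q_value(V, s, a):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)])


def evaluate_uniform_one_step(V):
    """One synchronous backup under the uniform random policy (each action 1/2)."""
    new_V = {}
    for s in states:
        if s in terminal:
            new_V[s] = 0.0  # terminal states keep value 0
        else:
            new_V[s] = sum(q_value(V, s, a) for a in actions) / len(actions)
    return new_V


def greedy_policy(V):
    """Greedification: pick the action with the highest one-step lookahead value."""
    return {
        s: max(actions, key=lambda a: q_value(V, s, a))
        for s in states
        if s not in terminal
    }


V0 = {s: 0.0 for s in states}      # assumed initial values
V1 = evaluate_uniform_one_step(V0)
policy = greedy_policy(V1)
print(V1)
print(policy)
```

To apply the same functions to the exam's MDP, `P` would be replaced by the edges read off the original diagram.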

c) Consider the following episode generated by an arbitrary policy π. Assume the current state values are V^π(A) = 0, V^π(B) = 5, V^π(C) = 2, V^π(D) = 10, and V^π(T) = 0. Please i) write down the Temporal Difference (TD) evaluation equation for updating the values of states, and ii) compute the final values after processing the episode shown in the figure with γ = α = 1.

[Figure: episode trajectory. Labels recovered from the extraction, in order: r = -6; A, a1, r = -10; B, a2, r = -10; a2, r = 100. Some state labels did not survive.]