3 Reinforcement Learning Background 1. Each True/False question is worth 1 points. Briefly justify your answers. (1) [tr

Post by **answerhappygod** » Fri May 20, 2022 4:50 pm


3 Reinforcement Learning Background

1. Each True/False question is worth 1 point. Briefly justify your answers.

(i) [true or false] Temporal difference learning is an offline learning method.

(ii) [true or false] Q-learning: Using an optimal exploration function can lead to a chance of regret while learning the optimal policy.

(iii) [true or false] In a deterministic MDP (i.e. one in which each state / action leads to a single deterministic next state), the Q-learning update with a learning rate of α = 1 will correctly learn the optimal Q-values (assume that all state/action pairs are visited sufficiently often).

(iv) [true or false] A large discount (close to 1) encourages greedy behavior.

(v) [true or false] A large, negative living reward (< 0) encourages greedy behavior.

(vi) [true or false] A negative living reward can always be expressed using a discount < 1.

(vii) [true or false] A discount < 1 cannot always be expressed as a negative living reward.

2. This question considers properties of reinforcement learning algorithms for arbitrary discrete MDPs.

(a) Select all of the following methods which, at convergence, do not provide enough information to obtain an optimal policy. (Assume adequate exploration.)

- Model-based learning of T(s, a, s') and R(s, a, s').
- Direct Evaluation to estimate U(s).
- Temporal Difference learning to estimate U(s).
- Q-Learning to estimate Q(s, a).

(b) In the limit of infinite timesteps, select all of the following exploration policies for which Q-learning is guaranteed to converge to the optimal Q-values for all states. (You may assume the learning rate α is chosen appropriately, and that the MDP is ergodic: i.e., every state is reachable from every other state with non-zero probability.)

- A fixed policy taking actions uniformly at random.
- A greedy policy.
- An ε-greedy policy.
- A fixed optimal policy.
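
For anyone who wants to experiment with the setting in 1(iii), here is a minimal sketch of tabular Q-learning with a learning rate of 1 on a small deterministic MDP. The 4-state chain, its rewards, the discount of 0.9, and every name in the code (`step`, `q_learning`, `N_STATES`, and so on) are assumptions made up for illustration; none of them come from the original question. Exploration is uniformly random, which is one of the policies listed in 2(b). It is meant as something to play with, not as an answer key.

```python
import random

# Hypothetical deterministic chain MDP (an assumption, not from the question):
# states 0..3, state 3 is terminal. Actions: 0 = left, 1 = right.
# Entering the terminal state yields reward 1.0; every other step yields 0.
N_STATES = 4
TERMINAL = 3
ACTIONS = (0, 1)
GAMMA = 0.9  # assumed discount factor


def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def q_learning(episodes=200, alpha=1.0, seed=0):
    """Tabular Q-learning with uniformly random exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # terminal row stays at 0
    for _ in range(episodes):
        state = 0
        while state != TERMINAL:
            action = rng.choice(ACTIONS)
            next_state, reward = step(state, action)
            # Standard Q-learning target: r + gamma * max_a' Q(s', a')
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = q_learning()
# Greedy action in each non-terminal state under the learned Q-values.
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(TERMINAL)]
for s in range(TERMINAL):
    print(s, q[s], "greedy action:", greedy[s])
```

Changing `alpha`, `GAMMA`, or the reward in `step` (for example, adding a negative living reward) is a quick way to check intuitions for items (iv) through (vii) on this toy problem.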