Notes on Reinforcement Learning: Overview and MDP

Sep 26, 2017

Reinforcement Learning Overview

RL is a sub-field of Machine Learning concerned with agents that learn to make sequential decisions in unknown environments. The agent learns by interacting with the environment and observing occasional reward (or reinforcement) signals, rather than from direct supervision.

Markov Decision Process

MDP: The single-agent RL problem is usually formalized using the classical concept of a Markov Decision Process (MDP). An MDP is characterized by the tuple $\langle S, A, R(s,a), T(s,a,s') \rangle$. Consider a stochastic but stationary environment with a finite number of states; $S$ denotes the set of all states in the environment. An agent interacting with this environment can take an action from the set $A$ of available actions in any state $s$. The probability of transitioning from a state $s$ to the next state $s'$ after the agent has taken an action $a$ is given by the transition function $T(s,a,s')$. $T$ represents the rules of progression, or the physics, of the environment. Notice the Markovian assumption here, i.e. the probability of transitioning to the next state depends only on the current state and action. Thus the agent's actions have consequences in the environment, as they influence the probability distribution over the next state. $R(s,a)$ represents the scalar reward that the agent receives for taking action $a$ in state $s$. The process of observing a state, taking an action, receiving the reward and moving to the next state keeps repeating till the end of the episode.

Consider the Frozen Lake environment in OpenAI Gym. Let us play with this environment to get the hang of the MDP concept. Install gym from here and get started.
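A minimal sketch of one episode of interaction, assuming the 2017-era gym API and the FrozenLake-v0 environment id:

```python
import gym

# Make the Frozen Lake environment: 16 states (a 4x4 grid), 4 actions,
# and stochastic ("slippery") transitions.
env = gym.make('FrozenLake-v0')

state = env.reset()      # observe the initial state s
env.render()             # print the lake grid

for t in range(10):
    action = env.action_space.sample()                  # pick a random action a
    next_state, reward, done, info = env.step(action)   # s' ~ T(s, a, .), reward R(s, a)
    print("s = {}, a = {}, r = {}, s' = {}".format(state, action, reward, next_state))
    state = next_state
    if done:             # reached a terminal state (a hole or the goal)
        break
```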

You can explore this simple MDP and observe the stochastic transitions between states as the agent takes various actions. Now that you understand the problem setup, we can start thinking about how to solve it.

Solving an MDP: There exist many algorithms for solving an MDP. Let us first discuss what it means to solve an MDP. The goal of the agent in an MDP is to maximize the long-term expected reward. The solution of an MDP is a policy, which is a mapping from every state to a specific action. The goal is to learn an optimal policy, one which guarantees maximum expected reward. This is what makes RL different from supervised learning. We don't have any kind of supervision here that gives us examples of good state-action pairs. We have to learn this mapping solely based on the reward signal, which is often delayed: for a good action taken right now, you might not see any payoff immediately, but it might put you in subsequent states that eventually lead to high reward. Figuring out which actions deserve credit for rewards that arrive later is known as the problem of temporal credit assignment.
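To make the notion of a policy concrete, here is a small sketch, again for FrozenLake-v0, that represents a deterministic policy as an array with one action per state and follows it for an episode. The random policy below is just a placeholder for illustration, not an optimal one:

```python
import gym
import numpy as np

env = gym.make('FrozenLake-v0')

# A deterministic policy: one action per state.
# FrozenLake-v0 has 16 states and 4 actions (0=Left, 1=Down, 2=Right, 3=Up).
policy = np.random.randint(env.action_space.n, size=env.observation_space.n)

# Follow the policy for one episode and accumulate the reward.
state = env.reset()
total_reward, done = 0.0, False
while not done:
    state, reward, done, _ = env.step(policy[state])
    total_reward += reward
print("episode return:", total_reward)
```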

Note, at this point, the difference between a plan and a policy. A plan is a sequence of actions that the agent decides to execute at the beginning of the MDP, while a policy just describes what action to take if and when the agent reaches a certain state. Since we are dealing with probabilistic transitions, executing a plan might not be the best idea: actions don't always have their intended effects in a stochastic world, and as soon as the world throws the agent off its plan into an unexpected state, the plan has nothing to say about how to proceed.
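To see this concretely, here is a small sketch (again FrozenLake-v0) that blindly executes a plan, a fixed action sequence that would reach the goal if every action had its intended effect. Because the ice is slippery, the agent usually ends up somewhere else, and the plan says nothing about what to do from there:

```python
import gym

env = gym.make('FrozenLake-v0')

# A plan: a fixed action sequence chosen up front (0=Left, 1=Down, 2=Right, 3=Up).
# On the 4x4 lake this sequence would reach the goal *if* transitions were deterministic.
plan = [1, 1, 2, 1, 2, 2]

state = env.reset()
reward, done = 0.0, False
for action in plan:
    state, reward, done, _ = env.step(action)
    if done:
        break

# With slippery transitions the episode often ends away from the goal (state 15),
# or falls into a hole before the plan is even finished.
print("ended in state {} with reward {}".format(state, reward))
```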

Another important point: assigning rewards when designing an MDP to describe a real environment is a way of injecting our domain knowledge into the system. The reward function should be designed very carefully, as slightly different reward structures can induce very different agent behavior.

Horizon: Okay so, here we are dealing with infinite horizons. What that means is that there is no fixed duration for the sequence of time-steps till the end of the process. There may be terminal states such that the process ends upon reaching them, but the time to reach those states is indefinite. In essence, the agent can live forever and has infinite time to work with until it reaches a terminal state. It is worth appreciating that dropping the infinite-horizon assumption can lead to drastically different agent strategies. For example, in a finite-horizon, shorter version of the Frozen Lake environment, it might make more sense to take the quicker, riskier route instead of the safer, longer one, as you don't have enough time for the latter. More importantly, in the finite-horizon case the policy will be time-dependent, i.e. you might want to take different actions from the same state depending on how much time is left before the end of the process. The policy no longer remains stationary as in the infinite-horizon case.
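One way to make this precise, using a bit of notation the post hasn't introduced yet (a horizon $H$, a policy $\pi$, and time-indexed states and actions $s_t, a_t$): in the finite-horizon case the agent maximizes

$$\max_{\pi} \; \mathbb{E}\Big[\sum_{t=0}^{H} R(s_t, a_t)\Big], \qquad a_t = \pi(s_t, t),$$

where the policy is allowed to depend on the time step $t$ (equivalently, on how many steps remain). In the infinite-horizon case there is no remaining time to condition on, so an optimal policy can be chosen to be stationary, $a_t = \pi(s_t)$.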

Alright! This gives us a pretty good understanding of the formal set-up of a Markov Decision Process. As I mentioned in the beginning, this is the most popular framework for many RL tasks. But it is to be remembered that this is not the only framework. There exist scenarios where it is not sufficient to model all the complexity of the environment, for example, partially observed states or multi-agent environments. Moreover, other frameworks exist, such as predictive state representations, which provide a non-Markov way of representing dynamical systems. Different formalisms have their own pros and cons. Each one lends itself naturally to a certain kind of environment and facilitates learning in that environment where others don't. Choosing a good representation should itself be seen as part of the solution to the general problem of creating intelligent agents. That said, MDPs are very successful in modeling many scenarios and are thus a great place to start building a deeper understanding of RL.

I hope you enjoyed this introduction and are now excited to learn some algorithms to solve the MDP already! Let's do that in the next post.
