It’s been a couple of weeks since my colleague and I started a project on Kaggle called Hungry Geese. That led us to a course, a bunch of YouTube videos, books, and so on. Right now, I want to summarise what I have understood so far about Reinforcement Learning.
General description: it’s a method in which an agent is motivated by rewards to do better over time.
A reward Rt is a scalar feedback signal that indicates how well the agent is doing at step t. The agent’s job is to maximize cumulative reward.
For instance, suppose the agent is a helicopter and its job is to fly a maneuver:
Positive reward for following the desired trajectory.
Negative reward for crashing.
Another example: the agent plays Backgammon, and its job is to defeat the world champion:
Positive reward for winning a game.
Negative reward for losing a game.
The goal of an agent in Reinforcement Learning is to select actions that maximize total future reward. This should not be confused with being greedy! It is long-term decision making: the agent may sometimes give up reward now in order to gain more in the future.
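One common way to formalise “total future reward” is a discounted sum. Here is a tiny sketch; the discount factor gamma and the reward sequences are made-up illustrative values, just to show why the greedy choice can lose out:

```python
def discounted_return(rewards, gamma=0.9):
    """Total future reward: R1 + gamma*R2 + gamma^2*R3 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A greedy agent prefers the immediate reward in [1, 0, 0],
# but [0, 0, 5] yields a higher return even after discounting.
print(discounted_return([1, 0, 0]))  # 1.0
print(discounted_return([0, 0, 5]))  # 4.05
```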
Take a look at the figure below. At each step t, the environment gives an observation and a reward to the agent, and the agent decides what action to take to maximize the total reward.

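As a rough sketch of that loop in code, assuming a made-up ToyEnv and RandomAgent (not the Hungry Geese environment or any particular library):

```python
import random

class ToyEnv:
    """Made-up environment: the state is a step counter, the reward is random."""
    def reset(self):
        self.t = 0
        return self.t                          # initial observation

    def step(self, action):
        self.t += 1
        reward = random.choice([-1, 0, 1])     # scalar feedback signal Rt
        done = self.t >= 10                    # episode ends after 10 steps
        return self.t, reward, done            # next observation, reward, done flag

class RandomAgent:
    """Picks actions at random; a real agent would use the observation."""
    def act(self, observation):
        return random.choice(["left", "right"])

env, agent = ToyEnv(), RandomAgent()
obs, total_reward, done = env.reset(), 0, False
while not done:
    action = agent.act(obs)                    # agent chooses an action At
    obs, reward, done = env.step(action)       # environment returns the next St and Rt
    total_reward += reward                     # the agent's job: maximize this sum
print("total reward:", total_reward)
```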
Now, let’s break down the figure above:
History and State
The history Ht = S1, R1, A1, …, At-1, St, Rt is the sequence of everything seen so far, and it is the information used to determine what happens next. The state is a function of the history: St = f(Ht).
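As a small illustration (the tuples below are made-up values), the state can be any function of the history; keeping only the latest observation is one possible choice of f:

```python
# History Ht: everything seen so far, stored as (state, action, reward) tuples.
history = [(0, "left", -1), (1, "right", 0), (2, "left", 1)]  # made-up values

def state_from_history(history):
    """St = f(Ht): here f simply keeps the latest observation."""
    latest_state, _, _ = history[-1]
    return latest_state

print(state_from_history(history))  # 2
```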
Information State (AKA, Markov State)
An information state contains all the useful information from the history.
A state St is Markov if and only if P[St+1 | St] = P[St+1 | S1, …, St].
We might say the future is independent of the past given the present.
H1:t → St → Ht+1:∞
Once the state is known, the history may be thrown away.
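A minimal sketch of that idea, using a made-up two-state weather chain: the next state is sampled from the current state alone, so keeping the whole history adds nothing.

```python
import random

# Transition probabilities P[St+1 | St] for a toy Markov chain (made-up numbers).
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(state):
    """Sample St+1 using only the current state; the history is irrelevant."""
    states, probs = zip(*P[state].items())
    return random.choices(states, weights=probs)[0]

state, history = "sunny", []
for _ in range(5):
    history.append(state)        # we could keep the history, but we never need it
    state = next_state(state)    # only the present state matters
print(history, "->", state)
```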
Components of RL (Reinforcement Learning)
Policy (behavior function)
Value function (how good is each state or action)
Model (representation of the environment)
A policy is a map from state to action; it can be deterministic or stochastic.
The value function is a prediction of future reward. In other words, it evaluates how good or bad each state is, and is therefore used to choose between actions.
The model is the agent’s representation of the environment: it predicts the next state (the dynamics) and the next (immediate) reward.
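To tie the three components together, here is an illustrative sketch; the states, actions, probabilities, and rewards are all made up:

```python
import random

# Policy: a map from state to action (stochastic here: a distribution over actions).
policy = {
    "start":  {"left": 0.3, "right": 0.7},
    "middle": {"left": 0.5, "right": 0.5},
}

def act(state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

# Value function: a prediction of future reward for each state.
value = {"start": 0.2, "middle": 1.5, "goal": 0.0}

# Model: predicts the next state (dynamics) and the immediate reward.
model = {
    ("start", "left"):   ("start", 0.0),
    ("start", "right"):  ("middle", 0.0),
    ("middle", "left"):  ("start", 0.0),
    ("middle", "right"): ("goal", 1.0),
}

state = "start"
action = act(state)                       # the policy picks an action
next_s, reward = model[(state, action)]   # the model predicts what happens next
print(action, "->", next_s, "reward:", reward, "value:", value[next_s])
```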
Finally, there are two fundamental problems in sequential decision making: planning, where the environment is known and the agent works with a model of it, and reinforcement learning, where the environment is initially unknown and the agent learns by interacting with it.