Learning/Paradigm/Reinforcement

Reinforcement Learning Algorithms

Learning from success and failure, reward and punishment, via trial and error. This is a popular model for agents that learn behavior from experience, and the most general of the three learning paradigms (supervised, unsupervised, and reinforcement). Rather than being told what to do and what not to do, this type of agent must learn from reinforcements such as reward and punishment.

Reinforcement learning is concerned with interacting with a dynamic environment; it is an abstract class of problems rather than a concrete technique. It is an active area of research in AI, and more mathematical results exist here than in most other areas of AI.

The origin of the word behave: to be a certain way in order to have a certain thing.

Note that not all reinforcement learning agents use a model of their environment.



Introduction

While there are a few different types of learning, reinforcement learning normally helps adjust our physical actions and motor skills. The actions an organism performs result in feedback, which in turn is translated into a negative or positive reward for that action. As a baby learns to walk, it may fall down and experience a little pain. This negative feedback helps the baby learn what not to do. If the baby manages to stay standing for some time, then it is doing something right, and achieving that goal is positive feedback. Being able to reach something it desired may be another. As the baby continues to try to walk, it develops motor coordination in such a way that reward is maximized. Pain, hunger, thirst, and pleasure are some examples of natural reinforcements. Actions can either result in immediate reward or be part of a longer chain of actions that eventually leads to reward.

Reinforcement learning is learning from interaction with the environment. Here we have a temporally situated, continually learning and planning agent in a stochastic and uncertain environment...



The Value-Function

The value-function is a mapping from states to real numbers, where the value of a state represents the long-term reward achieved starting from that state, and executing a particular policy.
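As a rough illustration of this definition, the sketch below computes the value of each state under one fixed policy by following that policy and summing the rewards it collects. It borrows the two-state world from the Example section of this page. The discount factor of 0.9, the truncation at 500 steps, and the names `state_value` and `stay_in_s2` are assumptions made for the illustration only.

```python
# Two-state world from the Example section: transition (delta) and reward (r).
DELTA = {("S1", "a1"): "S1", ("S1", "a2"): "S2",
         ("S2", "a1"): "S2", ("S2", "a2"): "S1"}
REWARD = {("S1", "a1"): 0, ("S1", "a2"): -1,
          ("S2", "a1"): 1, ("S2", "a2"): 5}


def state_value(policy, start, gamma=0.9, steps=500):
    """Discounted long-term reward from `start` when following `policy`.

    `gamma` (assumed) discounts later rewards; `steps` truncates the
    otherwise infinite sum once the remaining terms are negligible.
    """
    state, total, discount = start, 0.0, 1.0
    for _ in range(steps):
        action = policy[state]
        total += discount * REWARD[(state, action)]
        state = DELTA[(state, action)]
        discount *= gamma
    return total


# Hypothetical policy: move to S2 and then stay there.
stay_in_s2 = {"S1": "a2", "S2": "a1"}
values = {s: state_value(stay_in_s2, s) for s in ("S1", "S2")}
print(values)  # approximately 8 for S1 and 10 for S2
```

Under this policy S2 is worth about 10 (a reward of +1 forever, discounted) and S1 about 8 (one punishment of -1, then the discounted value of S2). A different policy gives a different value-function over the same states.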

A key distinguishing feature of many reinforcement learning methods is that they learn policies indirectly, by learning value-functions instead.

RL methods can be contrasted with direct optimization methods, such as genetic algorithms (GA), which attempt to search the policy space directly.



Concrete Algorithms for RL

The following three classes of algorithms solve the full version of the reinforcement learning problem, and all support delayed rewards. As with everything, each of the three methods has its own unique strengths and weaknesses.

The methods also differ in several ways with respect to their efficiency and convergence speed.

The methods can even be combined into a hybrid algorithm that draws on the best features of each.

Dynamic Programming (DP)

Dynamic programming methods are very well developed mathematically; however, they require a complete and accurate model of the environment. This makes them suitable for a simulated environment, but not necessarily for a real-world environment.
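To make the dependence on a complete model concrete, here is a minimal sketch of value iteration, one standard dynamic programming method, applied to the two-state world from the Example section of this page. The model is handed to the algorithm in full as the `DELTA` and `REWARD` tables. The discount factor of 0.9, the fixed number of sweeps, and the function name are assumptions for illustration.

```python
# Complete model of the two-state world from the Example section.
STATES = ["S1", "S2"]
ACTIONS = ["a1", "a2"]
DELTA = {("S1", "a1"): "S1", ("S1", "a2"): "S2",
         ("S2", "a1"): "S2", ("S2", "a2"): "S1"}
REWARD = {("S1", "a1"): 0, ("S1", "a2"): -1,
          ("S2", "a1"): 1, ("S2", "a2"): 5}


def value_iteration(gamma=0.9, sweeps=500):
    """Return (optimal state values, greedy policy) for the model above."""

    def backup(state, action, values):
        # One-step lookahead: immediate reward plus discounted value of
        # the successor state, read directly from the model.
        return REWARD[(state, action)] + gamma * values[DELTA[(state, action)]]

    values = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        values = {s: max(backup(s, a, values) for a in ACTIONS) for s in STATES}
    policy = {s: max(ACTIONS, key=lambda a: backup(s, a, values)) for s in STATES}
    return values, policy


values, policy = value_iteration()
print(values)
print(policy)
```

With these assumptions the greedy policy chooses a2 in both states: accepting the -1 punishment in S1 is worthwhile because it sets up the +5 reward in S2. Note that the agent never acts in the world here; everything is computed from the model.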

Monte Carlo (MC) Methods

Monte Carlo methods don't require an environment model (in contrast to DP), and are conceptually simple, but they're not suited to step-by-step incremental computation.

MC is a class of methods specifically for learning value-functions. MC methods estimate the value of a state by running numerous trials, starting at that state, then averaging the total rewards received on those trials.
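The averaging idea can be sketched as follows, again on the two-state world from the Example section of this page. Several choices here are assumptions rather than part of the original description: the agent follows a uniformly random policy (so that trials actually differ), rewards are discounted by 0.9, each trial is cut off after 100 steps, and the random seed is fixed for repeatability.

```python
import random

# The estimator only samples from these tables; it never inspects them
# the way a dynamic programming method would.
ACTIONS = ["a1", "a2"]
DELTA = {("S1", "a1"): "S1", ("S1", "a2"): "S2",
         ("S2", "a1"): "S2", ("S2", "a2"): "S1"}
REWARD = {("S1", "a1"): 0, ("S1", "a2"): -1,
          ("S2", "a1"): 1, ("S2", "a2"): 5}


def sample_return(start, rng, gamma=0.9, horizon=100):
    """Run one trial under a random policy; return its discounted reward total."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        action = rng.choice(ACTIONS)
        total += discount * REWARD[(state, action)]
        state = DELTA[(state, action)]
        discount *= gamma
    return total


def mc_value(start, episodes=5000, seed=0):
    """Monte Carlo estimate: average the returns of many trials from `start`."""
    rng = random.Random(seed)
    return sum(sample_return(start, rng) for _ in range(episodes)) / episodes


print(mc_value("S1"), mc_value("S2"))
```

Solving the two corresponding linear equations by hand gives exact values of 10.75 for S1 and 14.25 for S2 under this random policy, so the estimates should land close to those numbers. Each estimate is only available once whole trials have finished, which is the sense in which MC is not step-by-step incremental.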

Temporal-Difference Learning (TDL)

Temporal-difference methods also require no environment model (in contrast to DP) and are fully incremental (in contrast to MC), but they are more complex to analyze.

Temporal-Difference Learning is a class of learning methods based on the idea of comparing temporally successive predictions. It is possibly the single most fundamental idea in all of reinforcement learning.
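A minimal sketch of the simplest member of this class, usually called TD(0), is shown below on the two-state world from the Example section of this page. After every single step, the prediction for the current state is nudged toward the reward plus the discounted prediction for the next state. The random policy, the step size `alpha`, the discount factor, the step count, and the seed are all assumptions chosen for illustration.

```python
import random

ACTIONS = ["a1", "a2"]
DELTA = {("S1", "a1"): "S1", ("S1", "a2"): "S2",
         ("S2", "a1"): "S2", ("S2", "a2"): "S1"}
REWARD = {("S1", "a1"): 0, ("S1", "a2"): -1,
          ("S2", "a1"): 1, ("S2", "a2"): 5}


def td0(steps=50_000, alpha=0.02, gamma=0.9, seed=0):
    """Estimate state values under a uniformly random policy with TD(0)."""
    rng = random.Random(seed)
    values = {"S1": 0.0, "S2": 0.0}
    state = "S1"
    for _ in range(steps):
        action = rng.choice(ACTIONS)
        reward = REWARD[(state, action)]
        next_state = DELTA[(state, action)]
        # Difference between two temporally successive predictions:
        # (reward + discounted next prediction) versus current prediction.
        td_error = reward + gamma * values[next_state] - values[state]
        values[state] += alpha * td_error
        state = next_state
    return values


values = td0()
print(values)
```

The estimates should hover around the same targets as a Monte Carlo estimate for this policy (10.75 for S1 and 14.25 for S2), but they are updated after every step rather than at the end of a trial, and the constant step size means they keep fluctuating slightly instead of settling exactly.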



Markov Decision Process (MDP)

A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. Furthermore, if the state space and the action space are finite, then it is called a finite Markov decision process (finite MDP).

Finite Markov Decision Process (FMDP)

Finite MDPs are particularly important to the theory of reinforcement learning; in fact, they are all that is required to understand 90% of modern reinforcement learning.



Framework

  • There is a set S of states, and a set A of actions.
  • At each time-step t, the agent is in some state s_t.
  • The agent must select a particular action a_t from the set A while in state s_t (of set S).
  • Each action taken will result in a reward, a punishment, or nothing.
  • The mathematical representation of this is as follows...
    • s_{t+1} = δ(s_t, a_t), r_{t+1} = r(s_t, a_t)
    • δ is the transition function
    • r is the reward function
    • s_t is the state at time-step t
    • a_t is the action taken at time-step t

The aim is to find an optimal policy π: S → A which maximizes the cumulative reward.


Example

Consider a world with two states and two actions...

  • S = { S1, S2 }
  • A = { a1, a2 }

...where the transition δ and reward r are described as follows...

  • δ(S1, a1) = S1, r(S1, a1) = 0 (No reward)
  • δ(S1, a2) = S2, r(S1, a2) = -1 (Punishment)
  • δ(S2, a1) = S2, r(S2, a1) = +1 (Reward)
  • δ(S2, a2) = S1, r(S2, a2) = +5 (Reward)
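The example world above is small enough to write down directly as two lookup tables. The sketch below does so and then follows two hand-picked policies for a few steps, recording the rewards. The names `rollout`, `stay_in_s2`, and `alternate` are hypothetical and exist only for this illustration.

```python
# Transition function delta and reward function r, copied from the example.
DELTA = {("S1", "a1"): "S1", ("S1", "a2"): "S2",
         ("S2", "a1"): "S2", ("S2", "a2"): "S1"}
REWARD = {("S1", "a1"): 0, ("S1", "a2"): -1,
          ("S2", "a1"): 1, ("S2", "a2"): 5}


def rollout(policy, start, steps):
    """Follow `policy` (a dict mapping state -> action) and list the rewards."""
    state, rewards = start, []
    for _ in range(steps):
        action = policy[state]
        rewards.append(REWARD[(state, action)])
        state = DELTA[(state, action)]
    return rewards


stay_in_s2 = {"S1": "a2", "S2": "a1"}  # move to S2, then stay there
alternate = {"S1": "a2", "S2": "a2"}   # bounce between S1 and S2

print(rollout(stay_in_s2, "S1", 4))  # [-1, 1, 1, 1]
print(rollout(alternate, "S1", 4))   # [-1, 5, -1, 5]
```

Both policies accept the same initial punishment, but bouncing between the states collects +5 every second step, while staying in S2 collects only +1 per step. This is the kind of delayed trade-off an optimal policy has to account for.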



Models of Optimality

  • Finite Horizon Reward
    • Computationally simple.
  • Average Reward
    • Easier for proving theorems.
  • Infinite Discounted Reward
    • Difficult to work with, as there is no objective way to weigh small rewards soon against large rewards in the future - is a fast nickel worth a slow dime?
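The three criteria can be compared on a concrete reward stream. The sketch below scores the stream produced by alternating between S1 and S2 in the two-state world from the Example section (rewards -1, +5, -1, +5, ...). The function names, the 100-step stream length, and the particular horizons and discount factors are assumptions for illustration.

```python
def finite_horizon(rewards, horizon):
    """Sum of the first `horizon` rewards only."""
    return sum(rewards[:horizon])


def average_reward(rewards):
    """Mean reward per step over the whole stream."""
    return sum(rewards) / len(rewards)


def discounted(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its time-step."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Reward stream from alternating S1 -> S2 -> S1 ... starting in S1.
alternate = [-1, 5] * 50

print(finite_horizon(alternate, 1))  # -1: a one-step horizon sees only the punishment
print(finite_horizon(alternate, 2))  # 4
print(average_reward(alternate))     # 2.0
print(discounted(alternate, 0.1))    # negative: a short-sighted agent rates this as a loss
print(discounted(alternate, 0.9))    # positive: a far-sighted agent rates it as a gain
```

Staying in S1 forever earns exactly 0 under every criterion, so whether alternating counts as better depends on the horizon or discount factor chosen. That choice is the fast nickel versus slow dime question, and nothing in the environment itself settles it.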



Environments

Learning a value function is possible in the following environments...


Reinforcement vs Supervised Learning

  • No presentation of input/output pairs - i.e. no training data.
  • Agent chooses actions, and receives reinforcement.
  • The environment is usually stochastic.
  • It is important that the agent performs well online, not just offline.
  • Systems must explore the space of actions.


Applications

There are many practical applications for Reinforcement Learning and these are just a few examples...

  • Control Problems
  • Games
  • Other Sequential Decision-Making

