[ot][spam][crazy] draft: learning RL

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Sun May 8 07:19:56 PDT 2022


-------------

I'm looking through the reading for unit 1. I've seen some of this before
and it seems unnecessary, as is usually the case with first units.

So far it's mostly definition of terms.

https://huggingface.co/blog/deep-rl-intro

Example of deep RL: a computer uses inaccurate simulations of a robot hand
to train a real robot hand to dexterously manipulate objects with its
fingers on the first go, using a general-purpose RL algorithm:
https://openai.com/blog/learning-dexterity/

Deep RL is taught in a way that harshly separates the use from the
implementation. It is of course a hacker's job to break down that
separation.

Update loop (works better if many of these happen in parallel):
State -> Action -> Reward -> Next State

The rewards accumulate over many updates into a cumulative reward, or
"return", and the expected return is what the model that chose the actions
is updated to maximize.

"Reward hypothesis" : All goals can be expressed as the maximization of an
expected return.

"Markov Decision Process" : An academic term for reinforcement learning.
"Markov Property" : A property an agent has if it does not need to be
provided with historical information, and can operate only on current state
to succeed.

To me, the Markov property implies that an agent stores sufficient
information within itself in order to improve, but this is not stated in
the material. The Markov property seems like a user-focused worry to me, at
this point.

"State" : A description of all areas of the system the agent is within.

"Observation" : Information on only part of the system the agent is within,
such as areas near it, or from local sensors.

"Action space" : The set of all actions the agent may take in its
environment.

"Discrete action space" : An action space that is finite and completely
enumerable.

"Continuous action space" : An action space that is effectively infinite
[and subdivisible].

Some RL algorithms are specifically better at working with discrete or with
continuous action spaces.
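
As a sketch, gym expresses the two kinds of action space like this (the
sizes and bounds here are made up):

from gym import spaces

discrete = spaces.Discrete(4)                            # four enumerable actions: 0..3
continuous = spaces.Box(low=-1.0, high=1.0, shape=(2,))  # two real-valued controls in [-1, 1]

print(discrete.sample())     # e.g. 2
print(continuous.sample())   # e.g. [ 0.31 -0.87]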

"Cumulative Reward"

The cumulative reward to maximize is defined as the sum of all rewards
from the current step onward, out to the end of time itself.

Since rewards far in the future are less certain, and for unending tasks
the sum would never terminate, "discounting" is used to weight them less.

"Discount rate" or "gamma" : Usually between 0.99 and 0.95.

When gamma is high, the discounting is lower, and the agent prioritises
long term reward. When gamma is low, the discounting is higher, and the
agent prioritises short term reward.

When calculating the cumulative reward, each reward is multiplied by gamma
to the power of the time step.
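
As a sketch, with made-up rewards:

def discounted_return(rewards, gamma=0.99):
    # each reward is weighted by gamma raised to the power of its time step
    return sum(reward * gamma ** t for t, reward in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.95))  # 1 + 0.95 + 0.9025 = 2.8525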

"Task" : an instance of a reinforcement learning problem

"Episodic task" : a task with a clear starting and ending point, such as a
level in a video game

"Continuous task" : an unending task where there is no opportunity to learn
over completion of the task, like stock trading

"Exploration" : spending time doing random actions to learn more
information about the environment

"Exploitation" : using known information to maximize reward

There are different ways of balancing exploitation and exploration, but
roughly: with too much exploitation the agent never leaves the part of the
environment it already knows and keeps picking the most rewarding immediate
thing, whereas when exploring it spends time on poor rewards to see if it
can find better ones.
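
One common way to make the trade-off is epsilon-greedy: with probability
epsilon take a random action (explore), otherwise take the action currently
believed best (exploit). A sketch, where the q_values table is a stand-in
for whatever the agent has learned so far:

import random

def epsilon_greedy(q_values, state, n_actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(n_actions)     # explore: random action
    # exploit: action with the highest estimated value for this state
    return max(range(n_actions), key=lambda a: q_values.get((state, a), 0.0))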

"Policy" or "pi" : The function that selects actions based on states. The
optimal policy pi* is found based on training.

"Direct training" or "Policy-based methods" : The agent is taught which
action to perform, given its state.

"Indirect training" or "Value-based methods" : The agent is taught which
states are valuable, and then selects actions that lead to these states.

[ed: this seems obviously a spectrum of generality; it's a little
irritating that direct training is mentioned at all, with no further things
listed after indirect training. Maybe I am misunderstanding something. I'm
letting my more general ideas on how to approach this slip to the wayside a
little, because I haven't been able to do anything with this stuff my
entire life, so this being valid to pursue seems useful. The below stuff
would be generalised and combined into a graph that feeds back to itself to
change its shape (or its metaness), basically.]

Policy Based Methods map each state to the best corresponding action, or
to a probability distribution over actions.

"Deterministic policy" : each state will always return the same action.
"Stochastic policy" : each state produces a probability distribution of
actions
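
A sketch of the two styles, with tiny made-up tables standing in for the
learned model:

import random

def deterministic_policy(state):
    # the same state always returns the same action
    table = {"start": "go_right", "middle": "jump"}
    return table[state]

def stochastic_policy(state):
    # each state holds a probability distribution over actions; sample it
    distributions = {"start": {"go_right": 0.8, "jump": 0.2}}
    actions, probs = zip(*distributions[state].items())
    return random.choices(actions, weights=probs)[0]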

Value Based Methods map each state to the expected value of being in that
state. The value of a state is the expected return when starting from that
state and then always traveling to the highest-value state.

The description of value based methods glosses over (doesn't mention) the
need to retain a map of how actions move the agent through states.
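
A sketch of value-based action selection with that map made explicit as a
hypothetical transitions dict:

def greedy_action(state, actions, transitions, state_values):
    # pick the action whose resulting state has the highest learned value
    return max(actions, key=lambda a: state_values[transitions[(state, a)]])

state_values = {"ledge": 0.2, "goal": 1.0}
transitions = {("start", "left"): "ledge", ("start", "right"): "goal"}
print(greedy_action("start", ["left", "right"], transitions, state_values))  # "right"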

"Deep reinforcement learning" : reinforcement learning that uses deep
neural networks in its policy algorithm(s).

The next section will engage value-based Q-Learning, first classic tabular
reinforcement learning and then Deep Q-Learning. The difference is whether
the mapping Q from state-action pairs to values is kept as a table or
approximated by a neural network.
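
A sketch of the classic table-based update; Deep Q-Learning swaps the
table lookup for a neural network that predicts Q(state, action):

from collections import defaultdict

q_table = defaultdict(float)   # maps (state, action) -> estimated value

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    # move Q(state, action) toward reward + gamma * best value at next_state
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])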

The lesson refers to https://course.fast.ai/ for more information on deep
neural networks.

The activity is training a moon lander at
https://github.com/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb .

The homework for beginners is to make the moon lander succeed, and to go
inside a little of the source code and recode it manually to get more
control over it (for example, what would be needed to make an environment
class with a different spec?).

The homework for experts is to do the tutorial offline, and to either
privately train an agent that reaches the top of the leader boards, or
explain clearly why you did not.
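
For reference, a minimal offline sketch of the kind of thing the notebook
does, assuming gym's LunarLander-v2 environment and stable-baselines3's PPO
(the step budget here is a placeholder, not the notebook's setting):

import gym
from stable_baselines3 import PPO

env = gym.make("LunarLander-v2")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)   # placeholder budget; more steps land better
model.save("ppo_lunarlander")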

The lesson states there is additional reading in the readme at
https://github.com/huggingface/deep-rl-class/blob/main/unit1/README.md .