This is actually several blog entries from my old website
stitched together. I hate to lose these things when I migrate software,
so I’m trying to keep them alive.
These are some pretty random thoughts, btw. My opinions have likely changed since writing this :)
So,
I'm reading this book "How the Mind Works" by Steven Pinker. It's great.
It describes how evolution has, over time, produced very specialized
mental functions in the brain. The idea is that we sometimes take for
granted that complex physical structures have evolved, but we think of
the mind as some general-purpose thinking machine. Pinker's view is
that the mind has evolved in much the same way as the rest of the body. So
this got me thinking about the reward function in reinforcement
learning...
So,
in reinforcement learning, we generally have some states and a reward
function, and we want to find a policy that maximizes the discounted
sum of future rewards generated by this function. We have decent
solutions to finding such a policy in fairly complex domains.
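
To make "discounted sum of future rewards" concrete, here is a minimal sketch. The function name and the discount factor `gamma` are my own choices for illustration, not anything from a particular library.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over a finite list of rewards."""
    total = 0.0
    # Work backwards so each step computes G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three moves that each cost -1, then a final move with reward 0.
print(discounted_return([-1, -1, -1, 0], gamma=0.5))
```

A policy is "good" in this sense if it tends to produce reward sequences for which this number is large.
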
But...
it takes a long time. And really, in real life we don't have a long
time. Take animals for example. I know, I know - trying to relate a new
idea to something I hardly understand from nature is a farce, but just
hear me out on this illustrative example. Animals know bad tastes from
good tastes. They have a natural aversion to things that taste bitter and
a natural attraction to things that taste sweet. This isn't something
that they learn; it is something that they are born with. Why? Why not
learn it? Because if animals had to learn everything from scratch, they
would die. Extinction. Animals run from loud noises. Same deal.
Evolution programmed some things in to help animals survive.
Ok,
but how does this apply to learning? Well, animals learn to associate
things. They can learn to associate people with loud noises for example
- so stay away from people. Or maybe they will associate people with
food (don't feed the wildlife) - so people become a secondary
reinforcer, and approaching people becomes a good thing.
Maybe
(and just maybe) if we want our agents to learn quickly and generalize
well, we need to tailor their reward function more than we have been. I
mean, look at a big maze. You can give a reward of -1 for every
state-action pair except escape, and then marvel that the agent learns
the fastest way out. The problem is that it learns this optimal policy
only in the limit, which can take a long time. When people first learn
of reinforcement learning, they almost always say "Can't we give
positive rewards for going near the exit and negative rewards for
being far from it?" The common answer from the classical viewpoint is
"no". The reason is that you are then doing all of the work: analyzing
the domain and crafting a reward function that helps the agent. Better,
or so we're told, is to just tell the agent the end result of what you
want it to do: exit the maze. Make everything else bad, and eventually
the agent will work things out.
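
The two kinds of maze reward being contrasted can be sketched side by side. This assumes cells are `(row, col)` tuples; the function names and the `bonus` weight are made up for illustration. The "tailored" version uses Manhattan distance, which ignores walls, and that is exactly the sort of hand-crafted domain knowledge the classical answer objects to.

```python
def step_cost_reward(next_cell, exit_cell):
    """Classical setup: -1 for every move, 0 for the move that escapes."""
    return 0.0 if next_cell == exit_cell else -1.0


def manhattan(a, b):
    """Grid distance between two (row, col) cells, ignoring walls."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def distance_shaped_reward(cell, next_cell, exit_cell, bonus=0.5):
    """Tailored variant: the step cost, plus a bonus for moving closer
    to the exit and an equal penalty for moving away from it."""
    progress = manhattan(cell, exit_cell) - manhattan(next_cell, exit_cell)
    return step_cost_reward(next_cell, exit_cell) + bonus * progress


exit_cell = (3, 3)
print(distance_shaped_reward((0, 0), (0, 1), exit_cell))  # moved closer
print(distance_shaped_reward((0, 1), (0, 0), exit_cell))  # moved away
```
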
All
those points are valid. But what if we have additional constraints?
Say the agent is a failure if it does not exit the maze within a fixed
time limit. Even if the agent is given the task over and over, it
may take a huge amount of time before it finds a way out. But... if we
had a reward function that rewards subgoal behaviour, perhaps this
agent could learn its way out quickly, and on the first try. Wouldn't
that be neat? I think so.
So,
what does it take to tailor a reward function? Work. You have to try a
bunch of them, and do some sort of local search to get better ones. The
good news is that you can do it in parallel, which saves some time.
I
think the big win here actually is going to be with function
approximation. What we often find is that we have a function
approximator which doesn't provide optimal discrimination along lines
that are necessary to maximize some reward function. Like, we want the
robot to get out of its pen, but the pen is round and our function
approximator uses squares. So, maybe the agent needs to do some bumping
into walls and zigzagging because some squares are "good sometimes" and
"bad other times". This is a bit wishy-washy, but stay with me. Maybe
with an evolving reward function, we can make the task easier to learn.
Maybe we can provide rewards in such a way that the overall task
(escaping the pen) is made as easy as possible given the function approximation.
Maybe the reward function can evolve to exploit regularities in the
function approximator. Heck, maybe we can evolve the reward function
and the feature set in parallel and find interesting features that give
us discrimination and generalization exactly where we need it. Maybe we
could even evolve a starting policy at the same time and build in instinctive behaviour.
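
The round-pen-versus-squares mismatch can be shown in a few lines. This is only a sketch under my own assumptions: a pen of radius 1 centred at the origin and square tiles of side 0.5 as the features. Two points land in the same square even though one is inside the pen and the other is outside, so whatever value that square holds has to be "good sometimes" and "bad other times".

```python
import math


def tile_of(x, y, tile_size=0.5):
    """Index of the square tile (the feature) containing point (x, y)."""
    return (math.floor(x / tile_size), math.floor(y / tile_size))


def inside_pen(x, y, radius=1.0):
    """True if the point lies inside a round pen centred at the origin."""
    return math.hypot(x, y) < radius


a, b = (0.6, 0.6), (0.9, 0.9)
print(tile_of(*a) == tile_of(*b))      # both fall in the same square
print(inside_pen(*a), inside_pen(*b))  # only one of them is in the pen
```
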
Anyways,
these are some ideas. I don't know if they've been done. I'm about to
read Geoffrey Hinton's paper on "How learning can guide evolution". I
think maybe it's backwards and we should use evolution to guide
learning... but maybe not.
If this is new, it's going to be some sort of search in reward space. Maybe I can bundle it up into a neat paper.
This is a bit of an extension to the story above... I did some more thinking and read Geoffrey Hinton's paper.
So,
we're talking about crafting a reward function. But, this makes the
bottom fall out of our barrel. If agents are supposed to maximize their
reward, and we are learning a reward function to help the agent
succeed, the obvious degenerate case is for the agent to get high
reward for doing nothing (or doing anything).
How
does nature deal with this problem? Nature doesn't even consider the
problem, because the goal and the rewards are distinct. It doesn't
matter how happy I am in my life, or how much reward I accumulate; if I
do not reproduce, then my genes have failed in their goal, which is to
propagate themselves. One distinct, simple goal. Survive.
We
can see this in many different aspects of human nature (I think - I'm
no psychologist). Why is getting better than having? Why is the thrill
in the chase? Why do rich people gamble? Why take the smaller payout
instead of the larger one spread over time? People like to get.
Where am I going with this?
I'm
going to postulate that people like getting because there is a reward
for getting. I'll come back to this if I can make it more clear.
In
the maze example, we can see what we need to do. The reward evolution
decides how the agent is rewarded, but (like in nature - sheesh I'm
doing it again) the agent needs to be evaluated by an external process.
Did it get out of the maze? Did it get out of the maze fast? It
doesn't really matter how much reward (fun) the agent had running
around the maze; it matters whether it got out. That is the fitness function
that guides the reward evolution and eventually evaluates the agent.
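
Here is a rough sketch of that separation between reward and fitness, with everything invented for illustration: a corridor of cells 0..10 standing in for the maze, a greedy tabular Q-learner, a single evolved reward parameter `w` that pays for progress toward the exit, and plain hill climbing standing in for "evolution". The fitness function never looks at the reward the agent collected, which is what rules out the degenerate "high reward for doing nothing" case.

```python
import random

N = 10        # exit cell; cell 0 is a dead end
START = 5
BUDGET = 30   # hard time limit on the single attempt


def run_first_try(w, seed):
    """One attempt by a greedy Q-learner whose internal reward is
    -1 per step plus w * (progress toward the exit).
    Returns the number of steps used, or None if the budget ran out."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N + 1) for a in (-1, 1)}
    s = START
    for step in range(1, BUDGET + 1):
        best = max(q[(s, -1)], q[(s, 1)])
        a = rng.choice([b for b in (-1, 1) if q[(s, b)] == best])
        s2 = min(max(s + a, 0), N)
        reward = -1.0 + w * (s2 - s)   # the evolved, internal reward
        future = 0.0 if s2 == N else max(q[(s2, -1)], q[(s2, 1)])
        q[(s, a)] = reward + future    # learning rate 1, no discounting
        s = s2
        if s == N:
            return step
    return None


def fitness(w, seeds=range(20)):
    """External judge: ignores collected reward and scores only whether,
    and how fast, the agent got out on its first try."""
    total = 0.0
    for seed in seeds:
        steps = run_first_try(w, seed)
        if steps is not None:
            total += 1.0 + (BUDGET - steps) / BUDGET
    return total / len(seeds)


def evolve_reward(generations=30, seed=0):
    """Hill climbing over the single reward parameter w."""
    rng = random.Random(seed)
    w, best = 0.0, fitness(0.0)
    for _ in range(generations):
        candidate = w + rng.gauss(0.0, 0.5)
        score = fitness(candidate)
        if score >= best:
            w, best = candidate, score
    return w, best


print(evolve_reward())
```

Each candidate's fitness is independent of the others, so the evaluations could be farmed out in parallel. A real version would search over a much richer reward parameterization than one number.
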
Will this work with more complex tasks?
Maybe
it'll work better. (Maybe not.) We really do need to provide some input
here. I would prefer that we didn't, but for now we will, and we'll keep
it simple. Say we are making a robot that walks. If it falls, it fails. If
it moves forward some distance, it passes. There is the fitness
function. Pass/fail. Maybe.
What
about something like playing blackjack? This is harder. Rationally, it
seems that people are bad at gambling. People get addicted. People
chase their losses with more good money. If leaving with less than you
started with is failing, and leaving with much more than you started
with is winning, maybe our gambling reward function does just the right
thing? The expected value of gambling is negative, so perhaps a few big
bets are better than many smaller bets. If you are down a bunch of
money, the only way to not be down a bunch of money is to win. Maybe
chasing lost money is actually a good thing, given the goal of not being
down a bunch of money.
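
The "few big bets versus many small bets" hunch can be checked on a toy game. This sketch assumes an even-money bet with a fixed win probability `p` slightly below one half (a made-up number, not real blackjack odds) and a goal of doubling your stake. The unit-bet case uses the standard gambler's ruin formula.

```python
def reach_target_unit_bets(start, target, p):
    """Gambler's ruin: chance of reaching `target` before 0 when betting
    one unit at a time at even money, with win probability p != 0.5."""
    ratio = (1.0 - p) / p
    return (1.0 - ratio ** start) / (1.0 - ratio ** target)


def reach_target_one_big_bet(p):
    """Stake everything once to double up: succeeds with probability p."""
    return p


p = 0.49  # hypothetical house edge
print(reach_target_unit_bets(50, 100, p))  # many small bets
print(reach_target_one_big_bet(p))         # one big bet
```

With an unfavourable game, every extra bet gives the house edge another chance to grind you down, so the single big bet reaches the target far more often than the slow approach.
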
Anyways,
so this *is* the hard part, I won't deny it. By going a level up from
the reward function, we have to come up with a simpler fitness
function, something that is almost braindead. If not, then my whole
argument can be called recursively to some higher goal. Maybe that's
not a terrible idea, but it's not the one I want to explore. What do we
do? Maybe standard RL goals are ok. Playing a game - winning the game is
good, losing is bad. If we are using a large population of agents, then
the stochasticity of games and different opponents works itself out. If
we are playing with a single agent, this doesn't work so well. But,
would evolution work with a small population? Nope.
I'm
trying to think of something with a really complicated reward function.
An example at ICML '04 where they did inverse reinforcement learning
was this car driving task. You wanted to stay on the road, not hit
people, not get hit by people, go fast, etc, etc. Their argument (if I
recall) for inverse RL was that people can perform this task well, but
have a hard time constructing the reward function for an agent to do as
well. If the penalty for going off the road is too weak, then the agent
will drive off road to avoid the stochastic nature of traffic. If this
penalty is too strong, then the agent will crash into other cars
instead of veering into the shoulder. What could we do here? I'm not
quite sure. Maybe I'll come back and edit this. Otherwise, send me an
e-mail if you have some idea.
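
The off-road penalty trade-off can be caricatured as a one-step decision. This is a deliberately crude sketch, not the setup from the ICML work: the penalties and the collision probability are invented numbers, and the agent simply picks whichever option is cheaper in expectation.

```python
def choose(off_road_penalty, collision_penalty, p_collision_on_road):
    """Pick the cheaper option in expectation: stay on the road and risk
    a collision, or leave the road and pay the off-road penalty."""
    stay_cost = p_collision_on_road * collision_penalty
    return "off_road" if off_road_penalty < stay_cost else "stay_on_road"


# Normal traffic: a small (hypothetical) 1% collision risk on the road.
print(choose(0.5, 100.0, 0.01))   # weak penalty: leaves the road anyway
# Emergency: a crash is certain unless the car veers onto the shoulder.
print(choose(500.0, 100.0, 1.0))  # strong penalty: takes the crash
# A penalty in between behaves sensibly in both situations.
print(choose(10.0, 100.0, 0.01), choose(10.0, 100.0, 1.0))
```

Even in this toy version the off-road penalty has to sit in a window set by the other terms, which hints at why hand-tuning gets hard once there are many interacting terms.
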
So,
previously - I was all about reward function shaping. I read some work
by Andrew Ng, and he showed that reward shaping can be a little
dangerous, and that perhaps we should instead shape with what he
describes as a potential function. Many of the benefits, with less risk.
Then I looked at Eric Wiewiora's research note showing that this
potential function scheme is the same as just setting the initial value
function to the potential function.
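
The equivalence is easy to check numerically. In the sketch below, the shaping term is gamma * phi(s') - phi(s). The tiny four-state problem, the potential `phi`, and the random experience stream are all my own stand-ins, and there are no terminal states. One Q-learner starts at zero and gets the shaped reward; the other gets the plain reward but starts with its values set to the potential. Fed the same experience, their tables should differ by exactly phi(s) in every state.

```python
import math
import random

GAMMA = 0.9
ALPHA = 0.5
STATES, ACTIONS = range(4), range(2)


def phi(s):
    """Hypothetical potential: higher for states nearer state 3."""
    return float(s)


def q_update(q, s, a, r, s2):
    """One standard Q-learning update."""
    target = r + GAMMA * max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += ALPHA * (target - q[(s, a)])


# Learner A: zero initial values, reward plus the shaping term.
q_shaped = {(s, a): 0.0 for s in STATES for a in ACTIONS}
# Learner B: unshaped reward, initial values set to the potential.
q_init = {(s, a): phi(s) for s in STATES for a in ACTIONS}

rng = random.Random(0)
for _ in range(1000):
    # Feed both learners the same arbitrary experience.
    s, a, s2 = rng.choice(STATES), rng.choice(ACTIONS), rng.choice(STATES)
    r = rng.uniform(-1.0, 1.0)
    shaping = GAMMA * phi(s2) - phi(s)
    q_update(q_shaped, s, a, r + shaping, s2)
    q_update(q_init, s, a, r, s2)

# Check that the tables differ by phi(s), up to floating-point error.
same = all(math.isclose(q_shaped[k] + phi(k[0]), q_init[k], abs_tol=1e-9)
           for k in q_init)
print(same)
```

Because the offset phi(s) is the same for every action in a state, both learners rank the actions identically and would behave the same way.
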
Maybe
this is a win? Setting the initial value function is the same as using a
potential function, which is safer and has all the benefits of changing
the reward function.
So
- I thought about evolving the value function. What I decided was that
evolving the value function has little benefit over starting the agent
over multiple times with random value functions and then taking some of
what was learned in each one and combining it. This, then, is the same
as learning off-policy with a few good exploratory policies?
So, is this whole direction a waste? Perhaps. I want to speak further with Vadim about these ideas and see what he thinks.