
Research Thoughts

This is where I'll post blog entries about research ideas and thoughts.

Thoughts about the role of authors and reviewers in academic research

posted Dec 30, 2008, 10:36 AM by Brian Tanner   [ updated Jan 29, 2009, 7:12 PM ]

John Langford had an interesting post (as usual) on his blog about Adversarial Research.  In the comments on his post, someone anonymously posted the following:

I believe you have the ethical duty *reversed*. The point of reviewing is to identify the papers most likely to carry our field forward. It is our duty as reviewers, as PCs, and as editors to accept the work we view as the highest value, and to discourage work we feel is ‘philosophically incorrect’. If you don’t do this, why bother reviewing? This role is even more important in today’s environment where any work can be made easily available on the web– the only roles for reviewers is to identify the best material and to improve and correct the submissions they read by their comments. Identifying the best fundamentally involves our own philosophical viewpoints– we are trying to remove the nonsense.

This has inspired me to write down some of my thoughts on the subject.  I welcome your comments, unfortunately in the form of e-mail because Google Sites doesn't allow anonymous comments.

What is the role of a reviewer?  I think the answer differs from the viewpoint of the author or the program committee.

Here are a few angles I've heard on the topic from the perspective of the authors.

(Most of the commentary is my personal and humble viewpoint as of this posting.)

Author Perspectives

The "test" approach

Reviewers are there to test whether my work is good enough to be published.

If your work passes the test of peer review, then it's good enough to publish.  This is weird for many reasons, not the least of which is that these same people acknowledge that reviews are very stochastic.  This mindset is dangerous because it can lead to advocacy of a shotgun approach to paper submissions, which overwhelms the system, and disenfranchises reviewers by diluting the quality of submissions.

The "constructive" approach

Reviewers are excellent resources to help me develop my ideas and improve my work.

Submit a partially developed idea with partially convincing results and minimal citations from the literature.  If it's good enough, it'll get into the conference.  If not, the reviewers will make valuable suggestions on how to improve it, and will also point out useful related work from the literature.  The reviewers might also check if the proof is right and point out typos.

The "check and balance" approach

Reviewers are necessary red-tape to protect the community from other authors that don't take submission as seriously as I do.

In a perfect world, I think this would be the ideal role of reviewers from the perspective of the authors.  The author should submit work that they champion and will stand behind.  The author must believe that their work should be published, that their results are reproducible and sufficiently general.  Inaccuracies or other issues found by the reviewers will be a surprise to this author, and a lesson that the author didn't quite do his/her due diligence: something to correct for next time.  This is an unrealistic ideal, but I think it's something to strive toward.


Both "test" and "constructive" authors think that the reviewer owes them something, that the reviewer's responsibility (at least partly) is to help the author.

The "check and balance" author believes that the reviewer owes him nothing, except a cogent argument in the case of rejection.  This author may appreciate feedback and suggestions, but considers them to be strictly a bonus.

Reviewer Perspective

The anonymous post on John Langford's blog was from the perspective of the reviewer, not the author.  That post advocated political reviewing based on selecting the best papers that are most likely to move the community forward.  This won't necessarily bother "test" or "constructive" authors, but will seriously offend "check and balance" authors.

The "check and balance" author believes that the reviewer should be rejecting papers that have not proved their point or made their case.  These authors would probably agree with the Wikipedia definition of peer review (because I certainly do):

Excerpt of definition of Peer Review, from Wikipedia (emphasis mine):
Peer review (also known as refereeing) is the process of subjecting an author's scholarly work, research or ideas to the scrutiny of others who are experts in the same field. Peer review requires a community of experts in a given (and often narrowly defined) field, who are qualified and able to perform impartial review.

By allowing, and even advocating, that reviewers "identify the papers most likely to carry our field forward", impartiality is thrown out the window.  It does explain the following statement, also from Wikipedia.

Excerpt of definition of Peer Review, from Wikipedia (emphasis mine):
Impartial review, especially of work in less narrowly defined or inter-disciplinary fields, may be difficult to accomplish; and the significance (good or bad) of an idea may never be widely appreciated among its contemporaries.

Final Thoughts

I don't really have anything universal to say in this post. I just wanted to tie down a few thoughts and arguments that I've had with others in the last few years.

It's my sincere opinion that authors should strive for "check and balance" expectations, which probably implies submitting fewer papers.  If reviewers try to be impartial and also helpful, perhaps the whole system can work a little bit better, and we can even stop the numbers game.  (That's a plug for a very interesting paper about how counting publications slows scientific progress overall... if you don't have an ACM account the full text is posted here).

Disclaimer: I do research in Canada.  I'm told that we tend to be more collaborative and less competitive because our funding model supports that more than certain other countries.

Radical Transparency in Research

posted Aug 6, 2008, 2:26 PM by Brian Tanner

I've been stewing on some topics for a long time.  I'm a real zealot for strong experimental design and rigorous empiricism when evaluating my work and the work of others.

In helping to organize the reinforcement learning competition I always encouraged that we take conversations out of private e-mail exchanges and into a more public forum, even if the public forum was still private.

I have also decided to open-source many of the projects that I've put time into, from TD-Networks to RL-Viz, the bt-Recordbook, and others.

I've recently realized that there is a thread that is common to all of these endeavors and I believe it is that I am a strong advocate of radical transparency in research.

From Wikipedia:

Radical transparency is a management method where nearly all decision making is carried out publicly.

All draft documents, all arguments for and against a proposal, the decisions about the decision making process itself, and all final decisions, are made publicly and remain publicly archived.

The only exceptions to full transparency include data related to personal security or passwords or keys necessary for physical access required to carry out publicly negotiated decisions. Any technical actions which are perceived to be controversial or political are considered to lack legitimacy until a clear, radically transparent decision has been made concerning them.

This definition isn't perfect in the context of research, but I think you understand the point.  I don't just want the results of an experiment, I want a full account of why that was the experiment to do, and if there were other similar experiments that were tried and may have failed along the way.

This is a plan that I will follow with my own research.  I will be posting all of my code (even code that is in progress) in open-source projects, and I will be carefully documenting design decisions, and keeping track of a horrible amount of results through the bt-recordbook.

If you think this is a good or a bad idea, let me know!

Join the RL Mailing List

posted Jul 20, 2008, 10:36 AM by Brian Tanner

I've finally gotten around to publicizing the Reinforcement Learning Mailing List (rl-list), a Google group for the RL community to discuss events and research topics and to post files and create pages!

Here is the e-mail I sent out to invite our group to this list:

Hi there. You're receiving this invite to the reinforcement learning mailing list (RL-List)!

If you have already joined the RL-List, please ignore this e-mail and accept my apologies.

Currently, the list is just a rough shell of what I think it can become. Feel free to really jump in, request to become a group manager, and help us to shape this group into what we want it to be!

Here is the group's description:
This is the discussion group for reinforcement learning, managed by a group of graduate students and faculty in the reinforcement learning community. The  idea of this group is to announce and discuss ideas and events that are relevant to the reinforcement learning community at large. Members are invited to send messages, create pages, and upload files to the group in order to better share information amongst ourselves.

You should sign up to the list with your preferred Google account (not necessarily Gmail) if you have one.  You can sign up at the RLAI home page:

Of course you can subscribe and unsubscribe at your leisure with whatever accounts you want. The group address is:

It's useful to bear in mind that you'll need to register a Google account for whatever e-mail address you intend to send mail *from* when addressing the list.

Basically, just sign in to Google however you normally do, and then come to and find the link to "join this group".  You can set options to change your e-mail delivery schedule (every e-mail,  daily digest, or no e-mail).

On the speed of research

posted Feb 28, 2008, 8:35 PM by Brian Tanner

Apr 3, 2006

I was at home for the holidays, and I found myself talking to old friends about Ray Kurzweil and his predictions about the future of technology and society. I'll admit, his book "The Age of Spiritual Machines: When Computers Exceed Human Intelligence" was important to me; it was one of the motivators that got me interested in strong artificial intelligence.

So I found myself talking about how technological evolution is an exponential process, and how this has impacted the speed of research. I'm going to rehash the example I gave them here, because it's good to have things written down.

Let's say the year is 1990, and I have an interesting artificial intelligence idea. Is my idea novel? How can I find out? I'll mention it to my colleagues, to determine if anyone has heard of similar work. I'll then take my leads and head to the library. Now I'll search through a book, microfiche, maybe a computer index of conference and journal article titles (and maybe abstracts). This could take a very long time. Finally, I'll have a compiled list of works that may be relevant.

Some of these sources will be available in the well-stocked University library. All I need to do is spend the afternoon running around, finding appropriate volumes, and marking down which of the volumes I need are currently checked out. I'll have to fill out a form requesting the unavailable volumes when they return. That could take a week or so. For the sources not stocked locally, I'll fill out a request form, and those issues will be sent to me from wherever in North America they are, as an inter-library loan. Very cool. That will take several weeks as well. These are all short period loan items, so of course I'll have to spend a few hours photocopying everything that I might want a copy of.

So, after looking at these articles, I will probably learn that they are not exactly what I wanted. But they will probably cite related work that IS exactly what I wanted. So, I'll go back to the library with my new list of sources, and get my hands on what I can.

The funny thing about this story is that, on one hand, it is fantastic. Using the library and inter-library loan procedure, almost any bit of published information is available to me. Quite amazing. The downside, of course, is that it can take weeks to get access to some of the information, and it is not easy to search.

Of course, in today's information age, there are no such restrictions. Not only do we have access to most of these works, they are now accessible from our desks, they are searchable, and it takes seconds instead of weeks to get the information. Literally, I can have an idea in the morning and have a feeling for the related work by the afternoon, while back in 1990 it would have taken weeks. That's the speed of research.

Also, a brief word on processing power. Computers are obviously much faster now than in 1990; but what has the impact of this been? In the past, it would have been necessary to commit significant computing resources to run a new algorithm (or a tweaked or bug-fixed algorithm) on some reasonable dataset, especially if multiple trials need to be performed to establish statistical significance. These experiments could take days or weeks. To run that same experiment now would take seconds or minutes. This acceleration of result availability means that we can try more ideas in a week than could have previously been done over the course of a research period.

In the past, after an interesting idea was published, it might take a year or two before other research groups could follow up on the idea and extend it. That latency has been cut drastically by the combination of the issues I've mentioned, along with others like prepublication manuscripts and online technical reports. This will continue, and we will see new ideas and improvements to existing ideas at an increasingly rapid pace.

Ok, so it's not deep, or life altering information. It's just a thought and/or musing. And it's exciting.

Where do Rewards Come From?

posted Feb 28, 2008, 8:21 PM by Brian Tanner   [ updated Jul 20, 2008, 11:38 AM ]

Apr 3, 2006

This is actually several blog entries from my old website stitched together. I hate to lose these things when I migrate software, so I’m trying to keep it alive.

These are some pretty random thoughts, btw. My opinions have likely changed since writing this :)

Entry 1:

So, I'm reading this book "How the Mind Works" by Steven Pinker. It's great. It speaks to the methods by which time has evolved very specialized mental function in the brain. The idea is that we sometimes take for granted that complex physical structures have evolved, but we think of the mind as some general purpose thinking machine. Pinker's view is that the mind has evolved in a similar way to the rest of the body. So this got me thinking about the reward function in reinforcement learning...

So, in reinforcement learning, we generally have some states and a reward function, and we want to find a policy that maximizes the discounted sum of future rewards generated by this function. We have decent solutions to finding such a policy in fairly complex domains.
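As a small illustration of the quantity being maximized, here is a minimal sketch that computes the discounted sum of rewards for one trajectory. The function name `discounted_return` and the example numbers are made up for illustration; they are not from any particular library.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over one trajectory."""
    total = 0.0
    # Work backwards so each step computes G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

A policy is "better" in this sense when the trajectories it produces have a larger expected value of this sum.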

But... it takes a long time. And really, in real life we don't have a long time. Take animals for example. I know, I know - trying to relate a new idea to something I hardly understand from nature is a farce, but just hear out this illustrative example. Animals know bad tastes from good tastes. They have natural aversion to things that taste bitter and natural attraction to things that taste sweet. This isn't something that they learn, it is something that they are born with. Why? Why not learn it? Because if animals had to learn everything from scratch, they would die. Extinction. Animals run from loud noises. Same deal. Evolution programmed some things in to help animals survive.

Ok, but how does this apply to learning? Well, animals learn to associate things. They can learn to associate people with loud noises for example - so stay away from people. Or, maybe they will associate people with food (don't feed the wildlife) - so people become a secondary reinforcer, and approaching people is a good thing.

Maybe (and just maybe) if we want our agents to learn quickly and generalize well, we need to tailor their reward function more than we have been. I mean, look at a big maze. You can give a reward of -1 for every state-action pair except escape, and then marvel that the agent learns the fastest way out. The problem is that it learns this optimal policy in the limit, which can take a long time. When people first learn of reinforcement learning, they almost always say "Can't we give positive rewards for going near the exit and negative rewards for being far from it?" The common answer from the classical viewpoint is "no". The reason is that then you are doing all of the work: analyzing the domain and crafting a reward function that helps the agent. Better, or so we're told, is to just tell the agent the end result of what you want it to do: exit the maze. Make everything else bad, and eventually the agent will work things out.
All those points are valid. But what if we have additional constraints? Say the agent is a failure if it does not exit the maze within a fixed limit of time. Even if the agent is given the task over and over, it may take a huge amount of time before it finds a way out. But... if we had a reward function that rewards subgoal behaviour, perhaps this agent could learn its way out quickly, and on the first try. Wouldn't that be neat? I think so.

So, what does it take to tailor a reward function? Work. You have to try a bunch of them, and do some sort of local search to get better ones. The good news is that you can do it in parallel, which saves some time.

I think the big win here actually is going to be with function approximation. What we often find is that we have a function approximator which doesn't provide optimal discrimination along lines that are necessary to maximize some reward function. Like, we want the robot to get out of its pen, but the pen is round and our function approximator uses squares. So, maybe the agent needs to do some bumping into walls and zigzagging because some squares are "good sometimes" and "bad other times". This is a bit wishy washy, but stay with me. Maybe with an evolving reward function, we can make the task easier to learn. Maybe we can provide rewards in such a way that the overall task (escaping the pen) is made easiest given the function approximation. Maybe the reward function can evolve to exploit regularities in the function approximator. Heck, maybe we can evolve the reward function and the feature set in parallel and find interesting features that give us discrimination and generalization exactly where we need it. Maybe we could even evolve a starting policy at the same time and build in instinctive behaviour.

Anyways, these are some ideas. I don't know if they've been done. I'm about to read Geoffrey Hinton's paper on "How learning can guide evolution". I think maybe it's backwards and we should use evolution to guide learning... but maybe not.
If this is new, it's going to be some sort of search in reward space. Maybe I can bundle it up into a neat paper.

Entry 2:

This is a bit of an extension to the story above... I did some more thinking and read Geoffrey Hinton's paper.

So, we're talking about crafting a reward function. But, this makes the bottom fall out of our barrel. If agents are supposed to maximize their reward, and we are learning a reward function to help the agent succeed, the obvious degenerate case is for the agent to get high reward for doing nothing (or doing anything).
How does nature deal with this problem? Nature doesn't even consider the problem, because the goal and the rewards are distinct. It doesn't matter how happy I am in my life, or how much reward I accumulate: if I do not reproduce, then my genes have failed in their goal, which is to propagate themselves. One distinct, simple goal. Survive.
We can see this in many different aspects of human nature (I think - I'm no psychologist). Why is getting better than having? Why is the thrill in the chase? Why do rich people gamble? Why take the smaller payout instead of the larger one spread over time? People like to get.

Where am I going with this?

I'm going to postulate that people like getting because there is a reward for getting. I'll come back to this if I can make it more clear.
In the maze example, we can see what we need to do. The reward evolution decides how the agent is rewarded, but (like in nature - sheesh I'm doing it again) the agent needs to be evaluated by an external process. Did they get out of the maze? Did they get out of the maze fast? It doesn't really matter how much reward (fun) the agent had running around the maze, it matters if he got out. That is the fitness function that guides the reward evolution and eventually evaluates the agent. Will this work with more complex tasks?
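Here is a minimal sketch of that separation, under some loud assumptions: the "maze" is a hypothetical 8-state corridor, the reward family (`-1` per move plus a weight `w` times the progress made) is invented for illustration, and a brute-force comparison of four candidate weights stands in for real evolution. The point is only that `fitness` never looks at the reward the agent collected; it looks at how fast the agent got out.

```python
import random

N = 8             # corridor "maze": states 0..7, exit at state 7 (assumed toy task)
ACTIONS = (-1, 1)  # step left / step right
GAMMA, ALPHA = 0.9, 0.5
MAX_STEPS = 500   # time limit; hitting it counts as failing to escape


def make_reward(w):
    # Hypothetical reward family: -1 per move plus w times the progress made.
    # w = 0 is the classic "everything is -1" reward.
    def reward(s, s2):
        return -1.0 + w * (s2 - s)
    return reward


def steps_to_exit(reward, seed):
    """One first-try episode of greedy Q-learning; returns the steps taken."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
    s = 0
    for t in range(1, MAX_STEPS + 1):
        best = max(q[(s, a)] for a in ACTIONS)
        a = rng.choice([a for a in ACTIONS if q[(s, a)] == best])  # random tie-break
        s2 = min(max(s + a, 0), N - 1)  # walls at both ends
        done = s2 == N - 1
        future = 0.0 if done else GAMMA * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += ALPHA * (reward(s, s2) + future - q[(s, a)])
        if done:
            return t
        s = s2
    return MAX_STEPS


def fitness(w, trials=20):
    # External evaluation: average steps to escape (lower is better).
    return sum(steps_to_exit(make_reward(w), seed) for seed in range(trials)) / trials


candidates = [-2.0, 0.0, 0.5, 2.0]
scores = {w: fitness(w) for w in candidates}
best_w = min(scores, key=scores.get)
print(scores, best_w)
```

A reward function that pays the agent for wandering away from the exit (a negative `w` here) can hand out plenty of reward and still score badly, because the score is measured outside the reward.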

Maybe it'll work better. (Maybe not). We really need to provide input here, which I would prefer if we didn't but for now we will and keep it simple. Say we are making a robot that walks. If it falls, it fails. If it moves forward some distance, it passes. There is the fitness function. Pass/fail. Maybe.

What about something like playing blackjack? This is harder. Rationally, it seems that people are bad at gambling. People get addicted. People chase their losses with more good money. If leaving with less than you started with is failing, and leaving with much more than you started with is winning, maybe our gambling reward function does just the right thing? The expected value of gambling is losing, so perhaps a few big bets are better than many smaller bets. If you are down a bunch of money, the only way to not be down a bunch of money is to win. Maybe chasing lost money is actually a good thing, given the goal of not being down a bunch of money.
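The "few big bets versus many small bets" intuition can be checked with a little arithmetic. This is a sketch under simplifying assumptions that are not real blackjack: every bet is even-money and wins with a fixed probability `p = 0.49`, the gambler is down to 50 units and wants to get back to 100, and the function names are made up. The small-bet case uses the standard gambler's ruin formula.

```python
def timid_success(start, goal, p):
    """Chance of reaching `goal` before 0 using 1-unit even-money bets (gambler's ruin)."""
    ratio = (1.0 - p) / p
    return (1.0 - ratio ** start) / (1.0 - ratio ** goal)


def bold_success_from_half(p):
    """Chance of reaching the goal from exactly half of it by staking everything once."""
    return p


p = 0.49  # assumed win probability per even-money bet; not real blackjack odds
timid = timid_success(50, 100, p)
bold = bold_success_from_half(p)
print(f"timid: {timid:.3f}, bold: {bold:.3f}")
```

Under these assumptions the many-small-bets gambler gets back to even roughly 12% of the time, while the single big bet succeeds 49% of the time. So if the only thing that counts is "not being down", chasing the loss with one large bet is the better plan in this toy setting.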

Anyways, so this *is* the hard part, I won't deny it. By going a level up from the reward function, we have to come up with a simpler fitness function, something that is almost braindead. If not, then my whole argument can be called recursively to some higher goal. Maybe that's not a terrible idea, but it's not the one I want to explore. What do we do? Maybe standard RL goals are ok. Playing a game - winning the game is good, losing is bad. If we are using a large population of agents, then the stochasticity of games and different opponents works itself out. If we are playing with a single agent, this doesn't work so well. But, would evolution work with a small population? Nope.

I'm trying to think of something with a really complicated reward function. An example at ICML '04 where they did inverse reinforcement learning was this car driving task. You wanted to stay on the road, not hit people, not get hit by people, go fast, etc, etc. Their argument (if I recall) for inverse RL was that people can perform this task well, but have a hard time constructing the reward function for an agent to do as well. If the penalty for going off the road is too weak, then the agent will drive off road to avoid the stochastic nature of traffic. If this penalty is too strong, then the agent will crash into other cars instead of veering into the shoulder. What could we do here? I'm not quite sure. Maybe I'll come back and edit this. Otherwise, send me an e-mail if you have some idea.

Entry 3:

So, previously - I was all about reward function shaping. I read some work by Andrew Ng, and he showed that reward shaping can be a little dangerous, and perhaps we should do something that he describes with a potential function. Many of the benefits, with less risk. Then I looked at Eric Wiewiora's research note showing that this potential function scheme was the same as just setting the initial value function to the potential function.

Maybe this is a win? Initial value function is the same as using a potential function which is safer and has all the benefits of changing the reward function.
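That equivalence is easy to check numerically on a toy problem. The sketch below is an illustration, not the code from either paper: it assumes a 6-state corridor, tabular Q-learning, a made-up potential `phi` (negative distance to the exit, so it is zero at the exit), and a shaping bonus of the form `gamma * phi(s') - phi(s)`. One learner gets the shaping bonus and starts from zero; the other gets no bonus and starts from `phi`. Both see the same random experience.

```python
import random

N = 6                  # corridor states 0..5; state 5 is the exit (terminal)
GAMMA, ALPHA = 0.9, 0.5
ACTIONS = (-1, 1)      # step left / step right


def phi(s):
    # Hypothetical potential: negative distance to the exit, so phi(exit) == 0.
    return -float(N - 1 - s)


def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return s2, -1.0, s2 == N - 1  # reward of -1 per move


def run(shaped, init_with_phi, episodes=50, seed=0):
    rng = random.Random(seed)
    q = {(s, a): (phi(s) if init_with_phi else 0.0)
         for s in range(N) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS)  # behaviour ignores q, so both runs see the same data
            s2, r, done = step(s, a)
            if shaped:
                r += GAMMA * phi(s2) - phi(s)  # potential-based shaping bonus
            future = 0.0 if done else GAMMA * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += ALPHA * (r + future - q[(s, a)])
            s = s2
    return q


q_shaped = run(shaped=True, init_with_phi=False)
q_init = run(shaped=False, init_with_phi=True)

# If the two schemes are equivalent, the tables differ by exactly phi(s).
gap = max(abs(q_shaped[k] + phi(k[0]) - q_init[k]) for k in q_shaped)
print(gap)
```

The printed gap should be zero up to floating-point rounding: at every step the two learners compute the same TD error, so their action values differ only by the fixed offset `phi(s)` and their greedy choices would be identical.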

So - I thought about evolving the value function. What I decided was that evolving the value function has little benefit over starting the agent over multiple times with random value functions and then taking some of what was learned in each one and combining it. This then, is the same as learning off policy with a few good exploratory policies?

So, is this whole direction a waste? Perhaps. I want to speak further with Vadim about it and see what he thinks.
