Two problems in RL
To understand what curiosity is, we first need to understand the two major problems with RL. The first is the sparse rewards problem: most rewards contain no information, and hence are set to zero.
Remember that RL is based on the reward hypothesis: the idea that every goal can be described as the maximization of reward. Rewards therefore act as feedback for RL agents; if they don't receive any, their knowledge of which actions are appropriate (or not) cannot change.
For instance, in ViZDoom, a set of environments based on the game Doom, in the environment "DoomMyWayHome" your agent is only rewarded if it finds the vest. However, the vest is far from your starting point, so most of your rewards will be zero. If our agent does not receive useful feedback (dense rewards), it will take much longer to learn an optimal policy, and it may spend its time wandering around without ever finding the goal.
The second big problem is that the extrinsic reward function is handmade: in each environment, a human has to implement a reward function. But how can we scale that to big and complex environments?
What is curiosity?
A solution to these problems is to develop a reward function that is intrinsic to the agent, i.e., generated by the agent itself. This intrinsic reward mechanism is known as curiosity because it rewards the agent for exploring states that are novel or unfamiliar: the agent receives a high reward when it explores new trajectories.
This reward design is based on how humans play: we seem to have an intrinsic desire to explore environments and discover new things! There are different ways to calculate this intrinsic reward; in this article, we'll focus on curiosity through next-state prediction.
Curiosity Through Prediction-Based Surprise (or Next-State Prediction)
Curiosity is an intrinsic reward equal to the error of our agent in predicting the next state, given the current state and the action taken. More formally, we can define this as:

\( r^{i}_t = \| \hat{s}_{t+1} - s_{t+1} \|^2_2 \), where \( \hat{s}_{t+1} \) is the dynamics model's prediction of the next state.
The idea of curiosity is to encourage our agent to perform actions that reduce the uncertainty in its ability to predict the consequences of its own actions. Uncertainty will be higher in areas where the agent has spent less time, or in areas with complex dynamics.
If the agent spends a lot of time in these states, it will become good at predicting the next state (low curiosity). On the other hand, in a new, unexplored state, it will be hard to predict the next state (high curiosity).
Consequently, measuring this error requires building a model of the environment's dynamics that predicts the next state given the current state and action. The question we can ask here is: how can we calculate this error?
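To make the idea concrete, here is a minimal sketch of curiosity as prediction error. The helper name `curiosity_reward` and the toy state vectors are assumptions for illustration; in a real agent, the predicted next state would come from a learned dynamics model.

```python
import numpy as np

def curiosity_reward(predicted_next_state, actual_next_state, scale=1.0):
    # Hypothetical helper: intrinsic reward = scaled mean squared error
    # between the dynamics model's prediction and the real next state.
    error = predicted_next_state - actual_next_state
    return scale * float(np.mean(error ** 2))

# Familiar state: the model predicts well -> low curiosity.
low = curiosity_reward(np.array([0.9, 1.1]), np.array([1.0, 1.0]))

# Novel state: the model predicts poorly -> high curiosity.
high = curiosity_reward(np.array([0.0, 0.0]), np.array([3.0, -2.0]))

assert low < high  # curiosity pushes the agent toward poorly-predicted states
```

A familiar transition yields a small reward (0.01 here), while a poorly-predicted one yields a much larger reward (6.5), which is exactly the incentive to explore.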
The Need For A Good Feature Space
Before diving into the description of the module, we must ask ourselves: how can our agent predict the next state given our current state and our action?
But we can’t predict \(s_{t+1}\) by predicting the next frame as we usually do. Why?
First, because it's hard to build a model able to predict a high-dimensional continuous state space, such as an image. Predicting the pixels directly is hard: imagine you're in Doom and you move left; you would need to predict \(248*248 = 61504\) pixels!
This means that, in order to generate curiosity, we can't predict the pixels directly; we need a better feature representation, obtained by projecting the raw pixel space into a feature space that will hopefully keep only the relevant information elements our agent can leverage.
The original paper defines three rules for a good feature space representation:
- Model things that can be controlled by the agent.
- Model things that can’t be controlled by the agent but can affect the agent.
- Don’t model things that are not controlled by the agent and have no effect on it.
The second reason we can't predict \(s_{t+1}\) by predicting the next frame is that it might simply not be the right thing to do. Imagine you need to study the movement of tree leaves in a breeze. It's already hard enough to model the breeze, so predicting the pixel location of each leaf at each time step is even harder.
So instead of making predictions in the raw sensory space (pixels), we need to transform the raw sensory input (array of pixels) into a feature space with only relevant information.
The desired embedding space should:
- Be compact in terms of dimensionality (remove irrelevant parts of the observation space).
- Preserve sufficient information about the observation.
- Be stable, because non-stationary rewards (rewards that shrink over time, since curiosity decreases as states become familiar) make it difficult for reinforcement learning agents to learn.
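The properties above can be illustrated with a toy encoder. The sketch below uses a fixed random linear projection as a stand-in for the learned encoder \( \phi \); the function name `phi`, the sizes, and the projection itself are assumptions for illustration (in practice the encoder is a small CNN trained jointly with the rest of the module), but it shows the dimensionality reduction from the 61504 raw pixels mentioned earlier to a compact feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 248x248 grayscale frame (61504 raw pixels)
# embedded into a compact 128-dimensional feature space.
obs_dim, feat_dim = 248 * 248, 128

# A fixed random linear projection stands in for the learned encoder phi.
W = rng.standard_normal((feat_dim, obs_dim)) / np.sqrt(obs_dim)

def phi(frame):
    # Flatten the frame and project it into the compact feature space.
    return W @ frame.reshape(-1)

frame = rng.standard_normal((248, 248))
features = phi(frame)
assert features.shape == (128,)  # 61504 raw values -> 128 features
```

A random projection is compact and stable, but unlike a learned encoder it cannot distinguish controllable, relevant, or irrelevant parts of the observation, which is why the encoder has to be trained.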
To compute both the predicted and the real feature representation of the next state, we can use an Intrinsic Curiosity Module (ICM). However, this technique has a serious drawback: the noisy TV problem, which we'll introduce next.
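Putting the pieces together, here is a minimal numpy sketch of the forward-model part of such a module: encode the observations with \( \phi \), predict the next feature vector from the current features and the action, and use the prediction error as the intrinsic reward. All names, sizes, and the linear stand-ins for the learned networks are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
obs_dim, feat_dim, n_actions, eta = 16, 8, 4, 0.5  # hypothetical sizes

# Linear stand-ins for learned networks: phi encodes observations into
# features; the forward model predicts the next feature vector from the
# current features plus a one-hot encoding of the action.
W_phi = rng.standard_normal((feat_dim, obs_dim)) / np.sqrt(obs_dim)
W_fwd = rng.standard_normal((feat_dim, feat_dim + n_actions)) / np.sqrt(feat_dim + n_actions)

def phi(obs):
    return W_phi @ obs

def forward_model(features, action):
    one_hot = np.eye(n_actions)[action]
    return W_fwd @ np.concatenate([features, one_hot])

def intrinsic_reward(obs, action, next_obs):
    # Curiosity: eta/2 * || phi_hat(s_{t+1}) - phi(s_{t+1}) ||^2.
    prediction_error = forward_model(phi(obs), action) - phi(next_obs)
    return eta / 2.0 * float(prediction_error @ prediction_error)

obs, next_obs = rng.standard_normal(obs_dim), rng.standard_normal(obs_dim)
r_i = intrinsic_reward(obs, action=2, next_obs=next_obs)
assert r_i >= 0.0  # prediction error is always non-negative
```

In a full implementation, the forward model is trained to minimize this same error while the policy is trained to maximize it as a reward, which is what drives the agent toward transitions it cannot yet predict.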