What is Self-Play?
Training agents correctly in an adversarial game can be quite complex.
On the one hand, we need to find a well-trained opponent for our training agent to play against. On the other hand, even if we had a very strong trained opponent, that would not solve the problem: how is our agent supposed to improve its policy when the opponent is far too strong?
Think of a child who has just started to learn soccer. Playing against a very good soccer player would be pointless: it would be too hard to win, or even to get the ball from time to time, so the child would lose constantly without ever having the chance to learn a good policy.
The best solution is an opponent that is at the same level as the agent and that raises its level as the agent raises its own. If the opponent is too strong, we learn nothing; if it is too weak, we overfit to behaviors that are useless against a stronger opponent later on.
This solution is called self-play. In self-play, the agent uses former copies of itself (of its policy) as an opponent. This way, the agent plays against an opponent of the same level (challenging, but not too challenging), has the opportunity to gradually improve its policy, and then updates its opponent as it becomes better. It's a way to bootstrap an opponent and progressively increase the opponent's complexity.
It’s the same way humans learn in competition:
- We start by training against an opponent of a similar level.
- Then we learn from it, and once we have acquired some skills, we can move on to stronger opponents.
We do the same with self-play:
- We start with a copy of our agent as the opponent; this way, the opponent is at a similar level.
- We learn from it, and once we have acquired some skills, we update the opponent with a more recent copy of our training policy, as the sketch below illustrates.
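Here is a minimal Python sketch of such a self-play loop. It is a conceptual illustration, not ML-Agents code: the `policy` object (with `copy()` and `update()` methods) and the `play_episode(agent, opponent)` helper are assumed placeholders standing in for your training algorithm and environment.

```python
import random

def self_play_training(policy, play_episode, total_steps,
                       save_steps=20_000, window=10,
                       play_against_latest_model_ratio=0.5):
    """Conceptual self-play loop (placeholders, not ML-Agents internals)."""
    opponent_pool = [policy.copy()]  # frozen snapshots of past selves

    for step in range(1, total_steps + 1):
        # Face the latest policy with probability
        # `play_against_latest_model_ratio`, otherwise sample a past snapshot.
        if random.random() < play_against_latest_model_ratio:
            opponent = policy
        else:
            opponent = random.choice(opponent_pool)

        experience = play_episode(policy, opponent)  # assumed environment helper
        policy.update(experience)                    # assumed training update

        # Every `save_steps` steps, freeze a new copy of the current policy
        # and keep only the `window` most recent snapshots in the pool.
        if step % save_steps == 0:
            opponent_pool.append(policy.copy())
            del opponent_pool[:-window]

    return policy
```

The names of the keyword arguments are no accident: they mirror the self-play hyperparameters that MLAgents exposes, which we look at in the next section.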
Self-Play in MLAgents
Self-Play is integrated into the MLAgents library and is managed by multiple hyperparameters that we're going to study. As the documentation explains, the main consideration is the tradeoff between the skill level and generality of the final policy and the stability of learning.
Training against a set of slowly changing or unchanging adversaries with low diversity results in more stable training, but it carries a risk of overfitting if the opponents change too slowly.
We then need to control:
- How often we swap opponents, with the `swap_steps` and `team_change` parameters.
- The number of opponents saved, with the `window` parameter. A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors, since it will contain policies from earlier in the training run.
- The probability of playing against the current self versus an opponent sampled from the pool, with the `play_against_latest_model_ratio` parameter. A larger value of `play_against_latest_model_ratio` means the agent will play against the latest policy more often.
- The number of training steps before saving a new opponent, with the `save_steps` parameter. A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles, since the policy receives more training between saves.
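In MLAgents, these hyperparameters go in the `self_play` section of the behavior's trainer configuration file. Below is an illustrative YAML sketch: the behavior name `SoccerTwos`, the choice of trainer, and all numeric values are placeholder examples meant to show where each parameter lives, not recommended settings.

```yaml
behaviors:
  SoccerTwos:                # placeholder behavior name
    trainer_type: poca       # example trainer; values below are illustrative
    self_play:
      save_steps: 50000                     # steps between saved policy snapshots
      team_change: 200000                   # steps before switching the learning team
      swap_steps: 2000                      # steps between swaps of the opponent's policy
      window: 10                            # number of past snapshots kept in the pool
      play_against_latest_model_ratio: 0.5  # probability of facing the latest policy
      initial_elo: 1200.0                   # starting rating used for evaluation
```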
We evaluate our agent with the ELO score, a rating system that estimates a player's relative skill from its wins, losses, and draws against other rated players.
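As a quick refresher on how ELO works, here is a minimal Python sketch of the standard rating update; the K-factor of 32 and the ratings in the example are arbitrary illustrative values.

```python
def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Standard ELO update after one game.

    score_a is 1.0 if player A won, 0.0 if A lost, and 0.5 for a draw.
    """
    # Expected score of A from the rating gap (logistic curve, base 10, scale 400).
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    expected_b = 1.0 - expected_a

    # Each rating moves toward the actual result, scaled by the K-factor.
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b


# Example: a 1200-rated agent beats a 1250-rated opponent.
# The agent gains a bit more than half of K because it was the slight underdog.
print(update_elo(1200.0, 1250.0, 1.0))  # ≈ (1218.3, 1231.7)
```

The higher the rating climbs during training, the stronger the agent is relative to the opponents it has faced.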