What is it?
Unity ML-Agents is a toolkit for the game engine Unity that allows us to create environments using Unity or use pre-made environments to train our agents.
The six components
With Unity ML-Agents, you have six essential components:
- The first is the Learning Environment, which contains the Unity scene (the environment) and the environment elements (game characters).
- The second is the Python Low-level API, which contains the low-level Python interface for interacting with and manipulating the environment. It’s the API we use to launch the training.
- Then, we have the External Communicator, which connects the Learning Environment (made with C#) with the low-level Python API (Python).
- The Python trainers: the reinforcement learning algorithms made with PyTorch (PPO, SAC…).
- The Gym wrapper: to encapsulate the RL environment in a gym wrapper.
- The PettingZoo wrapper: PettingZoo is the multi-agent version of the gym wrapper.
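To make the idea of "encapsulating an environment in a gym wrapper" more concrete, here is a minimal sketch in plain Python. It does not use the real ML-Agents or Gym packages: `ToyLowLevelEnv`, `GymStyleWrapper`, and all of their methods are hypothetical names invented for illustration. The only assumption is that a low-level interface works in separate stages (set an action, advance the simulation, read the result), while a gym-style interface exposes just `reset()` and `step(action)`.

```python
class ToyLowLevelEnv:
    """Hypothetical stand-in for a low-level environment interface.

    This is NOT the real ML-Agents API; it only mimics a staged interface:
    set an action, advance the simulation, then read the result.
    """

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0
        self._pending_action = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        self._pending_action = 0

    def set_action(self, action):
        # Store the action; nothing happens until advance() is called.
        self._pending_action = action

    def advance(self):
        # Apply the stored action and move the simulation forward one step.
        self.position += self._pending_action
        self.steps += 1

    def read_state(self):
        # Reward of 1.0 for moving right (action == 1), otherwise 0.0.
        reward = 1.0 if self._pending_action == 1 else 0.0
        done = self.steps >= self.max_steps
        return self.position, reward, done


class GymStyleWrapper:
    """Hypothetical wrapper exposing a gym-like reset()/step() interface."""

    def __init__(self, low_level_env):
        self._env = low_level_env

    def reset(self):
        self._env.reset()
        observation, _, _ = self._env.read_state()
        return observation

    def step(self, action):
        # One gym-style step bundles the three low-level stages.
        self._env.set_action(action)
        self._env.advance()
        observation, reward, done = self._env.read_state()
        return observation, reward, done, {}


env = GymStyleWrapper(ToyLowLevelEnv(max_steps=3))
obs = env.reset()
total_reward = 0.0
done = False
while not done:
    obs, reward, done, info = env.step(1)  # always move right
    total_reward += reward
```

The point of the wrapper is that the training code at the bottom only needs to know about `reset()` and `step()`, so any algorithm written against that interface could be reused with a different underlying environment.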
Inside the Learning Environment
Inside the Learning Environment, we have two important elements:
- The first is the Agent component, the actor of the scene. We’ll train the agent by optimizing its policy (which tells us what action to take in each state). The policy is called the Brain.
- Finally, there is the Academy. This component orchestrates agents and their decision-making processes. Think of this Academy as a teacher who handles Python API requests.
To better understand the Academy’s role, let’s recall the RL process. It can be modeled as a loop that works like this:
- Our Agent receives state \(S_0\) from the Environment: we receive the first frame of our game (Environment).
- Based on that state \(S_0\), the Agent takes action \(A_0\): our Agent will move to the right.
- The Environment goes to a new state \(S_1\): a new frame.
- The Environment gives some reward \(R_1\) to the Agent: we’re not dead (Positive Reward +1).
This RL loop outputs a sequence of states, actions, rewards, and next states. The goal of the agent is to maximize the expected cumulative reward.
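The loop above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: `ToyEnvironment`, `run_episode`, and `always_right` are hypothetical names, the environment is a made-up one-dimensional corridor, and the reward is +1 for every step the agent survives, as in the example above.

```python
class ToyEnvironment:
    """Hypothetical 1-D corridor: start at position 0, episode ends at GOAL."""

    GOAL = 3

    def reset(self):
        self.position = 0
        return self.position  # initial state S_0

    def step(self, action):
        # action is -1 (left) or +1 (right); the agent cannot go below 0.
        self.position = max(0, self.position + action)
        reward = 1.0  # we're not dead: positive reward +1
        done = self.position >= self.GOAL
        return self.position, reward, done


def run_episode(env, policy, max_steps=20):
    """Run one episode and record (state, action, reward, next_state) tuples."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                        # Agent takes action A_t
        next_state, reward, done = env.step(action)   # Environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


def always_right(state):
    """A fixed, untrained policy: move right whatever the state is."""
    return 1


trajectory = run_episode(ToyEnvironment(), always_right)
cumulative_reward = sum(reward for _, _, reward, _ in trajectory)
print(trajectory)
print(cumulative_reward)
```

Here the policy is hard-coded, so nothing is learned. Training would consist of adjusting the policy so that the cumulative reward collected over such trajectories increases.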
The Academy is the component that sends the following orders to our Agents and ensures that the agents stay in sync:
- Collect Observations
- Select your action using your policy
- Take the Action
- Reset if you reached the max step or if you’re done.
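The four orders above can be illustrated with a small, self-contained sketch. It is not the Unity implementation (which lives in C# inside the engine): `ToyAgent`, `ToyAcademy`, `environment_step`, and the `goal` and `max_step` parameters are hypothetical names chosen for this example. The assumption is that keeping agents "in sync" means every agent finishes one phase before any agent starts the next.

```python
class ToyAgent:
    """Hypothetical agent that walks right along a line until it is done."""

    def __init__(self, name, goal, max_step):
        self.name = name
        self.goal = goal            # position at which the episode is done
        self.max_step = max_step    # step budget per episode
        self.position = 0
        self.step_count = 0
        self.episodes_completed = 0

    def collect_observations(self):
        return self.position

    def select_action(self, observation):
        # Fixed policy for illustration: always move right.
        return 1

    def take_action(self, action):
        self.position += action
        self.step_count += 1

    def is_done(self):
        return self.position >= self.goal

    def reset(self):
        self.position = 0
        self.step_count = 0
        self.episodes_completed += 1


class ToyAcademy:
    """Hypothetical orchestrator that drives all agents phase by phase."""

    def __init__(self, agents):
        self.agents = agents

    def environment_step(self):
        # 1. Collect observations from every agent.
        observations = [agent.collect_observations() for agent in self.agents]
        # 2. Each agent selects an action using its policy.
        actions = [
            agent.select_action(obs)
            for agent, obs in zip(self.agents, observations)
        ]
        # 3. Every agent takes its action.
        for agent, action in zip(self.agents, actions):
            agent.take_action(action)
        # 4. Reset agents that are done or have reached their max step.
        for agent in self.agents:
            if agent.is_done() or agent.step_count >= agent.max_step:
                agent.reset()


agent_a = ToyAgent("A", goal=2, max_step=5)   # reaches its goal every 2 steps
agent_b = ToyAgent("B", goal=10, max_step=3)  # runs out of steps every 3 steps
academy = ToyAcademy([agent_a, agent_b])
for _ in range(6):
    academy.environment_step()
print(agent_a.episodes_completed, agent_b.episodes_completed)
```

Agent A is reset because it is done, while agent B is reset because it hits its step limit, which covers both reset conditions from the list.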