If you're looking to get started with Reinforcement Learning, OpenAI Gym is undeniably the most popular choice for implementing environments to train your agents. A wide range of environments that are used as benchmarks for proving the efficacy of any new research methodology is implemented in OpenAI Gym, out-of-the-box. Furthermore, OpenAI Gym provides an easy API to implement your own environments.
In this article, I will introduce the basic building blocks of OpenAI Gym. Here is a list of things I have covered in this article.
Prerequisites
- Python: Beginner's Python is required to follow along
- OpenAI Gym: Access to the OpenAI Gym environment and packages
Topics Covered
- Installation
- Environments
- Spaces
- Wrappers
- Vectorized Environments
So let’s get started.
Installation
The first thing we do is make sure we have the latest version of gym installed.
One can either use conda or pip to install gym. In our case, we'll use pip.
```
pip install -U gym
```
Environments
The basic building block of OpenAI Gym is the Env class. It is a Python class that essentially implements a simulator running the environment you want to train your agent in. OpenAI Gym comes packed with a lot of environments, such as one where you can move a car up a hill, balance a swinging pendulum, score well on Atari games, etc. Gym also provides you with the ability to create custom environments.
We start with an environment called MountainCar, where the objective is to drive a car up a mountain. The car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.
The goal of the Mountain Car environment is to gain momentum and reach the flag.
```python
import gym
env = gym.make('MountainCar-v0')
```
The basic structure of the environment is described by the observation_space and action_space attributes of the Gym Env class.
The observation_space defines the structure as well as the legitimate values for the observation of the state of the environment. The observation can be different things for different environments. The most common form is a screenshot of the game. There can be other forms of observations as well, such as certain characteristics of the environment described in vector form.
Similarly, the Env class also defines an attribute called action_space, which describes the numerical structure of the legitimate actions that can be applied to the environment.
```python
obs_space = env.observation_space
action_space = env.action_space
print("The observation space: {}".format(obs_space))
print("The action space: {}".format(action_space))

OUTPUT:
The observation space: Box(2,)
The action space: Discrete(3)
```
The observation for the Mountain Car environment is a vector of two numbers representing velocity and position. The midpoint between the two mountains is taken to be the origin, with right being the positive direction and left being the negative direction.
We see that both the observation space and the action space are represented by classes called Box and Discrete, respectively. These are among the various data structures provided by gym to implement observation and action spaces for different kinds of scenarios (discrete action space, continuous action space, etc.). We will dig further into these later in the article.
Interacting with the Environment
In this section, we cover functions of the Env class that help the agent interact with the environment. Two such important functions are:
- reset: This function resets the environment to its initial state, and returns the observation of the environment corresponding to the initial state.
- step: This function takes an action as input and applies it to the environment, which causes the environment to transition to a new state. The step function returns four things:
- observation: The observation of the state of the environment.
- reward: The reward that you get from the environment after executing the action that was given as input to the step function.
- done: Whether the episode has terminated. If true, you may need to end the simulation or reset the environment to restart the episode.
- info: Additional information depending on the environment, such as the number of lives left, or general information that may be useful for debugging.
Let us now see an example that illustrates the concepts discussed above. We first begin by resetting the environment, then we inspect an observation. We then apply an action and inspect the new observation.
```python
import matplotlib.pyplot as plt

# Reset the environment and see the initial observation
obs = env.reset()
print("The initial observation is {}".format(obs))

# Sample a random action from the action space
random_action = env.action_space.sample()

# Take the action and get the new observation
new_obs, reward, done, info = env.step(random_action)
print("The new observation is {}".format(new_obs))

OUTPUT:
The initial observation is [-0.48235664  0.]
The new observation is [-0.48366517 -0.00130853]
```
In this case, our observation is not a screenshot of the task being performed. In many other environments (like Atari, as we will see), the observation is a screenshot of the game. In either scenario, if you want to see how the environment looks in its current state, you can use the render method.
```python
env.render(mode="human")
```
This should display the environment in its current state in a pop-up window. You can close the window using the close function.
```python
env.close()
```
If you want to see a screenshot of the game as an image, rather than as a pop-up window, you should set the mode argument of the render function to rgb_array.
```python
env_screen = env.render(mode='rgb_array')
env.close()

import matplotlib.pyplot as plt
plt.imshow(env_screen)
```
OUTPUT
Collecting all the little blocks of code we have covered so far, the typical code for running your agent inside the MountainCar environment would look like the following. In our case we just take random actions, but you can have an agent that does something more intelligent based on the observation you get.
```python
import time

num_steps = 1500

obs = env.reset()

for step in range(num_steps):
    # Take a random action; an agent could instead compute one from obs
    action = env.action_space.sample()

    # Apply the action
    obs, reward, done, info = env.step(action)

    # Render the environment
    env.render()

    # Wait a bit before the next frame
    time.sleep(0.001)

    # If the episode is done, reset to start a new episode
    if done:
        env.reset()

env.close()
```
Spaces
The observation_space for our environment was Box(2,), and the action_space was Discrete(3). What do these actually mean? Both Box and Discrete are types of data structures called "Spaces" provided by Gym to describe the legitimate values for the observations and actions of environments.
All of these data structures are derived from the gym.Space base class.
```python
type(env.observation_space)
```
Box(n,) corresponds to an n-dimensional continuous space. In our case n=2, thus the observation space of our environment is a 2-D space. Of course, the space is bounded by upper and lower limits which describe the legitimate values our observations can take. We can determine these using the high and low attributes of the observation space. They correspond to the maximum and minimum positions/velocities in our environment, respectively.
print("Upper Bound for Env Observation", env.observation_space.high) print("Lower Bound for Env Observation", env.observation_space.low) OUTPUT: Upper Bound for Env Observation [0.6 0.07] Lower Bound for Env Observation [-1.2 -0.07]You tin group these upper/lower limits while defining your space, arsenic good arsenic erstwhile you are creating an environment.
Discrete(n) describes a discrete space with possible values [0.....n-1]. In our case n = 3, meaning our actions can take a value of either 0, 1, or 2. Unlike Box, Discrete does not have high and low attributes, since, by the very definition, it is clear what values are allowed.
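As a quick illustrative check (not part of the original snippet), you can query a Discrete space's size and test whether a value is legal using its contains method:
```python
from gym import spaces

action_space = spaces.Discrete(3)
print(action_space.n)            # 3
print(action_space.contains(2))  # True: 2 is a valid action
print(action_space.contains(4))  # False: 4 is outside [0, 2]
```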
If you try to pass an invalid value to the step function of our environment (in our case, say, 4), it will lead to an error.
```python
# Works
env.step(2)
print("It works!")

# Doesn't work
env.step(4)
print("It works!")
```
OUTPUT
There are multiple other spaces available for various use cases, such as MultiDiscrete, which allows you to use more than one discrete variable for your observation and action space.
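As a small illustration (the numbers here are arbitrary and not tied to any particular environment), a MultiDiscrete space bundles several discrete variables into one action or observation:
```python
from gym import spaces

# A hypothetical space with two independent discrete variables:
# the first takes values 0-4, the second takes values 0-1
multi_space = spaces.MultiDiscrete([5, 2])

# Samples are arrays with one entry per variable, e.g. [3 0]
print(multi_space.sample())
```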
Wrappers
The Wrapper class in OpenAI Gym provides you with the functionality to modify various parts of an environment to suit your needs. Why might such a need arise? Maybe you want to normalize your pixel input, or maybe you want to clip your rewards. While you could typically accomplish the same by writing another class that sub-classes your environment's Env class, the Wrapper class allows us to do it more systematically.
But before we begin, let's move to a more complex environment that will really help us appreciate the utility that Wrapper brings to the table. This complex environment is going to be the Atari game Breakout.
Before we begin, we install the Atari components of gym.
```
!pip install --upgrade pip setuptools wheel
!pip install opencv-python
!pip install gym[atari]
```
If you get an error to the tune of AttributeError: module 'enum' has no attribute 'IntFlag', you might need to uninstall the enum34 package, and then re-attempt the install.
```
pip uninstall -y enum34
```
Gameplay of Atari Breakout
Let's now run the environment with random actions.
env = gym.make("BreakoutNoFrameskip-v4") print("Observation Space: ", env.observation_space) print("Action Space ", env.action_space) obs = env.reset() for one in range(1000): action = env.action_space.sample() obs, reward, done, info = env.step(action) env.render() time.sleep(0.01) env.close() OUTPUT: Observation Space: Box(210, 160, 3) Action Space Discrete(4)Our study abstraction is simply a continuous abstraction of dimensions (210, 160, 3) corresponding to an RGB pixel study of the aforesaid size. Our action abstraction contains 4 discrete actions (Left, Right, Do Nothing, Fire)
Now that we have our environment loaded, let us suppose we have to make certain changes to the Atari environment. It's a common practice in Deep RL to construct the observation by concatenating the past k frames together. We have to modify the Breakout environment such that both our reset and step functions return concatenated observations.
For this we define a class of type gym.Wrapper to override the reset and step functions of the Breakout Env. The Wrapper class, as the name suggests, is a wrapper on top of an Env class that modifies some of its attributes and functions.
The __init__ function is defined with the Env class for which the wrapper is written, and the number of past frames to be concatenated. Note that we also need to redefine the observation space since we are now using concatenated frames as our observations. (We modify the observation space from (210, 160, 3) to (num_past_frames, 210, 160, 3).)
In the reset function, while we are initializing the environment, since we don't have any previous observations to concatenate, we simply repeat the initial observation.
```python
from collections import deque
from gym import spaces
import numpy as np

class ConcatObs(gym.Wrapper):
    def __init__(self, env, k):
        gym.Wrapper.__init__(self, env)
        self.k = k
        self.frames = deque([], maxlen=k)
        shp = env.observation_space.shape
        self.observation_space = \
            spaces.Box(low=0, high=255, shape=((k,) + shp),
                       dtype=env.observation_space.dtype)

    def reset(self):
        ob = self.env.reset()
        for _ in range(self.k):
            self.frames.append(ob)
        return self._get_ob()

    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.frames.append(ob)
        return self._get_ob(), reward, done, info

    def _get_ob(self):
        return np.array(self.frames)
```
Now, to effectively get our modified environment, we wrap our environment Env in the wrapper we just created.
env = gym.make("BreakoutNoFrameskip-v4") wrapped_env = ConcatObs(env, 4) print("The caller study abstraction is", wrapped_env.observation_space) OUTPUT: The caller study abstraction is Box(4, 210, 160, 3)Let america now verify whether the observations are so concatenated aliases not.
```python
# Reset the env
obs = wrapped_env.reset()
print("Initial obs is of the shape", obs.shape)

# Take one step
obs, _, _, _ = wrapped_env.step(2)
print("Obs after taking a step is", obs.shape)

OUTPUT:
Initial obs is of the shape (4, 210, 160, 3)
Obs after taking a step is (4, 210, 160, 3)
```
There is more to Wrappers than the vanilla Wrapper class. Gym also provides you with specific wrappers that target particular elements of the environment, such as observations, rewards, and actions. Their use is demonstrated in the following section.
- ObservationWrapper: This helps us make changes to the observation using the observation method of the wrapper class.
- RewardWrapper: This helps us make changes to the reward using the reward method of the wrapper class.
- ActionWrapper: This helps us make changes to the action using the action method of the wrapper class.
Let us suppose that we have to make the following changes to our environment (a sketch of wrapper classes implementing them follows this list):
- We have to normalize the pixel observations by 255.
- We have to clip the rewards between 0 and 1.
- We have to prevent the slider from moving to the left (action 3).
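Here is a minimal sketch of how such wrappers might be defined by sub-classing gym.ObservationWrapper, gym.RewardWrapper, and gym.ActionWrapper; the exact bodies below are an assumption based on the three requirements above, not code confirmed by this excerpt.
```python
import numpy as np
import gym

class ObservationWrapper(gym.ObservationWrapper):
    def observation(self, obs):
        # Normalize pixel observations to the range [0, 1]
        return obs / 255.0

class RewardWrapper(gym.RewardWrapper):
    def reward(self, reward):
        # Clip rewards to the range [0, 1]
        return np.clip(reward, 0, 1)

class ActionWrapper(gym.ActionWrapper):
    def action(self, action):
        # Replace the "move left" action (3) with "do nothing" (0)
        if action == 3:
            return 0
        return action
```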
Now we apply all of these wrappers to our environment in a single line of code to get a modified environment. Then, we verify that all of our intended changes have been applied to the environment.
env = gym.make("BreakoutNoFrameskip-v4") wrapped_env = ObservationWrapper(RewardWrapper(ActionWrapper(env))) obs = wrapped_env.reset() for measurement in range(500): action = wrapped_env.action_space.sample() obs, reward, done, info = wrapped_env.step(action) if (obs > 1.0).any() or (obs < 0.0).any(): print("Max and min worth of observations retired of range") if reward < 0.0 or reward > 1.0: assert False, "Reward retired of bounds" wrapped_env.render() time.sleep(0.001) wrapped_env.close() print("All checks passed") OUTPUT: All checks passedIn lawsuit you want to retrieve the original Env aft applying wrappers to it, you tin usage the unwrapped property of an Env class. While the Wrapper people whitethorn look for illustration conscionable immoderate different people that sub-classes from Env, it does support a database of wrappers applied to the guidelines Env.
print("Wrapped Env:", wrapped_env) print("Unwrapped Env", wrapped_env.unwrapped) print("Getting the meaning of actions", wrapped_env.unwrapped.get_action_meanings()) OUTPUT: Wrapped Env :<ObservationWrapper<RewardWrapper<ActionWrapper<TimeLimit<AtariEnv<BreakoutNoFrameskip-v4>>>>>> Unwrapped Env <AtariEnv<BreakoutNoFrameskip-v4>> Getting the mean of actions ['NOOP', 'FIRE', 'RIGHT', 'LEFT']Vectorized Environments
A lot of Deep RL algorithms (like Asynchronous Actor Critic methods) use parallel threads, where each thread runs an instance of the environment, to both speed up the training process and improve efficiency.
Now we will use another library, also by OpenAI, called baselines. This library provides performant implementations of many standard Deep RL algorithms to compare any new algorithm against. In addition to these implementations, baselines also provides many other features that enable us to prepare our environments in accordance with the way they were used in OpenAI's experiments.
One of these features is a set of wrappers which allow you to run multiple environments in parallel using a single function call. Before we begin, we first proceed with the installation of baselines by running the following commands in a terminal.
```
!git clone https://github.com/openai/baselines
!cd baselines
!pip install .
```
You may need to restart your Jupyter notebook for the installed package to be available.
The wrapper of interest here is called SubprocVecEnv, which runs all the environments in an asynchronous manner. We first create a list of function calls that return the environment we are running. In code, I have used a lambda function to create an anonymous function that returns the gym environment.
```python
import gym
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv

# List of functions that each create an environment
num_envs = 3
envs = [lambda: gym.make("BreakoutNoFrameskip-v4") for i in range(num_envs)]

# Vectorized environment
envs = SubprocVecEnv(envs)
```
This envs now acts as a single environment, where we can call the reset and step functions. However, these functions now return an array of observations/actions, rather than a single observation/action.
```python
init_obs = envs.reset()

# We get a list of observations corresponding to the parallel environments
print("Number of Envs:", len(init_obs))

# Check out one of the observations
one_obs = init_obs[0]
print("Shape of one Env:", one_obs.shape)

# Prepare a list of actions and apply them to the environments
actions = [0, 1, 2]
obs = envs.step(actions)

OUTPUT:
Number of Envs: 3
Shape of one Env: (210, 160, 3)
```
Calling the render function on the vectorized envs displays screenshots of the games in a tiled fashion.
```python
import time

# List of environment constructors
num_envs = 3
envs = [lambda: gym.make("BreakoutNoFrameskip-v4") for i in range(num_envs)]

# Vectorized environment
envs = SubprocVecEnv(envs)

init_obs = envs.reset()

for i in range(1000):
    actions = [envs.action_space.sample() for i in range(num_envs)]
    envs.step(actions)
    envs.render()
    time.sleep(0.001)

envs.close()
```
The following screen plays out.
Render of the SubprocVecEnv environments.
You can find out more about vectorized environments here.
Conclusion
That's it for Part 1. Given the things we have covered in this part, you should be able to start training your reinforcement learning agents in environments available from OpenAI Gym. But what if the environment you want to train your agent in is not available anywhere? If that's the case, you are in luck for a couple of reasons!
Firstly, OpenAI Gym offers you the flexibility to implement your own custom environments. Second, doing exactly that is what Part 2 of this series is going to be about. Till then, enjoy exploring the enterprising world of reinforcement learning using OpenAI Gym!
Further Reading
- Open AI Gym Documentation
- Baselines GitHub Page