Proximal Policy Optimization (PPO) is one of the most popular reinforcement learning algorithms, and it works across a variety of domains, from robotic control to Atari games to chip design.
In this video, we dive deep into 8 implementation details for continuous action spaces, building on the PPO implementation from our first video (https://youtu.be/MEt6rrxH8W4). Rough code sketches of these changes are included at the end of this description.
---
Source code: https://github.com/vwxyzjn/ppo-implementation-details
Related blog post: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Background music: Flutes Will Chill — https://artlist.io/song/48722/flutes-will-chill
Homework solution: https://wandb.ai/cleanrl/cleanrl.benchmark/runs/34pstq7f/code?workspace=user-costa-huang
---
0:00 Introduction
0:41 Setup
1:30 1. Continuous actions via normal distributions
2:46 2. State-independent log standard deviation
3:50 3. Independent action components
4:37 Note on MultiDiscrete action space
5:36 Match hyperparameters
6:14 Environment preprocessing
6:33 4. Action clipped to the valid range
7:02 5. Observation normalization
7:54 6. Observation clipping
8:10 7. Reward normalization
9:00 8. Reward clipping
9:29 Experiment results
10:49 Related work
11:10 Summary of code change
11:58 Homework
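---
For reference, a minimal sketch of details 1-3 (continuous actions via a normal distribution, state-independent log standard deviation, and independent action components). The layer sizes and initialization here are simplified assumptions; see the source code above for the exact agent.

import torch
import torch.nn as nn
from torch.distributions.normal import Normal

class Agent(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.actor_mean = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # Detail 2: log std is a learned parameter, independent of the state.
        self.actor_logstd = nn.Parameter(torch.zeros(1, act_dim))

    def get_action(self, obs, action=None):
        mean = self.actor_mean(obs)                        # Detail 1: one normal distribution per action dimension
        std = torch.exp(self.actor_logstd.expand_as(mean))
        dist = Normal(mean, std)
        if action is None:
            action = dist.sample()
        # Detail 3: action components are treated as independent, so the joint
        # log probability (and entropy) is the sum over action dimensions.
        return action, dist.log_prob(action).sum(1), dist.entropy().sum(1)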
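And a sketch of the environment preprocessing (details 4-8) using gym wrappers, along the lines of the repo's ppo_continuous_action.py. Wrapper names and signatures may differ across gym/gymnasium versions, so treat this as an assumption and check the source code above for the exact setup.

import gym
import numpy as np

def make_env(env_id, gamma=0.99):
    env = gym.make(env_id)
    env = gym.wrappers.ClipAction(env)                    # 4. clip actions to the valid range
    env = gym.wrappers.NormalizeObservation(env)          # 5. running observation normalization
    env = gym.wrappers.TransformObservation(env, lambda obs: np.clip(obs, -10, 10))  # 6. observation clipping
    env = gym.wrappers.NormalizeReward(env, gamma=gamma)  # 7. reward scaling by a running estimate of return variance
    env = gym.wrappers.TransformReward(env, lambda r: np.clip(r, -10, 10))           # 8. reward clipping
    return env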