Proximal Policy Optimization Implementation: 8 Details for Continuous Actions (3/3)


Proximal Policy Optimization (PPO) is one of the most popular reinforcement learning algorithms, and it works across a variety of domains, from robotics control to Atari games to chip design. In this video, we dive deep into 8 implementation details for continuous action spaces, building on the PPO implementation from our first video (https://youtu.be/MEt6rrxH8W4).

---

Source code: https://github.com/vwxyzjn/ppo-implementation-details
Related blog post: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Background music: Flutes Will Chill — https://artlist.io/song/48722/flutes-will-chill
Homework solution: https://wandb.ai/cleanrl/cleanrl.benchmark/runs/34pstq7f/code?workspace=user-costa-huang

---

0:00 Introduction
0:41 Setup
1:30 1. Continuous actions via normal distributions
2:46 2. State-independent log standard deviation
3:50 3. Independent action components
4:37 Note on the MultiDiscrete action space
5:36 Match hyperparameters
6:14 Environment preprocessing
6:33 4. Actions clipped to the valid range
7:02 5. Observation normalization
7:54 6. Observation clipping
8:10 7. Reward normalization
9:00 8. Reward clipping
9:29 Experiment results
10:49 Related work
11:10 Summary of code changes
11:58 Homework
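
For reference, here is a minimal sketch of how the eight details listed in the chapters above typically fit together in a PyTorch/gym setup. This is not the video's source code (see the repository link above for the actual implementation); the environment id, network sizes, and wrapper availability (gym >= 0.23, where ClipAction, NormalizeObservation, NormalizeReward, TransformObservation, and TransformReward are provided) are assumptions made for illustration.

```python
# Illustrative sketch only; assumes gym >= 0.23 and PyTorch. Env id and layer sizes are placeholders.
import gym
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Normal


def make_env(env_id="HalfCheetah-v2", gamma=0.99):
    env = gym.make(env_id)
    env = gym.wrappers.ClipAction(env)                # 4. clip actions to the valid range
    env = gym.wrappers.NormalizeObservation(env)      # 5. running mean/std observation normalization
    env = gym.wrappers.TransformObservation(env, lambda obs: np.clip(obs, -10, 10))  # 6. observation clipping
    env = gym.wrappers.NormalizeReward(env, gamma=gamma)  # 7. reward scaling by running std of returns
    env = gym.wrappers.TransformReward(env, lambda r: float(np.clip(r, -10, 10)))    # 8. reward clipping
    return env


class Agent(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.actor_mean = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # 2. state-independent log standard deviation: a learned parameter, not a network output
        self.actor_logstd = nn.Parameter(torch.zeros(1, act_dim))

    def get_action_and_log_prob(self, obs, action=None):
        # obs is assumed to be batched: shape (batch, obs_dim)
        mean = self.actor_mean(obs)
        std = self.actor_logstd.expand_as(mean).exp()
        dist = Normal(mean, std)          # 1. continuous actions via a normal distribution
        if action is None:
            action = dist.sample()
        # 3. treat action components as independent: sum log-probs over the action dimension
        return action, dist.log_prob(action).sum(-1)
```

The summed log-probability is what the usual PPO probability ratio is computed from during the policy update; everything else in the training loop can stay the same as in the discrete-action implementation from the first video.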