Analyzing Deepseek's "undefined" NVIDIA PTX optimizations (with benchmarks!)

106,155 plays
Two days ago, Deepseek surprised everyone with an "undefined-behavior" PTX optimization that speeds up particular ML workloads in a kernel on NVIDIA Hopper GPUs. Let's reverse-engineer the hack, implement it ourselves, and benchmark the speedup on an H100.

Link to my test code: https://github.com/LaurieWired/BenchmarkCustomPTX

Timestamps
00:00 CUDA vs PTX vs SASS
02:12 Global Memory Target
03:27 Custom PTX Walkthrough
06:40 NVIDIA ISA Reference
07:42 Example Implementation
10:38 H100 Benchmark
11:46 SASS (Machine) Code

Follow LaurieWired on Social Media:
► https://linktr.ee/lauriewired