Two days ago, DeepSeek surprised everyone with an "undefined-behavior" PTX optimization that speeds up particular ML workloads in kernels running on NVIDIA Hopper GPUs.
Let's reverse engineer the hack, implement it ourselves, and benchmark the speedup on an H100.
---
Link to my test code:
https://github.com/LaurieWired/BenchmarkCustomPTX
---
Timestamps
00:00 CUDA vs PTX vs SASS
02:12 Global Memory Target
03:27 Custom PTX Walkthrough
06:40 NVIDIA ISA Reference
07:42 Example Implementation
10:38 H100 Benchmark
11:46 SASS (Machine) Code
---
Follow LaurieWired on Social Media:
►https://linktr.ee/lauriewired
---