Analyzing Deepseek's "undefined" NVIDIA PTX optimizations (with benchmarks!)



Two days ago, DeepSeek surprised everyone with an "undefined-behavior" PTX optimization that speeds up particular ML workloads in a kernel for NVIDIA Hopper GPUs.

Let's reverse engineer the hack, implement it ourselves, and benchmark the speedup on an H100.

--

Link to my test code:
https://github.com/LaurieWired/BenchmarkCustomPTX

--

Timestamps

00:00 CUDA vs PTX vs SASS
02:12 Global Memory Target
03:27 Custom PTX Walkthrough
06:40 NVIDIA ISA Reference
07:42 Example Implementation
10:38 H100 Benchmark
11:46 SASS (Machine) Code

---

Follow LaurieWired on Social Media:
►https://linktr.ee/lauriewired

---
