From Scratch: Shared Memory Atomics and Dynamic Allocation in CUDA
In this video, we write a histogram kernel from scratch that performs atomic operations on dynamically allocated shared memory!
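As a rough idea of what such a kernel can look like, here is a minimal sketch. It is not the code from the video. The names `histogram` and `run_histogram`, the `value % num_bins` binning rule, and the launch configuration are all assumptions made for illustration, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (illustrative only, not the video's code).
// Each block builds a private histogram in shared memory, then merges it
// into the global result.
__global__ void histogram(const unsigned char *input, int *bins, int n,
                          int num_bins) {
  // Dynamically allocated shared memory: the size in bytes is supplied as
  // the third kernel launch parameter rather than at compile time.
  extern __shared__ int s_bins[];

  // Zero this block's private bins.
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) s_bins[i] = 0;
  __syncthreads();

  // Grid-stride loop; atomicAdd on shared memory resolves collisions
  // between threads of the same block.
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = tid; i < n; i += stride) {
    atomicAdd(&s_bins[input[i] % num_bins], 1);  // assumed binning rule
  }
  __syncthreads();

  // Merge the block's partial counts into global memory.
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) {
    atomicAdd(&bins[i], s_bins[i]);
  }
}

// Hypothetical host helper: copies the input to the GPU, launches the
// kernel, and returns the bin counts.
std::vector<int> run_histogram(const std::vector<unsigned char> &h_input,
                               int num_bins) {
  std::vector<int> h_bins(num_bins, 0);
  int n = static_cast<int>(h_input.size());
  if (n == 0 || num_bins <= 0) return h_bins;

  unsigned char *d_input = nullptr;
  int *d_bins = nullptr;
  cudaMalloc(&d_input, n * sizeof(unsigned char));
  cudaMalloc(&d_bins, num_bins * sizeof(int));
  cudaMemcpy(d_input, h_input.data(), n * sizeof(unsigned char),
             cudaMemcpyHostToDevice);
  cudaMemset(d_bins, 0, num_bins * sizeof(int));

  int threads = 256;                         // assumed block size
  int blocks = (n + threads - 1) / threads;  // one thread per element
  size_t shmem_bytes = num_bins * sizeof(int);
  histogram<<<blocks, threads, shmem_bytes>>>(d_input, d_bins, n, num_bins);

  // This blocking copy also waits for the kernel to finish.
  cudaMemcpy(h_bins.data(), d_bins, num_bins * sizeof(int),
             cudaMemcpyDeviceToHost);
  cudaFree(d_input);
  cudaFree(d_bins);
  return h_bins;
}

int main() {
  const int n = 1 << 20;
  const int num_bins = 8;
  std::vector<unsigned char> input(n);
  for (int i = 0; i < n; i++) input[i] = static_cast<unsigned char>(i % num_bins);

  std::vector<int> bins = run_histogram(input, num_bins);
  for (int i = 0; i < num_bins; i++) printf("bin %d: %d\n", i, bins[i]);
  return 0;
}
```

Because the shared array is declared `extern __shared__` with no size, the number of bins can be chosen at run time. Accumulating in shared memory first means most atomic traffic stays on-chip, and each block issues only `num_bins` global atomics when it merges.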
For code samples: http://github.com/coffeebeforearch
For live content: http://twitch.tv/CoffeeBeforeArch