CUDA Crash Course: Cache Tiled Matrix Multiplication
In this video we go over matrix multiplication in CUDA using cache tiling, where tiles of the input matrices are staged in shared memory!
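
To give a rough idea of what cache tiling with shared memory looks like, here is a minimal sketch. It is not necessarily the code from the video: the names tiledMatMul, matMulOnGpu and TILE are made up for illustration, and it assumes square N x N row-major int matrices where N is a multiple of the tile size. Error checking on the CUDA calls is left out to keep it short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Tile edge length, also used as the thread-block edge length.
// (16 is an assumed value for this sketch, not a tuned one.)
constexpr int TILE = 16;

// Computes c = a * b for square n x n row-major matrices.
// Assumes n is a multiple of TILE and the block size is TILE x TILE.
__global__ void tiledMatMul(const int *a, const int *b, int *c, int n) {
  // One tile of A and one tile of B, shared by all threads in the block.
  __shared__ int tileA[TILE][TILE];
  __shared__ int tileB[TILE][TILE];

  // Global row and column of the output element this thread computes.
  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;

  int acc = 0;
  // Sweep the tiles across the shared dimension.
  for (int t = 0; t < n; t += TILE) {
    // Each thread loads one element of each tile from global memory.
    tileA[threadIdx.y][threadIdx.x] = a[row * n + (t + threadIdx.x)];
    tileB[threadIdx.y][threadIdx.x] = b[(t + threadIdx.y) * n + col];

    // Wait until both tiles are fully loaded before reading them.
    __syncthreads();

    // Partial dot product using only shared memory.
    for (int k = 0; k < TILE; ++k) {
      acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
    }

    // Wait until every thread is done with the tiles before overwriting them.
    __syncthreads();
  }

  c[row * n + col] = acc;
}

// Hypothetical host helper: copies inputs to the GPU, launches the kernel,
// and returns the result. Assumes n is a multiple of TILE.
std::vector<int> matMulOnGpu(const std::vector<int> &a,
                             const std::vector<int> &b, int n) {
  size_t bytes = static_cast<size_t>(n) * n * sizeof(int);
  int *dA = nullptr, *dB = nullptr, *dC = nullptr;
  cudaMalloc(&dA, bytes);
  cudaMalloc(&dB, bytes);
  cudaMalloc(&dC, bytes);

  cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

  dim3 threads(TILE, TILE);
  dim3 blocks(n / TILE, n / TILE);
  tiledMatMul<<<blocks, threads>>>(dA, dB, dC, n);

  // This blocking copy also waits for the kernel to finish.
  std::vector<int> c(static_cast<size_t>(n) * n);
  cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);

  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
  return c;
}
```

The point of the tiles is reuse: each element loaded into shared memory is read TILE times by threads in the block, so the kernel makes roughly TILE times fewer global memory reads than a version where every thread reads its full row and column directly.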
For code samples: http://github.com/coffeebeforearch
For live content: http://twitch.tv/CoffeeBeforeArch