If you need help with anything quantization- or ML-related (e.g. debugging code), feel free to book a 30-minute consultation session! https://calendly.com/oscar-savolainen
I'm also available for long-term freelance work, e.g. for training / productionizing models, teaching AI concepts, etc.
*Video Summary:*
In this video, we go over the theory of what happens when we quantize a floating-point tensor.
*Timestamps:*
00:00 Intro
01:12 How neural networks run on hardware
01:57 How do quantized neural networks run on hardware
03:42 Fake quantization vs Conversion
05:27 Fake quantization (what are quantization parameters?)
12:29 Affine vs symmetric quantization
15:17 How do we determine quantization parameters?
18:52 Quantization granularity (per-channel vs per-tensor quantization)
21:46 Conclusion
*Links:*
NVIDIA white paper: https://arxiv.org/abs/2004.09602
Qualcomm white paper: https://arxiv.org/abs/2106.08295
Qualcomm SDK docs specifying the constraint on the zero-point: https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/quantization.html (they specify zero must be exactly representable).
*Correction:*
In the video I say that dynamic quantization (where the activation quantization parameters are inferred at runtime) is generally not feasible for runtime. That is incorrect as a blanket statement: it holds mainly for hardware-constrained edge devices like the ones I am used to. For running LLMs on GPUs, however, dynamic quantization is actually the standard, since in that memory-bound environment people mainly care about reducing the size of the weight tensors.
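To make the correction concrete, here is a minimal sketch (in plain Python, names are my own) of what dynamic quantization of an activation tensor looks like: the affine quantization parameters (scale and zero-point) are computed from the tensor's observed min/max at runtime rather than calibrated ahead of time. The range is stretched to include zero so that the zero-point is exactly representable, per the Qualcomm constraint linked above.

```python
# Hypothetical sketch of dynamic (runtime) affine int8 quantization.
# Scale and zero-point are derived from the tensor's min/max at inference
# time, instead of being fixed in advance by a calibration step.

def dynamic_quantize_int8(x):
    """Quantize a list of floats to int8 using runtime min/max."""
    qmin, qmax = -128, 127
    # Stretch the observed range to include 0.0 so the zero-point is
    # exactly representable (the constraint from the Qualcomm docs).
    x_min = min(min(x), 0.0)
    x_max = max(max(x), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero tensor
    zero_point = round(qmin - x_min / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in x]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 values back to (approximate) floats."""
    return [(v - zero_point) * scale for v in q]
```

The round-trip error per element is at most about half a scale step, which is why a wide dynamic range (large scale) means coarser quantization.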
One can also do weight-only quantization, where the activations are not quantized at all. This is typically done when one mainly cares about the size of the model and is happy to have the activations run in floating point.
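A minimal sketch of the weight-only approach (again plain Python, names are my own): the weights are stored as int8 with a symmetric per-tensor scale, and at runtime they are dequantized back to float before the matmul, so the activations stay in floating point throughout.

```python
# Hypothetical sketch of weight-only quantization: int8 weights with a
# symmetric per-tensor scale, floating-point activations.

def quantize_weights_symmetric(w):
    """Symmetric per-tensor int8 quantization of a 2-D weight list
    (shape: out_features x in_features)."""
    max_abs = max(abs(v) for row in w for v in row)
    scale = max_abs / 127 if max_abs else 1.0  # symmetric: zero-point is 0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in w]
    return q, scale

def linear_float_activations(x, q_w, scale):
    """Linear layer: float activations times dequantized int8 weights."""
    return [sum(xi * (qv * scale) for xi, qv in zip(x, row)) for row in q_w]
```

Only the storage format of the weights changes; the compute itself still happens in floating point, which is exactly the trade-off described above.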