Build an LLM from Scratch 2: Working with text data


20,754 views
Links to the book:

- https://amzn.to/4fqvn0D (Amazon)
- https://mng.bz/M96o (Manning)

Link to the GitHub repository: https://github.com/rasbt/LLMs-from-scratch

This is a supplementary video covering the text data preparation steps (tokenization, byte pair encoding, data loaders, etc.) for LLM training.

00:00 2.2 Tokenizing text
14:02 2.3 Converting tokens into token IDs
23:56 2.4 Adding special context tokens
30:26 2.5 Byte pair encoding
44:00 2.6 Data sampling with a sliding window
1:07:10 2.7 Creating token embeddings
1:15:45 2.8 Encoding word positions

You can find additional bonus materials on GitHub:

- Byte Pair Encoding (BPE) Tokenizer From Scratch: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb
- Comparing Various Byte Pair Encoding (BPE) Implementations: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb
- Understanding the Difference Between Embedding Layers and Linear Layers: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/03_bonus_embedding-vs-matmul/embeddings-and-linear-layers.ipynb
- Data Sampling With a Sliding Window With Number Data: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/04_bonus_dataloader-intuition/dataloader-intuition.ipynb
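To give a flavor of the first two steps (2.2 tokenizing text, 2.3 converting tokens into token IDs), here is a minimal, dependency-free sketch. The regex split and the sample sentence are simplified illustrations, not the book's exact code:

```python
import re

text = "Hello, world. Is this-- a test?"

# 2.2: split on punctuation and whitespace, keeping punctuation as tokens
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t for t in tokens if t.strip()]
# tokens -> ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

# 2.3: build a vocabulary mapping each unique token to an integer ID
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]

# The inverse vocabulary turns IDs back into tokens
inv_vocab = {i: tok for tok, i in vocab.items()}
decoded = [inv_vocab[i] for i in ids]
```

A real tokenizer (like the BPE tokenizer in section 2.5) works on subword units instead of whole words, which avoids unknown-word problems.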
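Section 2.6's sliding-window sampling pairs each chunk of token IDs with the same chunk shifted one position to the right, so the target at every step is the next token. A dependency-free sketch (the token IDs and parameter values here are arbitrary):

```python
def sliding_windows(token_ids, context_length, stride):
    """Yield (inputs, targets) pairs; targets are inputs shifted by one."""
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        yield inputs, targets

token_ids = list(range(10))  # stand-in for an encoded text
pairs = list(sliding_windows(token_ids, context_length=4, stride=4))
# pairs[0] -> ([0, 1, 2, 3], [1, 2, 3, 4])
# pairs[1] -> ([4, 5, 6, 7], [5, 6, 7, 8])
```

With stride equal to the context length the windows don't overlap; a smaller stride produces overlapping windows and more training pairs from the same text.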
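Sections 2.7 and 2.8 can also be previewed in miniature: a token embedding layer is essentially a lookup table of learnable vectors indexed by token ID, and adding a per-position vector encodes word order. A toy version with made-up random values, without PyTorch:

```python
import random

random.seed(123)
vocab_size, context_length, dim = 6, 4, 3

# 2.7: token embeddings -- one random vector per vocabulary entry (lookup table)
tok_emb = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
# 2.8: absolute positional embeddings -- one vector per position in the context
pos_emb = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(context_length)]

token_ids = [2, 5, 0, 1]  # a stand-in input chunk
# The model's input is the token vector plus the position vector, elementwise
input_emb = [
    [t + p for t, p in zip(tok_emb[tid], pos_emb[pos])]
    for pos, tid in enumerate(token_ids)
]
# input_emb has shape (context_length, dim)
```

In the book these tables are trainable layers updated by backpropagation; the random initialization above only mimics their state before training.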