Course playlist: https://www.youtube.com/playlist?list=PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS
We look at the problems with the earlier bag-of-words approach, then use an improved technique, TF-IDF, to overcome them. In the demo, we use spaCy and scikit-learn to create TF-IDF vectors and build a simple document search engine.
Colab notebook: https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_vectorization.ipynb#scrollTo=CnC_i4oH2ARW
Timestamps:
00:00:00 TF-IDF
00:00:15 The problem with binary/frequency bag-of-words
00:01:03 Using relative frequency instead
00:01:50 Term Frequency (TF)
00:03:14 Inverse Document Frequency (IDF)
00:03:54 Getting a word's TF-IDF score
00:04:52 Variations of TF-IDF
00:05:49 DEMO: creating TF-IDF vectors with scikit-learn
00:08:41 DEMO: querying a corpus and ranking results
00:11:04 Benefits and shortcomings of TF-IDF
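For readers who want a feel for the idea before opening the notebook, here is a minimal plain-Python sketch of the outline above: term frequency, inverse document frequency, their product, and ranking documents against a query by cosine similarity. It is not the notebook's code. The function names (`tokenize`, `build_tfidf`, `vectorize_query`, `cosine`, `search`) and the toy corpus are made up for illustration, tokenization is a simple whitespace split rather than spaCy, and the weighting is the basic tf * log(N / df) form, which differs from scikit-learn's default smoothed and normalized variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real tokenizer such as spaCy.
    return text.lower().split()


def build_tfidf(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Basic IDF: log(N / df). A term found in every document gets weight 0.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # TF is the term's relative frequency within the document.
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors, idf


def vectorize_query(query, idf):
    # Terms never seen in the corpus have no IDF, so they are dropped.
    tokens = [t for t in tokenize(query) if t in idf]
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, vectors, idf):
    """Rank documents by cosine similarity to the query, best match first."""
    q = vectorize_query(query, idf)
    scored = [(cosine(q, v), d) for v, d in zip(vectors, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock markets fell sharply today",
]
vectors, idf = build_tfidf(docs)
results = search("dog", docs, vectors, idf)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Because "the" occurs in every document of this toy corpus, its IDF is zero and it contributes nothing to any vector, which is the behavior that frequency-based bag-of-words lacks.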
This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
Visit https://www.nlpdemystified.org/ to learn more.