Actually worked better than I thought lol
Resources:
Code: https://github.com/ALucek/ft-modernbert-domain
Model: https://huggingface.co/AdamLucek/ModernBERT-embed-base-legal-MRL
Dataset: https://huggingface.co/datasets/AdamLucek/legal-rag-positives-synthetic
Philipp Schmid’s Blog: https://www.philschmid.de/fine-tune-embedding-model-for-rag#3-define-loss-function-with-matryoshka-representation
Matryoshka Representation Learning Blog: https://huggingface.co/blog/matryoshka
MRL Paper: https://arxiv.org/pdf/2205.13147
Chapters:
00:00 - Why Care About Embedding Models
02:41 - Setting the Scene
04:33 - Synthetic Dataset Creation
06:09 - Triplets
08:05 - Formatting our Dataset
08:53 - Choosing a Base Model
10:14 - Evaluation Dataset Prep
12:44 - Matryoshka Representation Learning
15:51 - Creating the Sequential Evaluator
17:00 - Evaluation Metric Breakdown
21:08 - Base Model Evaluation
22:04 - Loading the Model for Training
22:51 - Loss Function Selection
25:05 - Trainer Arguments
26:08 - Training the Model!
26:57 - Comparing Base vs Fine-Tuned Metrics
28:28 - Using the Fine-Tuned Model
#ai #datascience #machinelearning