Byte-pair encoding, merge rules, subword splits
Difficulty: Hard / Advanced
Implement a simple BPE tokenizer, the foundation of GPT/LLaMA tokenization.
1. Split each word into characters + </w> end marker
2. Count all adjacent pairs across the corpus
3. Merge the most frequent pair into a single token
4. Repeat for num_merges iterations
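The four steps above can be sketched in plain Python (the merge loop itself needs no tensor operations); `train_bpe` and the toy corpus below are illustrative names, not the exercise's actual starter code:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (minimal sketch)."""
    # 1. Split each word into characters plus a </w> end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    # 4. Repeat the count-and-merge cycle for num_merges iterations.
    for _ in range(num_merges):
        # 2. Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # 3. Merge the most frequent pair into a single token everywhere.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus: "we" is the most frequent pair, so it is merged first.
merges, vocab = train_bpe(["low", "low", "lower", "newest", "newest", "newest"], 3)
print(merges)
```

Ties between equally frequent pairs are broken by insertion order here; production tokenizers typically pin down a deterministic tie-break rule as well.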
Implement the function below. Use only basic PyTorch operations.
Use this code to debug before submitting.
Try solving it yourself first!
For interactive practice with auto-grading, run TorchCode locally: pip install torch-judge, then call check("bpe").