Data Jam EP 4 - Tokenizer
Overview
Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. (Hugging Face NLP Course)
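For a concrete sense of what that translation looks like, here is a minimal sketch using OpenAI's tiktoken with the cl100k_base encoding (the one used by GPT-4 and GPT-3.5); the sample sentence is arbitrary:

```python
# pip install tiktoken
import tiktoken

# Load the BPE encoding used by GPT-4 / GPT-3.5.
# tiktoken.encoding_for_model("gpt-4") resolves the same encoding by model name.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenizers translate text into numbers."
ids = enc.encode(text)            # text -> list of integer token ids
print(ids)
print(enc.decode(ids) == text)    # decoding round-trips back to the original text
```

The links below go deeper into how these encodings are built and used.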
- Understanding GPT tokenizers
- tiktoken is a fast BPE tokeniser for use with OpenAI’s models.
- GPT-4 and GPT-3.5 Tokenizer (More Accurate than OpenAI’s)
- Tokenizer
- Summary of the tokenizers
- GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.
- GitHub - huggingface/tokenizers: 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
- Byte-Pair Encoding tokenization - Hugging Face NLP Course (a BPE training sketch follows this list)
- WordPiece tokenization - Hugging Face NLP Course
- Byte Pair Encoding and Data Structures - Rust NLP tales
- Fast WordPiece Tokenization
- Fast tokenizers’ special powers - Hugging Face NLP Course
- Why OpenAI’s API Is More Expensive for Non-English Languages - by Leonie Monigatti - Aug 2023 - Towards Data Science (a token-count comparison follows this list)
- sentencepiece/python/README.md at master
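As a companion to the BPE readings above, here is a hedged sketch of training a tiny BPE tokenizer with the huggingface/tokenizers library; the toy corpus and vocab_size are made-up placeholders, not values from any of the linked articles:

```python
# pip install tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus, purely illustrative.
corpus = [
    "Tokenizers translate text into numbers.",
    "Byte-pair encoding merges frequent symbol pairs.",
    "Models can only process numbers.",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace/punctuation before merging

trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("Tokenizers merge frequent pairs.")
print(encoding.tokens)  # learned subword strings
print(encoding.ids)     # their integer ids
```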
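And on the non-English pricing point above: API usage is billed per token, and BPE vocabularies trained mostly on English text tend to split other scripts into more pieces, so the same sentence can cost more in another language. A quick check with tiktoken (the sample sentences are arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you today?",
    "Korean": "안녕하세요, 오늘 기분이 어떠세요?",
}
for lang, text in samples.items():
    # Non-English text usually encodes to noticeably more tokens here.
    print(lang, len(enc.encode(text)))
```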
Written on September 26, 2023