Data Jam EP 4 - Tokenizer
Overview
Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. (Hugging Face NLP Course)
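For a concrete sense of what that translation looks like, here is a minimal sketch using OpenAI's tiktoken with the cl100k_base encoding (the one used by GPT-4 and GPT-3.5); the sample sentence is arbitrary:

```python
# pip install tiktoken
import tiktoken

# Load the BPE encoding used by GPT-4 / GPT-3.5.
# tiktoken.encoding_for_model("gpt-4") resolves the same encoding by model name.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenizers translate text into numbers."
ids = enc.encode(text)            # text -> list of integer token ids
print(ids)
print(enc.decode(ids) == text)    # decoding round-trips back to the original text
```

The links below go deeper into how these encodings are built and used.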
- Understanding GPT tokenizers
- tiktoken is a fast BPE tokeniser for use with OpenAI’s models.
- GPT-4 and GPT-3.5 Tokenizer (More Accurate than OpenAI’s)
- Tokenizer
- Summary of the tokenizers
- GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.
- GitHub - huggingface/tokenizers: 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
- Byte-Pair Encoding tokenization - Hugging Face NLP Course (a BPE training sketch follows this list)
- WordPiece tokenization - Hugging Face NLP Course
- Byte Pair Encoding and Data Structures - Rust NLP tales
- Fast WordPiece Tokenization
- Fast tokenizers’ special powers - Hugging Face NLP Course
- Why OpenAI’s API Is More Expensive for Non-English Languages - by Leonie Monigatti - Aug 2023 - Towards Data Science (a token-count comparison follows this list)
- sentencepiece/python/README.md at master
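As a companion to the BPE readings above, here is a hedged sketch of training a tiny BPE tokenizer with the huggingface/tokenizers library; the toy corpus and vocab_size are made-up placeholders, not values from any of the linked articles:

```python
# pip install tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus, purely illustrative.
corpus = [
    "Tokenizers translate text into numbers.",
    "Byte-pair encoding merges frequent symbol pairs.",
    "Models can only process numbers.",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace/punctuation before merging

trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("Tokenizers merge frequent pairs.")
print(encoding.tokens)  # learned subword strings
print(encoding.ids)     # their integer ids
```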
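And on the non-English pricing point above: API usage is billed per token, and BPE vocabularies trained mostly on English text tend to split other scripts into more pieces, so the same sentence can cost more in another language. A quick check with tiktoken (the sample sentences are arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you today?",
    "Korean": "안녕하세요, 오늘 기분이 어떠세요?",
}
for lang, text in samples.items():
    # Non-English text usually encodes to noticeably more tokens here.
    print(lang, len(enc.encode(text)))
```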
Written on September 26, 2023