Data Jam EP 4 - Tokenizer

Overview

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. (Hugging Face NLP Course)
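As a quick illustration of that text-to-numbers step, here is a minimal sketch using tiktoken (listed below). It assumes `pip install tiktoken` and uses the cl100k_base encoding shared by GPT-3.5 and GPT-4; the sample strings are just placeholders.

```python
import tiktoken

# cl100k_base is the byte-level BPE encoding behind gpt-3.5-turbo and gpt-4.
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("Tokenizers translate text into numbers.")
print(token_ids)              # a list of integer token IDs
print(enc.decode(token_ids))  # decoding round-trips to the original string

# Token counts vary by language: non-Latin scripts often need more tokens
# for the same meaning, which is one reason API cost effectively differs
# by language (see the Towards Data Science link in the list below).
for text in ["Hello, world", "สวัสดีชาวโลก"]:  # the Thai string means "hello, world"
    print(text, "->", len(enc.encode(text)), "tokens")
```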

  • Understanding GPT tokenizers
  • GitHub - openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI’s models.
  • GPT-4 and GPT-3.5 Tokenizer (More Accurate than OpenAI’s)
  • Tokenizer
  • Summary of the tokenizers
  • GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.
  • GitHub - huggingface/tokenizers: 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
  • Byte-Pair Encoding tokenization - Hugging Face NLP Course (the merge loop it teaches is sketched after this list)
  • WordPiece tokenization - Hugging Face NLP Course
  • Byte Pair Encoding and Data Structures - Rust NLP tales
  • Fast WordPiece Tokenization
  • Fast tokenizers’ special powers - Hugging Face NLP Course
  • Why OpenAI’s API Is More Expensive for Non-English Languages - Leonie Monigatti, Towards Data Science (Aug 2023)
  • sentencepiece/python/README.md at master
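The Byte-Pair Encoding chapter linked above boils down to a simple training loop: split words into characters, repeatedly count adjacent symbol pairs, and merge the most frequent pair into a new vocabulary entry. Below is a toy sketch of that loop in plain Python, using the small hug/pug/pun corpus from the course's worked example; it is illustrative only and skips details such as pre-tokenization and byte-level fallback.

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def apply_merge(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus from the HF course chapter: word -> frequency, split into characters.
words = {tuple(w): f for w, f in
         {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}.items()}

merges = []
for _ in range(6):                      # learn 6 merge rules
    counts = pair_counts(words)
    if not counts:
        break
    best = max(counts, key=counts.get)  # most frequent adjacent pair wins
    merges.append(best)
    words = apply_merge(words, best)

print(merges)  # first merge is ('u', 'g'), matching the course's example
```

Real tokenizers (tiktoken, Hugging Face tokenizers, SentencePiece) implement this same idea with far more efficient data structures; the Rust NLP tales post above digs into those.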
Written on September 26, 2023