1. Introduction

Neural networks cannot process text directly, so raw strings must first be converted into embeddings (vectors). Tokens serve as the bridge: raw text is split into tokens, and each token is mapped to an integer ID that indexes into an embedding look-up table.
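As a minimal, self-contained sketch of that pipeline (the toy vocabulary, token IDs, and embedding dimension below are purely illustrative, not taken from any particular tokenizer):

```python
import numpy as np

# Toy vocabulary: token string -> integer ID (illustrative only).
vocab = {"the": 0, "cat": 1, "sat": 2, "token": 3, "izers": 4}

# Embedding look-up table: one row per token ID, here 8-dimensional.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))

def embed(tokens):
    """Map tokens to vectors: token -> integer ID -> row of the embedding table."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]

vectors = embed(["the", "cat", "sat"])
print(vectors.shape)  # (3, 8): one 8-dimensional vector per token
```

In a trained model the rows of the embedding table are learned parameters rather than random values; the look-up mechanism is the same.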
There are several approaches to tokenization.

Word-level tokenization splits text on whitespace and punctuation, treating each word as a token. While intuitive, it cannot handle unseen words, which is a significant limitation, and in multilingual scenarios the vocabulary can explode.

Character-level tokenization uses individual characters as tokens. This guarantees coverage of any word or sentence, but it dramatically increases sequence length, a particular concern for transformer-based models, whose computational cost scales quadratically with length.

Subword tokenization strikes a balance based on a simple insight: common words should remain intact, while rare words should be broken into meaningful subunits. For example, "tokenizers" might become ["token", "izers"], preserving semantic information while handling unseen combinations. This keeps vocabulary sizes manageable (typically 30K–50K tokens) while still allowing arbitrary text to be represented.

2. Byte Pair Encoding

Byte Pair Encoding (BPE) [1] is a subword tokenization algorithm that builds its vocabulary through iterative merging. Starting from individual characters, it repeatedly finds the most frequent adjacent pair of symbols and merges it into a new token. After thousands of iterations, the vocabulary naturally captures common words as single tokens, frequent subwords as intermediate units, and individual characters as fallbacks for rare patterns.
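A minimal sketch of the merge loop, in the spirit of the reference pseudocode (the toy corpus, word frequencies, and merge count are made up for illustration):

```python
import re
from collections import Counter

def get_pair_counts(corpus):
    """Count frequencies of adjacent symbol pairs across all words."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, corpus):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    merged = {}
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    for word, freq in corpus.items():
        merged[pattern.sub("".join(pair), word)] = freq
    return merged

# Toy corpus: each word is a space-separated sequence of characters,
# mapped to a hypothetical frequency count.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

num_merges = 10  # real vocabularies use tens of thousands of merges
for _ in range(num_merges):
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print(best)                        # the merge rule learned at this step
```

Each printed pair becomes a new vocabulary entry; at inference time the learned merges are replayed in order to tokenize unseen text.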
...