1. Introduction

Neural networks cannot process text directly, so raw strings must first be converted into embeddings (vectors). Tokens serve as the bridge: raw text is split into tokens, and each token is mapped to an integer ID that indexes into an embedding look-up table.
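As a minimal, self-contained sketch of that pipeline (the toy vocabulary, token IDs, and embedding dimension below are purely illustrative, not taken from any particular tokenizer):

```python
import numpy as np

# Toy vocabulary: token string -> integer ID (illustrative only).
vocab = {"the": 0, "cat": 1, "sat": 2, "token": 3, "izers": 4}

# Embedding look-up table: one row per token ID, here 8-dimensional.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))

def embed(tokens):
    """Map tokens to vectors: token -> integer ID -> row of the embedding table."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]

vectors = embed(["the", "cat", "sat"])
print(vectors.shape)  # (3, 8): one 8-dimensional vector per token
```

In a trained model the rows of the embedding table are learned parameters rather than random values; the look-up mechanism is the same.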
There are several approaches to tokenization.

Word-level tokenization splits text on whitespace and punctuation, treating each word as a token. While intuitive, it cannot handle unseen words, which is a significant limitation, and in multilingual scenarios the vocabulary can explode.

Character-level tokenization uses individual characters as tokens. This guarantees coverage of any word or sentence, but it dramatically increases sequence length, a particular concern for transformer-based models, whose computational cost scales quadratically with length.

Subword tokenization strikes a balance based on a simple insight: common words should remain intact, while rare words should be broken into meaningful subunits. For example, "tokenizers" might become ["token", "izers"], preserving semantic information while handling unseen combinations. This keeps vocabulary sizes manageable (typically 30K–50K tokens) while still allowing arbitrary text to be represented.

2. Byte Pair Encoding

Byte Pair Encoding (BPE) [1] is a subword tokenization algorithm that builds its vocabulary through iterative merging. Starting from individual characters, it repeatedly finds the most frequent adjacent pair of symbols and merges it into a new token. After thousands of iterations, the vocabulary naturally captures common words as single tokens, frequent subwords as intermediate units, and individual characters as fallbacks for rare patterns.
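A minimal sketch of the merge loop, in the spirit of the reference pseudocode (the toy corpus, word frequencies, and merge count are made up for illustration):

```python
import re
from collections import Counter

def get_pair_counts(corpus):
    """Count frequencies of adjacent symbol pairs across all words."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, corpus):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    merged = {}
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    for word, freq in corpus.items():
        merged[pattern.sub("".join(pair), word)] = freq
    return merged

# Toy corpus: each word is a space-separated sequence of characters,
# mapped to a hypothetical frequency count.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

num_merges = 10  # real vocabularies use tens of thousands of merges
for _ in range(num_merges):
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print(best)                        # the merge rule learned at this step
```

Each printed pair becomes a new vocabulary entry; at inference time the learned merges are replayed in order to tokenize unseen text.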
...