3 min read

Transformers

Table of Contents

Transformers replace recurrence and convolution with self-attention: every token can directly attend to every earlier token. This is the architecture behind modern autocomplete and GPT-style models.

Key idea

Each token forms a query, key, and value; attention weights every other token by relevance and mixes their values:

Attention(Q, K, V) = softmax(QKα΅€ / √dβ‚–) Β· V

A causal mask prevents attending to future tokens, so the model predicts the next token from the past only.

Inputs & representation

  • Input: token embeddings + positional encodings (attention is order-agnostic without them).
  • Model: stacked masked self-attention + feed-forward blocks.
  • Output: next-token distribution at every position.
tokens + positions
      ↓
[ masked self-attention ] ←→ all earlier tokens
      ↓
[ feed-forward ] β†’ next-token logits

How it applies to autocomplete

Run the prefix through the decoder; the last position’s logits rank completions. KV-caching makes incremental typing cheap β€” reuse past keys/values and only compute the new token. This is the strongest quality option, at higher compute cost.

Trade-offs

StrengthWeakness
Best quality, long-range contextAttention is O(nΒ²) in sequence length
Fully parallel trainingLarger memory / compute footprint
KV-cache β†’ fast incremental decodeNeeds positional encoding & more data

Decoding strategies

How you turn logits into a suggestion matters as much as the model:

  • Greedy / top-k / top-p (nucleus) β€” speed vs. diversity trade-off.
  • Beam search β€” better for short, high-confidence completions.

References

  • Vaswani et al. (2017), Attention Is All You Need.
  • Radford et al., the GPT series (decoder-only LMs).

Notes / TODO

  • Fine-tune a small decoder-only model on our corpus.
  • Benchmark KV-cache latency for live typing.