The simplest useful autocomplete: estimate the probability of the next token from the counts of short token sequences seen in training data. No neural network, fast to train, and a strong baseline to beat.
Key idea
Approximate the full history with the previous n β 1 tokens (the Markov assumption):
P(wβ | wβ β¦ wβββ) β P(wβ | wβββββ β¦ wβββ)
A trigram model (n = 3) predicts the next word from the previous two.
Inputs & representation
- Input: the last n β 1 tokens, as discrete symbols (no embeddings).
- Model: a count table with smoothing (KneserβNey, Laplace) to handle unseen sequences.
- Output: a probability distribution over the vocabulary.
# Trigram counts β conditional probability
P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
How it applies to autocomplete
Given what the user has typed, look up the most probable continuations. Great for short, high-frequency phrases; weak once context matters beyond a few words.
Trade-offs
| Strength | Weakness |
|---|---|
| Trivial to train, very fast | No long-range context (fixed window) |
| Interpretable counts | Sparse β needs smoothing for unseen grams |
| Tiny memory at low n | Table explodes as n grows |
References
- Jurafsky & Martin, Speech and Language Processing, ch. on n-gram LMs.
Notes / TODO
- Build a trigram baseline on our corpus and record perplexity.
- Compare smoothing methods.