2 min read

N-gram language models

Table of Contents

The simplest useful autocomplete: estimate the probability of the next token from the counts of short token sequences seen in training data. No neural network, fast to train, and a strong baseline to beat.

Key idea

Approximate the full history with the previous n βˆ’ 1 tokens (the Markov assumption):

P(wβ‚œ | w₁ … wβ‚œβ‚‹β‚) β‰ˆ P(wβ‚œ | wβ‚œβ‚‹β‚™β‚Šβ‚ … wβ‚œβ‚‹β‚)

A trigram model (n = 3) predicts the next word from the previous two.

Inputs & representation

  • Input: the last n βˆ’ 1 tokens, as discrete symbols (no embeddings).
  • Model: a count table with smoothing (Kneser–Ney, Laplace) to handle unseen sequences.
  • Output: a probability distribution over the vocabulary.
# Trigram counts β†’ conditional probability
P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)

How it applies to autocomplete

Given what the user has typed, look up the most probable continuations. Great for short, high-frequency phrases; weak once context matters beyond a few words.

Trade-offs

StrengthWeakness
Trivial to train, very fastNo long-range context (fixed window)
Interpretable countsSparse β€” needs smoothing for unseen grams
Tiny memory at low nTable explodes as n grows

References

  • Jurafsky & Martin, Speech and Language Processing, ch. on n-gram LMs.

Notes / TODO

  • Build a trigram baseline on our corpus and record perplexity.
  • Compare smoothing methods.