What's The Frequency?

Zipfing to some early conclusions.

Dec 01, 2023

Last time out, I mentioned that roughly 44% of the words in Moby Dick are hapax legomena: words used only once. A figure in the 40-60% range is common when we’re dealing with corpora such as a novel, the New International Version Bible, every James Patterson story ever, or all the closed captions for Modern Family.

But there are two different ways to figure that percentage. Consider this borderline-nonsense sentence:

The quick runners—the runners race, quick, quick—quick, quick.

In one sense, this sentence has four distinct words (the, quick, runners, race), and only one of them, race, is not repeated, so that’s 25%. This is the same sense in which Moby Dick has 44% hapaxes. But in another sense, this sentence has ten words in total, and only one of them is not repeated, so that’s 10%.
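If you'd like to check that arithmetic yourself, here is a minimal Python sketch of the two ways of counting. The function name `hapax_percentages` and the crude tokenizer (lowercase everything, keep runs of letters and apostrophes) are my own assumptions for illustration, not the method behind the Moby Dick figure.

```python
import re
from collections import Counter

def hapax_percentages(text):
    """Return (percent of distinct words, percent of all words) that are hapaxes."""
    # Crude tokenizer (an assumption): lowercase, keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    # A hapax is any word that appears exactly once.
    hapaxes = [word for word, count in counts.items() if count == 1]
    by_distinct_word = 100 * len(hapaxes) / len(counts)  # first sense
    by_running_word = 100 * len(hapaxes) / len(tokens)   # second sense
    return by_distinct_word, by_running_word

sentence = "The quick runners—the runners race, quick, quick—quick, quick."
print(hapax_percentages(sentence))  # (25.0, 10.0)
```

A real corpus would need more careful tokenizing (hyphens, numbers, possessives), and those choices can nudge the percentages.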

It’s the second sense used in the Moby Dick graph below, which charts the rankings of each word (first most common, second most common, third most common, et cetera) against the actual number of times those words are used. It’s in log scale, meaning that the distance from 1 to 10 is the same as from 10 to 100. Hapax legomena are represented in red. The double-use dis legomena are in blue. At the other extreme are the commonest words, mostly “glue words” like the, of, and, to, and a, which hold almost any writing together.
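For the curious, a chart like this starts from a simple rank-and-count table. The sketch below builds one for the toy sentence and converts each row to log-scale coordinates. The names `rank_frequency` and `log_point` are hypothetical, and breaking ties alphabetically is my own choice; it is not necessarily how the graph here was made.

```python
import math
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) rows, most frequent word first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically so the order is stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

def log_point(rank, count):
    """Base-10 log coordinates of one word on a log-log chart."""
    return math.log10(rank), math.log10(count)

sentence = "The quick runners—the runners race, quick, quick—quick, quick."
for rank, word, count in rank_frequency(sentence):
    print(rank, word, count, log_point(rank, count))
```

On a log scale, rank 1 to rank 10 covers the same distance as rank 10 to rank 100, because `log10` goes up by exactly 1 each time.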

(The commonest word in Moby Dick that is not nearly so common in writing-in-general should be no surprise: it’s “whale.”)

You might notice this line is close to being a straight descending diagonal, making a right triangle with the y- and x-axes. You see that in a lot of different studies like this, too! It’s a visual representation of Zipf’s Law, which goes like this:

When a list of measured values is sorted in decreasing order, the value of the nth entry is inversely proportional to n.

If followed strictly, this would mean the most common word in Moby Dick, “the,” would appear twice as often as the next most common word, “of,” and three times as often as the third most common word, “and.” And so on.
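Spelled out as arithmetic, a strict Zipf's Law says the word at rank n appears (top count) / n times. Here is a tiny sketch of that. The function name `zipf_expected` is hypothetical, and the top count of 600 is a made-up number for illustration, not the real count of “the” in Moby Dick.

```python
def zipf_expected(top_count, max_rank):
    """Counts a strict Zipf's Law would predict for ranks 1..max_rank."""
    return [top_count / n for n in range(1, max_rank + 1)]

# 600 is an invented top count, chosen only because it divides neatly.
print(zipf_expected(600, 3))  # [600.0, 300.0, 200.0]
```

This is also why the line looks straight on a log-log chart: if count = C / n, then log(count) = log(C) - log(n), which is a straight line with a slope of -1.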

Now, you can see that the rule is not followed strictly. The line is wibbly-wobbly at the top and breaks into clusters of words tied at the same count at the bottom, with the biggest such clusters being the hapax and dis legomena in red and blue. But it’s close, and study after study of word frequency keeps coming up with similar lines.
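One rough way to put a number on “close” is to fit a straight line to the log-log points and look at its slope. A strict Zipf's Law gives a slope of -1. The sketch below uses ordinary least squares, written out by hand. The name `log_log_slope` is hypothetical, and this is just one simple way to measure the fit, not the method used in any particular study.

```python
import math

def log_log_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    `counts` must be sorted from most to least frequent, contain only
    positive numbers, and have at least two entries.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Invented counts that follow the rule exactly: 600, 300, 200, 150, 120, 100.
ideal = [600 / n for n in range(1, 7)]
print(log_log_slope(ideal))  # very close to -1
```

The long flat runs of tied words at the bottom of a real corpus will pull a fit like this around, so treat the slope as a rough summary.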

This works with most natural languages and even some conlangs, too, not just English. In fact, it also appears in many studies of frequency distribution that have nothing to do with words or language.

No one is entirely sure why.

Still, it’s the variations from the rule that attract my interest the most. I’m fascinated by the idea that statistical modeling could let us imitate the style of old masters like Shakespeare. That idea has its problems: Stephen King might not appreciate some AI company rolling out a KingBot 1.0. But the potential to learn and create with this is amazing, all the same.

Generative AI doesn’t do that well at style imitation yet, but computational linguists can “fingerprint” a style to identify the authors of anonymous or disputed works. They do this not by tracking the “whales” in a writer’s work, but by tracking the frequencies of all their most common words—the, of, a, an, for, and such.
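To give a feel for the idea, here is a toy sketch of a common-word “fingerprint”: the share of a text taken up by each word on a short list, plus a simple distance between two such profiles. The names `fingerprint` and `distance`, the five-word list, and the sum-of-absolute-differences measure are all assumptions of mine. Real authorship studies use far longer word lists and more careful statistics.

```python
import re
from collections import Counter

# A tiny illustrative list; real studies track many more common words.
COMMON_WORDS = ("the", "of", "a", "an", "for")

def fingerprint(text, words=COMMON_WORDS):
    """Share of all running words taken up by each common word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in words]

def distance(profile_a, profile_b):
    """Sum of absolute differences; smaller means more similar profiles."""
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b))

known = fingerprint("the cat and the dog")
disputed = fingerprint("a cat for a dog")
print(known, disputed, distance(known, disputed))
```

With samples this short the numbers mean nothing. The approach only becomes informative with thousands of words per author.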

Understanding writing style could come down to understanding this humble little line and how it kinks and breaks.

Tomorrow: a plug!

T Campbell's Grid
