What do readability scores actually measure?
When most people first come across readability testing, the experience is oddly reassuring. It looks like an almost perfect solution for making sure your copy will be understood by your audience. You paste in some text, a formula gives you a number, and suddenly a very fuzzy problem starts to look measurable.
That is part of the appeal.
If you can improve the readability score through some redrafting, it feels like progress. In some cases, it probably is progress. Shorter sentences usually help. Fewer syllables often help. But as with my first experiences using AI tools to “fix” customer communications, the deeper you go, the more you start to realise that first impressions can be deceptive.
At first glance, a readability score can feel wonderfully concrete. A bit like a teacher marking homework. Here is your number. Here is your grade level. Job done.
The trouble is that this only works if you understand what that number is actually measuring, and just as importantly, what it is not measuring.
That sounds obvious, but in practice, many firms cite readability scores as if they are self-explanatory. They are not. They are formulas. Useful formulas, in the right place. But formulas, nonetheless. And if you do not understand what a formula measures, you do not understand what it misses.
That is where the real journey starts, especially when you are looking through the lens of Customer Understanding.
What are the main readability scores actually measuring?
The first surprise for many people is that there is not one readability score. There are many, and their names don’t exactly trip off the tongue: Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog, SMOG, ARI and Dale-Chall to name just a few. They often get spoken about as if they are all doing roughly the same thing, but they are not. They are looking at text through slightly different lenses, which is precisely why they sometimes disagree.
Flesch Reading Ease and Flesch-Kincaid both rely heavily on sentence length and syllables per word, but they express the result differently. One gives you an ease score, the other converts the same broad ingredients into a US school grade level. Gunning Fog cares about sentence length, but places particular weight on so-called complex words, usually words with three or more syllables. SMOG does something similar but was designed around a 30-sentence sample and has long been treated as a more conservative measure. ARI takes yet another route and uses characters per word rather than syllables, which can shift the result again. Rather than focusing mainly on syllables or characters, Dale-Chall looks at sentence length and the proportion of words that fall outside a familiar word list. It tries to capture difficulty through unfamiliar vocabulary as well as structural complexity. So even before you get into the weaknesses of readability testing, you are already dealing with a family of diverse measures rather than one universal truth.
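To make the differences concrete, here is a rough Python sketch of four of those formulas applied to the same passage. The formulas are the published ones; the syllable counter is a deliberately crude vowel-group heuristic (real tools use pronunciation dictionaries), so treat the output as an illustration of how the measures can diverge rather than as an authoritative score.

```python
import re

def syllable_count(word: str) -> int:
    # Crude vowel-group heuristic; real tools use pronunciation dictionaries,
    # so treat these counts as approximations.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def readability_scores(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [syllable_count(w) for w in words]

    n_sent = max(len(sentences), 1)
    n_words = max(len(words), 1)
    n_chars = sum(len(w) for w in words)
    n_complex = sum(1 for s in syllables if s >= 3)  # Gunning Fog's "complex" words

    asl = n_words / n_sent          # average sentence length
    asw = sum(syllables) / n_words  # average syllables per word

    return {
        # Flesch Reading Ease: higher means easier
        "flesch_reading_ease": 206.835 - 1.015 * asl - 84.6 * asw,
        # Flesch-Kincaid: the same ingredients expressed as a US grade level
        "flesch_kincaid_grade": 0.39 * asl + 11.8 * asw - 15.59,
        # Gunning Fog: sentence length plus the share of three-or-more-syllable words
        "gunning_fog": 0.4 * (asl + 100 * n_complex / n_words),
        # ARI: characters per word instead of syllables
        "ari": 4.71 * (n_chars / n_words) + 0.5 * asl - 21.43,
    }

sample = ("You may be eligible for compensation if your claim is accepted. "
          "We will write to you within ten working days.")
print(readability_scores(sample))
```

Run something like this over a passage of your own and the grade-level measures will often land a year or two apart, which is precisely the point: each one is weighing different surface signals.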
That matters because people often aim to “get the text down to a 12-year-old reading age” as if that is a single, universal finish line. It is not. Different formulas can give you different answers on the same passage, because each formula is sensitive to different surface signals in the writing. A text can look improved on one score while remaining awkward, vague, or cognitively tiring in ways that another score might catch.
And that takes you to the second surprise. These formulas are old, deliberately simple, and mostly concerned with surface features of the text. That is not a criticism. It is simply what they were designed to do. They are quick proxies. They are not deep models of understanding. Sentence length matters. Word length matters. Syllable density matters. Of course they do. But anyone who has spent serious time reviewing customer communications knows that this is only the start of the problem, not the end of it.
A short sentence can still be confusing. A familiar-looking paragraph can still hide its real point. A defined term can be technically precise but mentally slippery. A passage can score better after editing simply because the sentences have been chopped up, while the logical burden on the reader has actually increased. In other words, text can become shorter without becoming clearer. Bruce made this point decades ago when showing how a passage could be rewritten into shorter sentences while forcing the reader to do more inferential work. That is exactly the sort of trap surface readability formulas can miss.
The wider problem: Reading burden, not just readability
That is the point at which you stop thinking only about readability scores and start thinking about reading burden more broadly.
For CUE, that was where the subject really opened up.
Because once you move past the comfort blanket of a single score, you start to see at least three different layers of difficulty sitting inside a piece of text.
First, there is the traditional readability layer. This is the world of sentence length, syllables, characters, and ratios. It is not useless at all. In fact, it is often a very good first warning sign. If your average sentence is sprawling and your words are consistently long, something is probably wrong.
Second, there is the plain English and structural layer. This is where things get much more interesting. Now you are looking at whether the sentence structure is overloaded, whether clauses are stacked awkwardly, whether the subject gets buried, whether the logic turns back on itself, whether conditions and exceptions are layered in ways that increase processing strain, whether the flow forces the reader to hold too much in working memory, and whether the wording is doing unnecessary cognitive damage quite apart from the syllable count. This is where much of the real work starts. A formula based on sentence length alone cannot tell you whether the structure is helping the reader or quietly working against them.
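To give a flavour of what checks at this layer can look like, here is a small illustrative sketch. These are crude regex-level heuristics of my own invention, not CUE’s actual checks, and a serious system would lean on a proper parser; but even rough flags like these pick up things a syllable count never will.

```python
import re

# Illustrative heuristics for structural burden. These are intentionally simple:
# a production system would use real syntactic analysis rather than regexes.

SUBORDINATORS = r"\b(if|unless|although|whereas|provided that|subject to|except where)\b"
PASSIVE = r"\b(is|are|was|were|be|been|being)\s+\w+(ed|en)\b"  # misses irregular participles

def structural_flags(sentence: str) -> dict:
    return {
        "words": len(sentence.split()),
        # Stacked conditions and exceptions increase the load on working memory.
        "conditions": len(re.findall(SUBORDINATORS, sentence, flags=re.I)),
        # Passive constructions can bury who does what to whom.
        "passive_hits": len(re.findall(PASSIVE, sentence, flags=re.I)),
        # Comma count is a rough proxy for clause stacking.
        "clauses_estimate": sentence.count(",") + 1,
    }

sentence = ("If, subject to the exclusions set out in section 4, a claim is accepted, "
            "payment will be made unless further evidence is requested.")
print(structural_flags(sentence))
```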
Third, there is the lexical and psycholinguistic layer. This is the one that many firms barely touch, even though the FCA has for years used words like “familiar” when talking about customer communications. But what does familiar actually mean? Familiar to whom? Familiar because it is short? Familiar because it is frequently encountered? Familiar because it is learnt early in life? Familiar because it is easy to picture? Familiar because it is concrete rather than abstract? Once you start pulling on that thread, you realise that “familiarity” is not one thing at all. It opens into a large body of psycholinguistic research on age of acquisition, concreteness, imageability, and frequency.
That is where the subject stopped being a tidy writing problem and became something far richer.
Take age of acquisition. A word learnt early in life tends to be processed differently from one learnt much later, even when both are technically known. Or take imageability and concreteness. Some words are easy to picture and mentally grasp. Others are abstract, detached, and much harder to hold onto, even when they sound perfectly respectable on the page. Then there is frequency. Corpora such as SUBTLEX-UK, based on British television subtitles, exist precisely because raw word frequency turns out to matter, and subtitle-based frequencies can perform surprisingly well as indicators of the words people are likely to encounter in everyday language.
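As an illustration of how such norms can be put to work, the sketch below profiles a set of words against a couple of norm files. The file names and column headers are placeholders; in practice you would load whichever published datasets you actually use, such as age-of-acquisition ratings, concreteness ratings, or SUBTLEX-UK frequencies, each of which comes with its own layout and licensing terms.

```python
import csv

def load_norms(path: str, word_col: str, value_col: str) -> dict[str, float]:
    # Generic loader; the file names and column headers used below are hypothetical,
    # since each published norm set has its own layout.
    norms = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                norms[row[word_col].lower()] = float(row[value_col])
            except (KeyError, ValueError):
                continue
    return norms

def lexical_profile(words: list[str], norms: dict[str, float]) -> dict:
    values = [norms[w.lower()] for w in words if w.lower() in norms]
    coverage = len(values) / max(len(words), 1)
    mean = sum(values) / len(values) if values else None
    return {"coverage": coverage, "mean": mean}

# Placeholder file names: substitute the norm sets you actually hold.
aoa = load_norms("aoa_norms.csv", "Word", "AoA")
concreteness = load_norms("concreteness_norms.csv", "Word", "Concreteness")

words = "You may be eligible for compensation if your claim is accepted".split()
print("Age of acquisition:", lexical_profile(words, aoa))
print("Concreteness:", lexical_profile(words, concreteness))
```

The coverage figure matters as much as the mean: if half the words in a passage are missing from your norm set, the average tells you very little about the reader’s likely experience.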
By this stage, the original dream of “just run a Flesch score and tidy a few long sentences” starts to look wonderfully innocent. Because what began as a single number has become a set of overlapping lenses, each with its own strengths, weaknesses, and blind spots.
From one score to a full evidencing framework
That leads to another awkward truth. There is no single universally accepted plain English “banding system” that tells you, with scientific finality, that one score is fit for investors with a basic level of knowledge and experience, or that another is appropriate for more Informed or Sophisticated investors. Firms often behave as if these boundaries are simply out there waiting to be downloaded. In reality, if you want robust audience thresholds, you end up having to calibrate them against real retail clients and real outcomes. Otherwise, you are still leaning on educated proxies, not evidence.
And of course, once you decide to do all this properly, a final problem appears. Delivery.
It is one thing to write a blog explaining that readability is multi-layered. It is quite another to build a repeatable system that can test text across multiple readability formulas, sentence-level plain English checks, structural logic checks, lexical burden checks, psycholinguistic measures, and calibration frameworks, while also preserving meaning and leaving an audit trail that a regulated firm can actually rely on.
That is where the romance starts to wear off.
Because yes, AI can help. It can help a lot. But if you want deterministic, controlled, regulator-ready outputs, you quickly discover that asking one large prompt to “judge readability properly” is a recipe for drift, inconsistency, and false confidence. You end up needing layers of logic, layers of checks, and often hundreds of tightly constrained prompts just to do something that sounded, in theory, straightforward.
That was one of the deeper lessons for me in building CUE.
Traditional readability scores still have real value. They are useful warning lights. They are often good at picking up obvious surface difficulty. But they are only the front door. Once you step through it, you find sentence structure, logical burden, lexical familiarity, abstraction, imageability, acquisition age, calibration, and audience context all waiting for you.
Readability is not one score. It is not one formula. It is not one prompt. And it is certainly not one neat shortcut to customer understanding.
So when someone says a document is “readable” because the Flesch-Kincaid score improved, my first reaction now is not to disagree. It is simply to ask a much more awkward question.
Readable in what sense?