Can AI really judge whether text is easy to understand?

Most people’s first instinct is the same: upload the document, ask AI to simplify it, and assume the problem is solved. This article explains why that approach breaks down in regulated communications, and why reliable customer understanding evidence needs far more than a clever prompt.

When Consumer Duty first arrived, I suspect many of us in regulated firms had the same thought. Surely, when it comes to Customer Understanding, I could just use AI to work its magic and make my literature more retail-friendly.

How difficult could it be? Upload the brochure into ChatGPT, write a quick prompt asking it to make the text easier for a retail customer to understand, and hey presto, job done.

And at first glance, it looked perfect. The document sounded cleaner, more polished, and GPT was confidently telling me it had done exactly what I asked. In fact, it seemed so sure of itself that I went to bed that night feeling rather smug, a seemingly awkward new regulatory requirement done and dusted.

But somewhere in the back of my mind I heard a quiet voice whisper, “if it is too good to be true…” And the next day, as I began to kick the tyres and move from curiosity to evidence, the cracks started to appear.

That was when I realised there is a very big difference between an AI response that feels impressive, and a process that I could put in front of compliance and legal teams with a straight face.

Why a simple prompt is not enough

Over the next eighteen months, as I came to understand Large Language Models, or LLMs, in more depth, I learnt what is probably the single biggest lesson, the one that still guides me now: AI is probabilistic, not deterministic.

In plain English, excuse the pun: ask AI the same question twice and you can get two meaningfully different answers. Give it the same document on two different days and it will reach different judgments about what is clear, familiar, or too complex. Watch it simplify one clause carefully, then ignore a similar issue in the next paragraph for no obvious reason. Watch it skip whole sections because it seems to prioritise speed or brevity. See it confidently state things that are not actually grounded in the source text. See it miss a structural issue one moment, then suddenly detect it the next.

Even tasks that sound simple, such as counting words after a rewrite or preserving a sentence’s meaning, turned out to be far less reliable than many people would imagine.

That is not because LLMs are broken. It is because that is not how they are designed to work. An LLM is not a rules engine. It does not “know” in the way we often assume. It predicts. It works probabilistically, generating the next most likely sequence based on patterns learned from vast amounts of text.
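To make that concrete, here is a minimal sketch of temperature sampling, the mechanism behind that variability. The scores are made-up toy numbers, not output from any real model; the point is only that the same input can legitimately produce different continuations on different runs.

```python
import math
import random

# Toy scores for candidate next words. In a real LLM these come from the
# model itself; these numbers are invented purely for illustration.
candidates = {"may": 2.1, "will": 1.9, "might": 1.7, "must": 0.4}

def sample_next(scores: dict, temperature: float = 0.8) -> str:
    # Softmax with temperature: higher temperature flattens the
    # distribution, giving less likely words a better chance.
    weights = [math.exp(s / temperature) for s in scores.values()]
    return random.choices(list(scores), weights=weights)[0]

# The same "prompt", run twice, can continue differently.
print(sample_next(candidates))  # e.g. "may"
print(sample_next(candidates))  # e.g. "will"
```

In customer literature, the gap between “may”, “will”, and “must” is exactly the kind of variation that changes meaning.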

In other words, AI can be wrong. And what worried me most was not that it got some things wrong. It was how confident it sounded when it did.

That distinction between confidence and correctness matters enormously in financial communications. If you are rewriting an internal email, some variability is harmless. If you are rewriting customer-facing literature explaining risks, benefits, exclusions, or eligibility, variability is not a feature. It is a problem.

In this context, text is not better simply because it sounds smoother. It has to preserve meaning. It has to preserve legal and commercial accuracy. It must not simplify away a condition that matters, quietly drop a troublesome sentence, or replace a defined term with a looser everyday phrase that alters precision. It must not insert assumptions that merely sound likely. It must not quietly change tense, scope, threshold, or causality. And whatever it does, it needs to do consistently, not just occasionally.

Why the real work sits behind the scenes

That is where the naïve idea of “just prompt GPT” begins to collapse. Once you start looking seriously at what is required, you realise the real work is not just about the prompt. The real work is everything sitting behind it.

It starts with the source text itself. Documents are rarely clean objects. PDFs and client files often contain hidden characters, formatting noise, and extraction issues that can distort the text before the model has even started reading.
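To give a flavour of what that clean-up involves, here is a minimal pre-processing sketch in Python. The characters it strips are common examples of extraction debris, not an exhaustive list, and real pipelines go considerably further.

```python
import re
import unicodedata

def clean_extracted_text(raw: str) -> str:
    # Fold visually identical variants to one canonical form, e.g. the
    # single-character ligature "ﬁ" becomes the two letters "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Strip zero-width spaces, joiners, soft hyphens, and byte-order
    # marks: invisible to readers, but noise to the model.
    text = re.sub(r"[\u200b\u200c\u200d\u00ad\ufeff]", "", text)
    # Collapse the whitespace runs that column layouts leave behind.
    return re.sub(r"[ \t]+", " ", text).strip()

print(clean_extracted_text("The fund\u200b may  in\u00advest."))
# -> "The fund may invest."
```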

Then comes the instruction design. Tell a model to “make this easier to read” and you leave far too much room for guesswork, shortcuts, and false confidence. But pile in too much instruction and parts of it start getting ignored.

Then there is the subject matter research. If you want to control an AI properly, you need to be deeply knowledgeable about what good actually looks like.

Then come the guardrails, which are often the hardest part of all. It is not enough to tell the model what to improve. You also have to tell it what it must never change, and that means understanding its habits and weak spots in detail.

After that comes the checking, and then checking again. Outputs need to be tested against anchors in the source text, the figures, defined terms, and conditions that must survive a rewrite, so unsupported changes can be spotted rather than waved through.
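One deterministic version of that test, shown below as a sketch rather than CUE's actual implementation, is to extract the figures and quoted terms from the source and confirm that each one survives into the rewrite.

```python
import re

def extract_anchors(text: str) -> set:
    # Money amounts, figures, and percentages that must survive a rewrite.
    numbers = re.findall(r"£?\d[\d,.]*%?", text)
    # Defined terms: for illustration, anything inside double quotes.
    terms = re.findall(r'"([^"]+)"', text)
    return set(numbers) | set(terms)

def missing_anchors(source: str, rewrite: str) -> set:
    # Anything present in the source but absent from the rewrite is
    # flagged for human review instead of slipping through.
    return {a for a in extract_anchors(source) if a not in rewrite}

source = 'Withdrawals above £1,000 may incur a 2.5% charge.'
rewrite = 'You may be charged if you take out more than £1,000.'
print(missing_anchors(source, rewrite))  # {'2.5%'}
```

A crude check like this catches exactly the failure described earlier: a rewrite that sounds cleaner but has quietly dropped a condition.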

Then there is the scoring logic. If you want evidence, you cannot rely on a model merely feeling that something is simpler.

Then there is calibration. “Easy enough” for one audience is not necessarily easy enough for another.
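Both points can be illustrated together. The sketch below computes the standard Flesch Reading Ease score deterministically, rather than asking a model whether the text feels simpler, and applies a different pass mark per audience. The syllable counter is a crude heuristic and the threshold numbers are purely illustrative, not CUE's calibration.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count vowel groups. Real tools use pronunciation
    # dictionaries, but this is enough to make the idea concrete.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch Reading Ease formula: higher scores read more easily.
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

# Illustrative pass marks only: "easy enough" is calibrated per audience.
THRESHOLDS = {"retail": 60.0, "professional": 40.0}

score = flesch_reading_ease("You may be charged if you withdraw early.")
print(round(score, 1), score >= THRESHOLDS["retail"])  # 82.4 True
```

The same score that passes for a professional audience can fail for a retail one, which is the whole point of calibrating the threshold rather than declaring the prose fine.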

Then there is human review. In a regulated setting, there still needs to be four eyes over the final result before anything goes out to clients.

And finally, there is the audit trail. A regulated firm needs to know why each change was made, not just that it was made.
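What that can look like in practice is a structured record attached to every change. The field names below are illustrative, not CUE's schema; the point is that the reason and the checks travel with the edit.

```python
import json
from datetime import datetime, timezone

# Illustrative audit record: not just the before and after, but why the
# change was made, which checks it passed, and who still has to sign off.
record = {
    "document": "brochure_v3.pdf",  # hypothetical file name
    "original": "Withdrawals above £1,000 may incur a 2.5% charge.",
    "rewrite": "If you take out more than £1,000, a 2.5% charge may apply.",
    "reason": "Nominalised sentence rewritten in direct second person.",
    "checks": {"anchors_preserved": True, "readability_delta": 12.3},
    "reviewer": "pending",  # four-eyes sign-off still outstanding
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(record, indent=2, ensure_ascii=False))
```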

That, in truth, is what CUE grew out of. What started life as a simple question, “can AI help make retail communications easier to understand?”, became an eighteen-month exercise in unpicking what can, and just as importantly what cannot, be relied on in practice.

Today, what once looked like a single prompt has turned into a heavily structured process. What originally felt like one rewrite instruction now involves layers of analysis, scoring, checking, controls, and review, all designed to achieve what sounded like the same simple objective: make this easier for retail customers to understand.

Why evidence matters more than confidence

The real challenge is not getting AI to say something about text. The real challenge is creating a system that constrains what it says, checks what it says, measures what matters, preserves meaning, and leaves an audit trail strong enough to support a real-world decision.

That matters because it changes the role of AI. Instead of asking the model to be the final judge, CUE uses AI as one controlled component inside a wider evidencing framework. Letting AI mark its own homework is never enough. A regulated firm does not need unstructured editing, no matter how good it looks on the surface. It needs evidence it can stand behind.

And under Consumer Duty, that need is becoming harder to ignore. The FCA’s customer understanding outcome makes clear that firms should ensure communications are likely to be understood by customers and equip them to make effective, timely, and properly informed decisions. That is a long way from “I asked ChatGPT and the answer looked good.”

The deeper lesson for firms is this. AI can absolutely help judge whether text is easy to understand, but only if you stop expecting magic and start building discipline around it.

That is not a criticism of AI. It is simply what serious use of AI looks like in a regulated environment.

The irony, perhaps, is that the original ambition has not really changed. The goal is still the same one many of us had when we first opened a GPT window and uploaded a client document: make this easier for retail customers to understand.

What has changed is our understanding of what it takes to do that properly.