The Next International Search Challenge Isn't Translation. It's Representation.

Working in international SEO, you become accustomed to thinking in terms of scale.

How do you make content discoverable across dozens of markets, languages, cultures and search behaviours?

For years, the challenge was relatively clear.

Could search engines crawl our content?

Could they understand it?

Could they connect the right audience with the right information in the right market?

The mechanics evolved, but the objective remained remarkably consistent: visibility.

Then generative AI arrived.

And at first glance, it appeared to solve one of the biggest international challenges of all.

Translation.

Today, frontier AI models can localise content between English, Japanese, Arabic, Korean and Chinese at a level that would have felt extraordinary only a few years ago. The linguistic gap has narrowed dramatically.^[1]

Most people would look at that progress and conclude that multilingual AI is largely a solved problem.

I'm not convinced.

The more I think about how AI systems retrieve, synthesise and surface information, the more I wonder whether we've solved the wrong challenge.

Because translation and representation are not the same thing.

A customer doesn't care whether an AI model can translate a Japanese article into English.

What matters is whether the knowledge contained within that article is retrieved in the first place.

That distinction may become one of the most important questions facing global organisations over the next decade.

Historically, search engines indexed pages and returned links.

Increasingly, AI systems retrieve passages, rank information, synthesise knowledge and generate answers.^[2]

In that world, discoverability is no longer just about whether content can be found.

It's about whether content becomes part of the answer.

And that's where things get interesting.

Languages don't simply express the same ideas using different words.

They often package information differently.

A Japanese document may communicate information with a different level of linguistic density than its English equivalent. The verb form 検索されていなかった (kensaku sarete-inakatta, "had not been searched") packs passive voice, negation and past tense into a single conjugated unit, which English spreads across four separate words.

Chinese technical content may structure information differently from its English counterpart. Where English leads with subject and verb, Chinese often foregrounds the topic: 这个关键词，竞争对手的排名更高 (zhège guānjiàncí, jìngzhēng duìshǒu de páimíng gèng gāo), literally "this keyword, competitor rankings are higher". That topic-first ordering may change which concept enters the retrieval window first.

Arabic can encode nuance through morphology and linguistic patterns that don't map neatly to English. From the single root ك-ت-ب (k-t-b), Arabic derives كِتَاب (kitāb, "book"), كَاتِب (kātib, "writer"), مَكْتَبَة (maktaba, "library") and مَكْتُوب (maktūb, "written"). The shared root is visible in every form. English spreads the same concepts across "book", "writer", "library" and "written", and only two of those four visibly share a stem.
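To make the root pattern concrete, here is a minimal Python sketch. It strips the short-vowel marks from each of the four Arabic words and checks whether the letters of the root appear in order. The function names and the subsequence check are my own illustrative assumptions, not a real morphological analyser; actual Arabic morphology has many cases this naive check would get wrong.

```python
import unicodedata

ROOT = "كتب"  # the three root letters k-t-b, in order


def strip_diacritics(word: str) -> str:
    """Drop combining marks (Unicode category Mn), e.g. Arabic short-vowel signs."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def contains_root(word: str, root: str = ROOT) -> bool:
    """True if the root letters occur in order, not necessarily adjacent.

    Illustrative assumption: a plain subsequence test, not real morphology.
    """
    letters = iter(strip_diacritics(word))
    # `in` on an iterator consumes it, so each letter must follow the previous one.
    return all(letter in letters for letter in root)


# ASCII keys are transliterations without macrons.
words = {
    "kitab": "كِتَاب",      # book
    "katib": "كَاتِب",      # writer
    "maktaba": "مَكْتَبَة",  # library
    "maktub": "مَكْتُوب",   # written
}

for name, word in words.items():
    print(name, contains_root(word))

# English comparison: the consonants of "write" do not surface in "book".
print("book", contains_root("book", "wrt"))
```

The same surface test that succeeds for every Arabic form fails for the English pair, which is the gap the paragraph above describes.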

Humans navigate these differences naturally.

AI systems do not.

They process information through tokenisation, embeddings, retrieval layers and ranking systems that operate very differently from human cognition.^[3]
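One small way to see the difference is segmentation. The sketch below splits text on whitespace, which is a deliberately naive stand-in for a tokeniser (my assumption for illustration; production systems use subword methods instead). It shows why a rule that works for English says almost nothing about Japanese, which is written without spaces between words.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Naive segmentation: split on whitespace only (illustrative assumption)."""
    return text.split()


japanese = "検索されていなかった"
english = "had not been searched"

# The English gloss separates cleanly into words.
print(len(whitespace_tokens(english)))   # 4

# The Japanese form has no internal spaces, so it comes back as one unit.
print(len(whitespace_tokens(japanese)))  # 1

# Counting characters instead gives a different picture again.
print(len(japanese))                     # 10
```

None of these counts is the "right" one. The point is only that the unit of measurement has to be chosen, and each choice treats languages differently.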

To be clear, I'm not suggesting that today's frontier AI systems fail in multilingual environments. Quite the opposite. Their performance is remarkable.

The question is whether all languages are represented equally within the retrieval systems that increasingly mediate access to information.

This question emerged from a rabbit hole I found myself exploring recently.

Like many people working closely with AI systems, I became curious about tokenisation and retrieval. In particular, I started wondering whether information-dense languages such as Japanese, Chinese and Arabic interact differently with modern retrieval systems than English does.^[4]

The deeper I looked, the more I found myself moving away from tokenisation as the primary concern.

The more interesting question may be information density.^[5]

If two passages consume the same retrieval budget, do they necessarily contribute the same amount of knowledge?
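A rough sketch of that question, under a loudly stated assumption: I use UTF-8 byte length as a crude proxy for retrieval cost. Real tokenisers merge bytes into larger units, so actual token counts will differ, and the `pack` helper and the 64-byte budget are hypothetical. The sketch only shows that the same meaning can cost different amounts, and that a fixed budget then admits a different number of passages.

```python
def utf8_cost(text: str) -> int:
    """Cost proxy: number of UTF-8 bytes (an assumption, not a real token count)."""
    return len(text.encode("utf-8"))


def pack(passages: list[str], budget: int) -> list[str]:
    """Greedily keep passages, in order, while their total cost fits the budget."""
    kept: list[str] = []
    used = 0
    for passage in passages:
        cost = utf8_cost(passage)
        if used + cost > budget:
            break
        kept.append(passage)
        used += cost
    return kept


japanese = "検索されていなかった"
english = "had not been searched"

# Same meaning, different cost under this proxy.
print(utf8_cost(english))   # 21
print(utf8_cost(japanese))  # 30

# With a hypothetical 64-byte budget, fewer Japanese passages fit.
budget = 64
print(len(pack([english] * 5, budget)))   # 3
print(len(pack([japanese] * 5, budget)))  # 2
```

Measured in characters, the Japanese form is the shorter one; measured in bytes, it is the more expensive one. Which of those a retrieval system effectively charges for is exactly the open question.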

And if retrieval systems increasingly determine which information becomes part of an AI-generated answer, what does that mean for organisations operating across multiple languages and regions?

I don't believe we have definitive answers yet.

But I do think we're approaching a moment where international search professionals, content leaders and AI practitioners need to expand the conversation.

For decades, international search strategy focused on ensuring content could be found.

The next chapter may focus on ensuring knowledge can be represented.

That is a very different challenge.

Historically, SEO competed for visibility.

Increasingly, AI systems mediate access to knowledge itself.

Search optimised for pages.

AI optimises for answers.

And if answers become the dominant interface between people and information, then representation may become just as important as visibility.

That raises a question I suspect many global organisations have not yet asked:

As AI becomes the gateway to information, will every language have an equal opportunity to contribute to the answer?

I'm not sure we know the answer yet.

But I increasingly believe it's one of the most important international search questions of the AI era.

[1] Kudo, T. & Richardson, J. (2018). SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. Proceedings of EMNLP 2018.
[2] Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. This work helped establish retrieval-plus-generation architectures that underpin many modern AI systems.
[3] Research across multilingual natural language processing consistently demonstrates that languages differ significantly in morphology, orthography, segmentation and information encoding, creating distinct computational challenges across writing systems.
[4] Recent multilingual NLP research has highlighted that different writing systems can experience substantially different tokenisation efficiency, affecting context utilisation, computational cost and representation across languages.
[5] Anthropic. (2024). Mapping the Mind of a Large Language Model. Anthropic Research. This interpretability research provides evidence that large language models may internally represent concepts in ways that are not strictly tied to individual words or tokens.