Schema, LLMs & The Low Bar For ‘Evidence’ In GEO

I built a fake company with nonsense schema. The LLMs returned the address anyway. That is not the win the GEO industry thinks it is.

TL;DR: I ran a small experiment to find out whether large language models actually parse schema markup or are just nodding politely in its direction. I put a fake company address (inside beautifully invalid JSON-LD, on a page about ducks) into the head of an HTML document, mentioned no address anywhere in the visible text, and then asked various LLMs where the company was based. They happily told me, several of them citing the “structured data” they had so studiously consulted.

The experiment was then picked up by Search Engine Roundtable, at which point British sarcasm met the LinkedIn carousel, the two annihilated each other in a small puff of smoke, and a chunk of the GEO community came away convinced I had just proved that LLMs are lovingly parsing schema exactly as Schema.org intended.

An AI search engine response demonstrating that Large Language Models read structured schema data. The top section shows a user prompt asking for a company's address from a specific URL. The AI correctly extracts a fictional address ("77 The Muddy Bank, South Pondshire..."). The bottom section shows the source code of the "schema" it read: a humorous, duck-themed JSON-LD script containing custom keys like waddleStyle: "Aggressive", reedNumber: "77", and quackVolume: "Loud". A cartoon duck points down at the code with a shocked expression.
The guilty LinkedIn post that was patient zero of schema confusion. Image Credit: Mark Williams-Cook

I had arguably proved the opposite. The schema was deliberately broken. The LLMs returned the data anyway, because as far as they were concerned, the JSON-LD was simply more text on the page, lightly garnished with curly braces. That distinction is the whole point, because a growing cohort of “GEO experts” is pointing at “the LLM returned information that was only in the schema” as cast-iron proof that LLMs are using schema as designed. They are doing nothing of the sort. They are reading the HTML and shrugging at the structure.

I am not claiming that schema is worthless. I think you should still use it. But the way it is currently being sold to clients (as a magical injection of LLM citations) is propped up on a remarkably thin pile of evidence, and I want to walk through why.

A Quick Refresher On What Schema Is Actually For

Schema, or Schema.org structured data, is a collaborative vocabulary built by Google, Microsoft, Yahoo, and Yandex to let webmasters embed machine-readable information on their pages. The clue is in the name. It is a schema. A shared, agreed structure that lets a machine know that “Mark Williams-Cook” is a Person, that he works at an Organization called “Candour,” and that the string “01603 957068” sitting in his profile is a telephoneNumber and not, for instance, my weight in grams.

Google’s official documentation puts it about as plainly as Google ever puts anything:

“Structured data is a standardized format for providing information about a page and classifying the page content.” Google also says it uses structured data “to understand the content of the page, as well as to gather information about the web and the world in general, such as information about the people, books, or companies that are included in the markup.”

The whole point of schema is to remove ambiguity. Natural language is messy. “Apple” is a fruit, a company, a record label, and probably the surname of someone’s gerbil. If you tell a search engine in plain English that you sell Apple, it has to guess. If you tell it in schema that you sell an Organization called “Apple Inc.” with sameAs linking to Apple’s Wikipedia page, that ambiguity collapses to nothing. That is the job. Disambiguation. Explicit clues. Machine-resolvable identity. It is, basically, a polite contract between you and a machine saying, “Let’s both agree what this word means, just this once.”
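As a rough illustration of that contract, here is a minimal sketch of what the disambiguated version might look like when generated from Python. The helper name to_json_ld_script is made up for this example, and the values are illustrative rather than anything Apple actually publishes.

```python
import json

# Illustrative values only. "sameAs" points at an external identifier for the
# entity, which is what lets a machine tell this Apple from the fruit.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Apple Inc.",
    "sameAs": ["https://en.wikipedia.org/wiki/Apple_Inc."],
}


def to_json_ld_script(data: dict) -> str:
    """Wrap a dict in the script tag that carries JSON-LD in an HTML page.

    Hypothetical helper written for this example.
    """
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


snippet = to_json_ld_script(organization)
print(snippet)
```

The point of the sketch is only that the type and the identity are stated explicitly, rather than left for the machine to infer from prose.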

Where does the ambiguity actually get resolved? In Google’s case, into the Knowledge Graph, the giant entity-and-relationships database that powers knowledge panels, “people also ask,” entity carousels, and a hundred other surfaces. Schema is one of the inputs. It is not the only input, and it has never been the only input. But it is a clean, explicit, low-noise one, which is why search engines like it.

Right. That is what schema does for search engines. Now to LLMs, which are a different animal in nearly every way that matters.

Where, Exactly, Would An LLM Even Use Schema?

There are two camps in the LLM/schema debate, and most arguments collapse into one of them.

Camp 1: Schema is hoovered up during the training of the model and ends up “baked in” somehow.

Camp 2: Schema is read at the moment the LLM live-fetches a page (during retrieval at query time, or via crawls that feed retrieval).

Let’s take them in turn, with appropriate skepticism.

Camp 1: Schema Gets Into Training Data

I have written about this before, and it was covered by Search Engine Roundtable last year. The short version is that this is the most popular theory and also the one with the weakest mechanical case behind it. There are two problems, and neither of them is small.

Problem 1: Schema Is Almost Certainly Stripped Before Training

If you have not gone down the rabbit hole of how base LLMs are actually made, Andrej Karpathy’s three and a half hour deep dive on LLM pre-training is the canonical reference, and yes, three and a half hours is the deal.

Pre-training pipelines do a lot of unglamorous cleaning work before a single GPU sees the data: URL filtering, language filtering, deduplication, removal of personally identifiable information, and crucially, stripping out HTML and boilerplate. The goal is not to preserve the page. The goal is to extract clean prose that helps the model build a useful probability distribution over language. The more noise (markup, navigation, footers, scripts, JSON-LD, your cookie consent banner) you leave in, the worse the resulting model. So they don’t.

The widely used FineWeb dataset (15 trillion tokens, derived from 96 Common Crawl snapshots) is refreshingly explicit. Their pipeline extracts text from the WARC files using trafilatura, a library specifically chosen because it produces “the main page text” with “less boilerplate and menu text” than the alternatives. The data card states: “We then extracted the main page text from the HTML of each webpage, filtered each sample and deduplicated each individual CommonCrawl dump/crawl.” JSON-LD lives in a `<script>` tag. Trafilatura is, by design, deeply uninterested in `<script>` tags. The unavoidable inference is that JSON-LD does not make it into the training corpus at all. It is binned with the analytics snippets, where it has been keeping good company.
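To make the mechanics concrete, here is a toy text extractor built on Python's standard library. It is emphatically not trafilatura and not FineWeb's pipeline, just a sketch under the assumption that a main-text extractor discards anything inside script and style elements. The class name, function name, and sample page are all invented for the example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Toy extractor: keeps text nodes, drops anything inside script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than zero while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script contents arrive here as plain data, so they must be skipped explicitly.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page: the postcode exists only inside the JSON-LD block.
PAGE = """<html><head>
<title>Duck T-shirts</title>
<script type="application/ld+json">{"@type": "Organization", "postalCode": "DK99 YEA"}</script>
</head><body><h1>Duck T-shirts</h1><p>We sell shirts with ducks on them.</p></body></html>"""

text = extract_text(PAGE)
print(text)
print("DK99 YEA" in text)  # the postcode never reaches the extracted text
```

Any extractor that behaves roughly like this would hand the training pipeline the prose about ducks and nothing from the structured data.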

You might reasonably ask: then how can ChatGPT write schema markup for me when I ask it? Because there are millions of examples of schema in visible prose across the web. Tutorials. Documentation. Forum posts. GitHub repos and Stack Overflow answers. Code blocks in blog posts. The model learns what schema looks like the same way it learns what a Python function looks like, by reading endless explanations of it, written by humans, in paragraphs. The schema on your actual product page, sitting silently in the head of the document, doing its proper job, gets thrown straight out.

Problem 2: Even If It Survived, It Would Not Work The Way You Think

Let’s be generous and stipulate that some non-trivial amount of raw schema does sneak into a model’s training data. We do not actually have full transparency from the frontier labs about what they ingest, and the courts have not exactly been kind on this point. Meta’s training pipeline is currently being picked apart for allegedly using LibGen, a pirate library of around 7.5 million copyrighted books. If the frontier labs are happy to swallow other people’s novels whole, they are probably not above swallowing the odd <script type="application/ld+json"> along the way.

Even if this were the case and our precious JSON-LD schema made it into the training data, it would not be unscathed.

Here’s the catch: The model does not memorize pages. It does not have a little filing cabinet labeled “Candour Agency Ltd” with the address tucked inside. What actually happens is this:

  1. All the text in the training corpus gets chopped into tokens (chunks of characters, often parts of words).
  2. The model is shown billions of small windows of tokens and asked to predict the next one.
  3. Each time it gets it wrong, billions of tiny numerical weights inside the network are nudged so it would do slightly better next time.
  4. After enough nudging, those weights collectively encode a (lossy, blurry, statistical) impression of which tokens tend to follow which other tokens, in what contexts.

That is what is stored. Weights. Not facts. Not addresses. Not your postalCode. A glorified probability distribution that has read a great deal and remembers, with the same fidelity as someone trying to recall the lyrics to a song they last heard in 2011, which words usually follow which other words.

A screenshot of the OpenAI Platform Tokenizer tool on a dark interface, showing how a JSON-LD structured data script is broken down into individual tokens. At the top left, the counter displays "Tokens: 337" and "Characters: 1187". The code block below contains a script tag with type application/ld+json detailing an Organization schema for "NovaTech Solutions", with individual text chunks highlighted in alternating background colors to represent tokenization.
Your beautiful schema, being Dahmerfied. Image Credit: Mark Williams-Cook

This is where schema specifically falls apart. The whole point of schema was to take a string like “77 The Muddy Bank” and tag it explicitly as a streetAddress belonging to a PostalAddress belonging to your Organization, so a machine cannot mistake it for anything else. When that JSON-LD is tokenized, the structure dissolves. The string “@type”: “Organization” becomes a sequence of tokens including @, type, :, Organization, completely indistinguishable, to the model, from the same word soup appearing in any blog post about schema. The disambiguation, which was the entire reason for using schema in the first place, is the very first thing thrown out by the very first stage of training. Marvellous.
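A small sketch of that dissolution, using a deliberately crude stand-in tokenizer (words and single punctuation marks). Real LLM tokenizers use learned subword vocabularies and will split text differently; the function names and sample strings here are assumptions made for illustration.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Crude stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def contains_run(haystack: list[str], needle: list[str]) -> bool:
    """True if needle appears as a contiguous run inside haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


# The same declaration, once as real markup and once quoted in a tutorial sentence.
json_ld = '{"@type": "Organization", "name": "Candour"}'
blog_prose = 'In JSON-LD you write "@type": "Organization" to declare the entity.'

json_tokens = toy_tokenize(json_ld)
prose_tokens = toy_tokenize(blog_prose)
marker = toy_tokenize('"@type": "Organization"')

print(json_tokens)  # a flat list: the nesting and typing are gone
print(contains_run(json_tokens, marker), contains_run(prose_tokens, marker))
```

Once flattened, the declaration in the markup and the declaration quoted in the tutorial are the same run of tokens. Nothing in the sequence records that one of them was meant as a machine-readable assertion.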

Worse still, an LLM only “recalls” a fact if it has seen it many, many times. A single mention of your address on a single product page is a vanishingly small drop in a fifteen-trillion-token bucket. Even if it survived ingestion, you would also need the model to encounter your streetAddress enough times that those particular weights actually settle into a useful pattern. For >99.99% of businesses, that does not happen. The fact is not stored. It will not be recalled. You are paying a consultant to whisper your postcode into a hurricane.

So, if you are buying the “schema gets baked into the model” theory, you are buying improbabilities in a trench coat: that it survives pre-training cleaning, that it survives tokenization with its structure intact, and that it gets repeated often enough across the web for the model to actually “learn” it. None of the three is obviously true.

Camp 2: Schema Gets Read At Query Time

In my experience, LLM/schema proponents rarely want to keep discussing training data once that theory has been gently set on fire. The argument tends to move quickly on to the possibility that schema is not in the model itself, but is read at the moment a user asks a question, when the LLM fetches the page in real time. Let’s examine the three flavors of this argument, in increasing order of both confidence and distressing inaccuracy.

Flavor 1: “Schema Feeds The Knowledge Graph”

Google’s Knowledge Graph is a vast, curated, slow-moving database of entities and relationships. It is fed by structured data, Wikipedia, Wikidata, legacy Freebase data, and a hundred other signals. It is built and updated by Google’s pipelines on Google’s schedule. It is not assembled on the fly when someone types a question, no matter how briskly they type.

The notion that an LLM “builds a knowledge graph in real time when pages are fetched” sounds a lot less reasonable when you say it out loud into the mirror. Knowledge graphs are constructed entities. They have IDs. They have relationship cardinality rules. They have to be reconciled against existing entries, so you do not end up with three drifting “Apple Inc.” nodes filing different tax returns. None of that happens between a user pressing enter and the answer appearing on screen. It cannot. There is not enough time, and there is no infrastructure exposed in the chatbot product to do it.

So if an entity-resolution pipeline exists at any of the frontier labs, it is being built upstream, on a similar cadence to Google’s, and not during your conversation. Which is fine, but it does not match the breathless claim that “your schema feeds the LLM’s brain”. Conceptually, the strongest version is closer to “your schema may eventually feed a curated database that the LLM might one day consult”. Which is a much weaker claim, and one for which there is, at present, no public evidence whatsoever.

Flavor 2: “Microsoft Confirmed Schema Feeds Copilot”

This one has been misquoted on an industrial scale. Search Engine Land’s write-up ran under the headline “Microsoft Bing/Copilot use schema for its LLMs,” and reported Fabrice Canel of Microsoft as having “confirmed” that schema markup helps Microsoft’s LLMs. Cue half of LinkedIn pasting the headline as proof, often without troubling the body copy.

If you read the actual quote, it is about IndexNow:

“Gen AIs value fresh content in particular, partly as a reference check of their LLM training data. Use the API at indexnow.org to push that information as it’s published or updated.”
~ Fabrice Canel

It is “your page changed, here is its new state, please come look”. Fabrice was making a point about freshness (telling search engines when your content has changed so they can update their understanding) and not a point about JSON-LD being deferentially parsed by GPT-flavored systems. Conflating the two is a textbook example of the industry’s favorite parlor trick: Take a careful claim about one thing, sand the edges off it, and resell it as a bold claim about something else entirely.

Flavor 3: “LLMs Return Information That Was Only In The Schema, Therefore They Use Schema”

This is the one that prompted the experiment. It is also the single most-cited piece of “evidence” in GEO LinkedIn posts, and the most easily falsified once you spend half an afternoon thinking about it.

I built a deliberately silly test page about a fictional duck T-shirt company called DUCK YEA at i83.uk/duckyea.html. The visible content of the page mentions no address. Tucked into the head of the HTML, inside a <script type="application/ld+json"> tag, sat the following:

{
  "@context": "http://api.the-great-pond.net/schema",
  "@type": "MallardEnterprise",
  "flockName": "DUCK YEA T-SHIRTS",
  "waddleStyle": "Aggressive",
  "nestingGrounds": {
    "@type": "LilyPadAddress",
    "reedNumber": "77",
    "puddle": "The Muddy Bank",
    "region": "South Pondshire",
    "featherCode": "DK99 YEA",
    "country": "United Queendom"
  },
  "migrationPattern": "Non-Migratory",
  "quackVolume": "Loud"
}

A few things to notice. The @context is a made-up URL that does not resolve to anything (the great pond, sadly, has no API). The @type is not a valid Schema.org type. Not a single one of the properties (flockName, waddleStyle, nestingGrounds, reedNumber, puddle, featherCode, quackVolume) exists in the Schema.org vocabulary. The JSON is syntactically valid JSON, but as far as Schema.org is concerned, this is unmitigated nonsense, the digital equivalent of someone speaking French very loudly while only knowing the words for “cheese” and “weasel”. A well-behaved schema-aware parser should look at this, sigh, and ignore it.
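For a sense of what "sigh and ignore it" could look like in practice, here is a minimal sketch of a vocabulary check. The sets of known types and properties are a small hand-picked subset of real Schema.org terms, not the full vocabulary, and the audit function is a hypothetical helper written for this example rather than any real validator's API.

```python
import json

# Hand-picked subset of real Schema.org terms, for illustration only.
SCHEMA_CONTEXTS = {"https://schema.org", "http://schema.org"}
KNOWN_TYPES = {"Organization", "PostalAddress"}
KNOWN_PROPERTIES = {
    "name", "address", "streetAddress", "addressLocality", "addressRegion",
    "postalCode", "addressCountry", "telephone", "url", "sameAs",
}


def audit(node: dict, path: str = "$") -> list[str]:
    """Return a list of complaints about contexts, types, and properties not in the subset."""
    problems = []
    for key, value in node.items():
        here = f"{path}.{key}"
        if key == "@context":
            if str(value).rstrip("/") not in SCHEMA_CONTEXTS:
                problems.append(f"{here}: unrecognised context {value!r}")
        elif key == "@type":
            if not isinstance(value, str) or value not in KNOWN_TYPES:
                problems.append(f"{here}: unrecognised type {value!r}")
        elif not key.startswith("@") and key not in KNOWN_PROPERTIES:
            problems.append(f"{here}: unrecognised property")
        if isinstance(value, dict):
            problems.extend(audit(value, here))  # walk nested objects too
    return problems


DUCK = json.loads("""
{
  "@context": "http://api.the-great-pond.net/schema",
  "@type": "MallardEnterprise",
  "flockName": "DUCK YEA T-SHIRTS",
  "waddleStyle": "Aggressive",
  "nestingGrounds": {
    "@type": "LilyPadAddress",
    "reedNumber": "77",
    "puddle": "The Muddy Bank",
    "region": "South Pondshire",
    "featherCode": "DK99 YEA",
    "country": "United Queendom"
  },
  "migrationPattern": "Non-Migratory",
  "quackVolume": "Loud"
}
""")

duck_problems = audit(DUCK)
for problem in duck_problems:
    print(problem)
```

The JSON loads without complaint, which is the syntactic validity mentioned above. Every key and type then fails the vocabulary check, which is the part a schema-aware consumer would be expected to care about.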

I then asked ChatGPT and Perplexity, “what is the address of this company?”, pointing at the URL.

Both happily returned: Reed Number 77, The Muddy Bank, South Pondshire, DK99 YEA, United Queendom.

Perplexity even helpfully volunteered that it had found the answer “in the page’s embedded structured data,” with the satisfied air of a student who had clearly read the prescribed material. Neither of them flinched at the fact that none of the schema was real, because (and this is the entire point of the exercise) they were not parsing it as schema. They were doing what LLMs always do: Reading the visible-ish text of the page, picking out the bit that looked like an address, and presenting it. The JSON-LD wrapper was, to the model, just slightly weirdly punctuated prose. If I had wrapped the address in <marquee> tags and surrounded it with duck emojis, it would have made precisely no difference.

If LLMs were genuinely parsing JSON-LD with any reverence for the Schema.org vocabulary, my made-up types and properties would have been rejected, or at the very least flagged. They were not. The information was just lifted straight out of the HTML, dusted off, and served up with confidence. Quack. 🦆

In the interest of not committing the exact sin I am accusing the GEO crowd of: the duck experiment proves that LLMs returned content from a JSON-LD block with a made-up @context, a made-up @type, and no real Schema.org properties. What it does not, on its own, prove is that LLMs ignore schema entirely. A system that consulted schema and fell back to text extraction would produce the same answer here.
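That confound can be sketched in a few lines. Both pipelines below are entirely hypothetical stand-ins (nobody outside the labs knows what the real systems do), and the regex "reader" is a crude substitute for a model picking out something address-shaped. The aim is only to show that the two designs cannot be told apart from their output on this page.

```python
import json
import re

SCRIPT_RE = re.compile(r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL)
CODE_RE = re.compile(r'"\w*[Cc]ode"\s*:\s*"([^"]+)"')  # any key ending in "code"


def text_only(html: str):
    """Hypothetical pipeline A: treat the whole document as text and grab what looks right."""
    match = CODE_RE.search(html)
    return match.group(1) if match else None


def schema_then_fallback(html: str):
    """Hypothetical pipeline B: trust valid Schema.org first, otherwise fall back to A."""
    block = SCRIPT_RE.search(html)
    if block:
        try:
            data = json.loads(block.group(1))
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict):
            context = str(data.get("@context", "")).rstrip("/")
            address = data.get("address")
            if context in {"https://schema.org", "http://schema.org"} and isinstance(address, dict):
                if "postalCode" in address:
                    return address["postalCode"]
    return text_only(html)


# Cut-down version of the duck page: nonsense vocabulary, no address in the body.
DUCK_PAGE = """<html><head><script type="application/ld+json">
{"@context": "http://api.the-great-pond.net/schema", "@type": "MallardEnterprise",
 "nestingGrounds": {"@type": "LilyPadAddress", "puddle": "The Muddy Bank", "featherCode": "DK99 YEA"}}
</script></head><body><p>Shirts with ducks on them.</p></body></html>"""

print(text_only(DUCK_PAGE), schema_then_fallback(DUCK_PAGE))
```

Both functions return the same string for the duck page, so the returned address alone cannot tell you which kind of system produced it. It rules out a consumer that rejects invalid vocabulary, and nothing more.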

If you run the same query today, you get a slightly different result:

A screenshot of a chat interface showing a user prompt and an AI's response on a dark background. The user's text bubble reads: "can you tell me the address of this company? what is the address of this company? https://markwilliamscook.com/duckyea.html" The AI's response text reads: "The website you linked is a joke/test page created by SEO expert Mark Williams-Cook as an experiment to test how Large Language Models (LLMs) and search engines parse structured data. While there is no physical address visible on the webpage itself, hidden inside the page's source code (schema markup) is a fictional address: Reed Number 77, The Muddy Bank, South Pondshire, DK99 YEA, United Queendom"
LLMs now get the answer ‘correct’, without ever knowing it was wrong (Image Credit: Mark Williams-Cook)

The model now (correctly) flags that this is a test page made by some SEO bloke, charmingly demonstrating the AI Convergence Problem doing its thing in real time: Enough people have written about the experiment that “DUCK YEA is a joke page by Mark Williams-Cook” is now getting pulled during RAG, and the consensus answer has overwritten what would otherwise be a clean test. The address is still being read from the HTML, schema validity be damned. The model has just learned to caveat it. Which is, in a small and slightly bleak way, progress.

Conjecture: Could LLMs Be Using Schema, Somehow, Somewhere?

The honest answer is that we do not know what is happening upstream at OpenAI, Anthropic, Google DeepMind, xAI, and the rest, because they are not telling. Google itself is a sprawl of separate systems (the index, re-rankers, glue, the knowledge graph, AI Overviews, AI Mode) which all work together to produce what looks, from the outside, like a single coherent answer, and on a good day, actually is one. There is no reason in principle why an LLM provider could not run an entity-extraction pipeline against the web, build its own entity store, and consult it at answer-generation time. That is conceptually adjacent to how retrieval-augmented generation (RAG) works, and it is the kind of thing you would absolutely build if you were OpenAI and you wanted to stop your model confidently inventing the wrong CEO.

If they are doing that, schema is an excellent and obvious input. It is explicit, structured, low-noise, and already widely deployed. It would be daft for them not to use it.

But here is the big “but.” We have no published evidence, no leaked papers, no public confirmation, and no behavioral test results that any frontier LLM is actually doing this yet. Reasoning forward from “they probably should” to “therefore schema is worth £20k of consultancy this quarter” is exactly the kind of fact-light, vibe-heavy thinking that the discourse needs less of. Make the case, by all means. But label it conjecture, not evidence. Use a different font.

Google Still Hasn’t Solved This Problem Reliably

There is also a slightly awkward elephant standing quietly in the corner of the room. If anyone on earth were going to crack the “feed an entity-resolved knowledge graph into an LLM’s answer pipeline” problem first, it would surely be Google. It has over a decade’s head start on entity extraction. It has the Knowledge Graph. It has Google Business Profile, which is a user-edited, structured, ostensibly authoritative database of business information. It owns the model (Gemini). It owns the surface (AI Overviews). It owns the search index that wraps around it. Every page on the planet eventually walks past one of its crawlers. If joining structured business data to LLM output is supposed to be the obvious next step in the human story, Google has every conceivable advantage in being the one to demonstrate it.

And yet:

A Google Search results page displaying a prominent conflict between an AI Overview and the Google Business Profile listing below it. At the top, the AI Overview states: "The Mazda Dover UK dealership, specifically Perrys Dover, is not closed. It is still operating..." and lists its address and operating hours. Directly below the search results on the bottom right, the Google Business Profile card for "Perrys Dover Mazda" features a photo of the dealership, a map location, and a bright red banner at the bottom that explicitly states: "Permanently closed".
Google contradicting itself in spectacular fashion. Image Credit: Mark Williams-Cook

That is a single Google search result page. On the left, Google’s AI Overview confidently asserts that Perrys Dover Mazda is “not closed,” lists the address, and helpfully provides opening hours, presumably so you can pop down and have a look at the cars that are no longer there. On the right, on the same page, the Google Business Profile knowledge panel for the exact same business is labeled “Permanently closed” in a large, unambiguous red banner. Google Business Profile data is structured. It is user-edited. It is the closest thing Google has to a verifiable, authoritative source on whether a business is, in fact, open. And the AI Overview, generated on the same SERP, by the same company, in the same session, is not consulting it. They are two organs of the same body that have not been on speaking terms for some time.

If the company with the longest possible head start, the most structured data, the most obvious commercial incentive, and full vertical integration over every part of the stack cannot reliably wire its own business-hours database into its own AI answers, the idea that OpenAI or Anthropic has quietly built a richer entity pipeline that does defer to your Organization schema is, let us say, optimistic.

So … Should You Still Use Schema?

Yes. Just for the right reasons and the right price.

Schema is, in the grand scheme, still a stopgap. It exists because the technology cannot yet reliably read human language without ambiguity, and structured data is how we paper over the gap while the engineers work out how to read English properly. Gary Illyes from Google, speaking at an SEOFOMO meetup in 2025, pointed out (paraphrasing) that it would be lovely if Google did not have to rely on schema at all, because in an ideal world, the systems would simply understand the page. Schema buys you a bit of certainty in the meantime, which is worth something even if it is not worth the consultancy invoice you may have been quoted.

The recent Ahrefs study, which tracked 1,885 cited pages that newly added JSON-LD and matched them against 4,000 controls, found that schema had essentially no effect on AI citations across ChatGPT, AI Mode, and AI Overviews. That sounds damning, and a number of LinkedIn carousels are already enjoying themselves accordingly. But as Gianluca Fiorelli pointed out in his excellent critique, the study tested pages that were already being cited heavily by AI (every page in the dataset had 100+ AI Overview citations before treatment). That is the worst possible population to test schema on, because these are already strong, well-understood entities. Schema’s job is to disambiguate. If the system can already resolve who you are with high confidence, adding Organization schema is solving a problem the page does not have. You don’t introduce yourself by name to your own mother.

The interesting case, and the one nobody has properly tested, is the new and challenger brands, where the entity footprint across the web is thin, and the system cannot yet confidently say “this company is the company you mean.” For those, schema is infrastructure. It is how you become a resolvable node in the graph in the first place. It does not buy you a citation today. It earns you the right to be one of the candidates tomorrow, which, in a world where being a candidate is suddenly the only game in town, is no small thing.

Takeaways

A few practical thoughts, dressed down for tactical use:

  • Still use schema. The implementation cost is low, the downside is essentially nil, and the upside is cumulative. If schema does end up being meaningfully ingested at any stage of the LLM stack (and it might), the work is already done, and you can be smug about it. Free smugness is the best kind.
  • Stop selling schema as a magic LLM citation lever. The current public evidence for LLMs using schema “as intended” at query time is, frankly, weak. Anyone telling a client otherwise should be politely asked to show their working, in front of other people, with a whiteboard.
  • Be ruthless about the bar of evidence. “An LLM returned a fact that appears in the schema” is not evidence the schema was used. The same fact almost always appears in the HTML, the metadata, the page title, the social card, or somewhere a token predictor would gleefully pick it up. The duck experiment matters precisely because the schema was invalid and the LLMs returned the answer anyway. If your “proof” survives that test, talk to me. If it doesn’t, please stop putting it on slides.
  • Focus schema investment where disambiguation actually matters. New brands. Brands with name collisions. Organizations without a knowledge panel. Personal entities that overlap with other people who share their name and have been more famous for longer. That is where the asymmetric upside lives.
  • Treat “GEO best practice” the way you would treat any other new SEO orthodoxy. Skeptically, with experiments, and with a willingness to revise the position when the evidence changes. The car-wash-grade reasoning on LLMs, where the popular answer just gets repeated until it sounds true, is alive and thriving in our industry too.

Schema is a useful, low-cost, long-lived bet. It is also not the thing that is going to single-handedly drag your brand into ChatGPT’s answer set. Use it. Just do not oversell it. And for the love of god, before you build a deck around “LLMs returned the content from schema, therefore they use schema”, run the experiment with a deliberately nonsense schema first. You may be surprised what the duck tells you.
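If you want to run that experiment yourself, a starting point might look like the sketch below. It builds a test page whose only copy of a unique canary value sits inside a nonsense JSON-LD block, and checks that the value does not leak into the visible body. The function name, field names, and the example.invalid context are all made up for the example; hosting the page and querying the LLMs is left to you.

```python
import json
import secrets


def build_test_page(canary: str) -> str:
    """Return HTML where the canary appears only inside a nonsense JSON-LD block."""
    nonsense = {
        "@context": "https://example.invalid/not-a-vocabulary",  # reserved TLD, resolves to nothing
        "@type": "NotARealType",
        "madeUpLocationField": canary,
    }
    block = json.dumps(nonsense, indent=2)
    return (
        "<!doctype html>\n<html><head>\n<title>Test page</title>\n"
        f'<script type="application/ld+json">\n{block}\n</script>\n'
        "</head><body>\n<h1>Test page</h1>\n"
        "<p>No address appears in this visible text.</p>\n"
        "</body></html>\n"
    )


# A random canary makes it unlikely the value already exists anywhere else on the web.
canary = f"{secrets.randbelow(900) + 100} Canary Lane {secrets.token_hex(3).upper()}"
page = build_test_page(canary)

head, body = page.split("</head>", 1)
print(canary in head, canary in body)  # the canary should be in the head only
```

If an LLM returns the canary from a page like this, you have learned that it reads the raw HTML. You have not learned that it respects Schema.org, which is rather the point.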

This post was originally published on Mark Williams-Cook’s Substack.


Featured Image: Roman Samborskyi/Shutterstock

Mark Williams-Cook, Director at Candour

20+ years in search. Posting deep dives of SEO/AI experiments. Director at Candour, Founder of AlsoAsked.com, IntentGaps.com, QueryClassifier.com, QueryFan.com, SearchNorwich.org, ...


