What we measured, and what we found
For a quarter we pointed AI-visibility monitoring at our own site and wrote down what it said, including the parts that were not flattering. The short version: our AI-citation rate is low where it matters most, the method we used moved the number more than the underlying reality did, and the single most useful output of the exercise was learning to distrust a clean-looking figure that came from the wrong instrument. This is the first of a quarterly series, so the numbers below are a baseline, not a victory lap.
Most writing about AI visibility is advice. This is measurement. We are an agency that argues for serving clean server-rendered HTML to AI, so it was only fair to check whether the agency itself gets cited. The answer, across three snapshots in April, May and June 2026, was more interesting than a single number.
The instrument problem comes first
Before any finding, the caveat that reframes all of them. There are two common ways to measure how often an AI cites you, and they do not agree.
The cheap way is an API proxy. You send your prompts through a model’s API, read the text that comes back, and count brand mentions and links. It is repeatable and almost free, which is why it is everywhere. Its weakness is that the API path is not the product a customer uses. Published comparisons put the source overlap between API responses and the consumer web interface in the low single-digit percentages, so a proxy can tell you a model knows your name in the abstract while telling you nothing reliable about what a real user sees.
The honest way is to monitor the real consumer outputs, the same answers a person gets in the ChatGPT or Perplexity interface, including which sources the product actually surfaces. It costs more and is harder to automate. It is also the only number that corresponds to a lost or won visit.
We used both, in that order, and the gap between them is the most reusable lesson in this report.
April: the cheap proxy baseline
Our first snapshot, on 6 April, was an API-proxy run. The headline figures, across 26 queries against ChatGPT:
| Metric | Result |
|---|---|
| Brand-mention rate | 7.7 percent (2 of 26) |
| URL-citation rate | 0 percent |
| Strongest category | Plugins, 14.3 percent mention |
| Transactional category | 12.5 percent mention |
| Informational and local | 0 percent mention |
A mention rate under eight percent and a citation rate of zero reads like a disaster. The more useful detail is who got cited instead of us. When the model reached for a source on Polish WordPress work, it named directories and job boards: pracuj.pl appeared three times, alongside clutch.co, olx.pl, home.pl, nazwa.pl and a developer meetup listing. Those are not competitors who out-wrote us. They are aggregators the model trusts as generic answers to a commercial question. That pattern, the assistant defaulting to a directory rather than a specialist, turned out to be the real story, and on-page polish does not fix it.
The report itself carried the warning we want you to carry too: API proxy, directional trends only, roughly four percent source overlap with the web interface. We did not yet appreciate how much that warning mattered.
May: the instrument breaks honestly
The 11 May snapshot, also a proxy run, returned a result that looked like a bug and was actually the most honest output of the quarter. Across 20 queries, split over ChatGPT, Perplexity, Bing Copilot and Claude, the breakdown was:
| Engine | Queries | Cited | Citation rate |
|---|---|---|---|
| ChatGPT | 6 | 0 | undetermined |
| Perplexity | 6 | 0 | undetermined |
| Bing Copilot | 4 | 0 | undetermined |
| Claude | 4 | 0 | undetermined |
Every single query came back undetermined. Not “not cited”, undetermined. The proxy could not establish whether the answer was grounded in fetched pages at all, so it refused to score. A naive reading turns that into “zero percent citation rate” and a panicked Monday. The correct reading is that the instrument told us it could not see what we were asking it to measure. A measurement that returns undetermined is doing its job. A measurement that quietly returns zero in the same situation is lying to you, and plenty of AI-visibility dashboards do exactly that.
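The difference between undetermined and zero is easy to encode. Here is a minimal Python sketch, assuming each query is scored as one of three outcomes; the names `Outcome`, `citation_rate` and `format_rate` are ours for illustration and do not come from any particular monitoring tool.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    CITED = "cited"
    NOT_CITED = "not_cited"
    UNDETERMINED = "undetermined"  # the instrument could not tell


def citation_rate(outcomes: list[Outcome]) -> Optional[float]:
    """Return cited / scored, or None when no query could be scored."""
    scored = [o for o in outcomes if o is not Outcome.UNDETERMINED]
    if not scored:
        # Nothing was measurable, so there is no rate to report.
        return None
    cited = sum(1 for o in scored if o is Outcome.CITED)
    return cited / len(scored)


def format_rate(rate: Optional[float]) -> str:
    """Render a rate for a report without turning 'unknown' into 0."""
    return "undetermined" if rate is None else f"{rate:.1%}"
```

With six undetermined queries for one engine, `citation_rate` returns `None` and the report cell reads "undetermined", not "0.0%". The design choice is that undetermined results are excluded from the denominator and never counted as misses.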
June: real monitoring, and a useful split
In June we switched to monitoring the real model outputs rather than the API. The picture sharpened immediately, and it was not uniform.
For one narrow query, a Polish studio serving foreign WordPress clients, we ranked first in five of the six models tested. That is a genuine, defensible position, and it matches how we describe ourselves: a specific identity claim with little competition, exactly the kind of query a specialist should own.
For transactional WooCommerce queries, the queries closest to revenue, we were close to invisible. The models answered confidently and did not reach for us. ChatGPT was consistently the weakest channel for the brand, returning the least presence across the set. Perplexity was the strongest, which is unsurprising once you know that Perplexity leans hard on live web search rather than only on training memory. The competitors who did surface were mostly general SEO and SEM agencies marketing themselves as doing “AI SEO”, not WooCommerce specialists. The gap, in other words, is authority and association, not page quality.
What the three snapshots add up to
Read together, the quarter says three things plainly.
First, the method is part of the number. The April and May proxy runs and the June real-monitoring run were measuring the same site in the same quarter, and they disagreed enough that quoting any one figure without its method would be misleading. If a tool gives you an AI-citation rate without telling you whether it read the API or the product, distrust it.
First-party measurement is the point of this whole exercise, and it is the same discipline we apply to client work: a number you cannot reproduce is not evidence. We learned the same lesson the hard way with a synthetic-brand experiment, where a model confidently described a company that did not exist. Measuring a real brand has the opposite failure mode, confidently reporting zero when the truth is unknown, and both come back to checking the instrument before trusting the reading.
Second, identity queries are winnable and transactional queries are not, at least not on-page. We hold the narrow positioning query because it is specific and lightly contested. We lose the commercial queries because the models lean on directories and broad agencies, and no amount of cleaner HTML changes who a model already associates with “WooCommerce developer”. That is an off-page authority problem.
Third, the channel matters. A brand can be cited in Perplexity and near-absent in ChatGPT in the same week, because the two products ground answers differently. A single blended “AI visibility score” hides exactly the information you need.
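To see how a blended score hides the channel split, here is a small Python sketch. The function names are ours, and the sample rows are invented purely for illustration; they are not figures from our snapshots.

```python
from collections import defaultdict
from typing import Iterable, Optional


def rates_by_engine(rows: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """rows are (engine, cited) pairs; returns the citation rate per engine."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for engine, cited in rows:
        totals[engine] += 1
        hits[engine] += int(cited)
    return {engine: hits[engine] / totals[engine] for engine in totals}


def blended_rate(rows: Iterable[tuple[str, bool]]) -> Optional[float]:
    """One number across all engines; None if there are no rows."""
    rows = list(rows)
    if not rows:
        return None
    return sum(int(cited) for _, cited in rows) / len(rows)


# Invented sample: cited in 3 of 4 answers on one engine, 0 of 4 on another.
sample_rows = (
    [("perplexity", True)] * 3
    + [("perplexity", False)]
    + [("chatgpt", False)] * 4
)
```

On this invented sample the blended rate is 37.5 percent, while the per-engine view is 75 percent and 0 percent. The blended figure describes neither channel.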
What we changed
We did not rewrite pages in response to a proxy number, because that would be optimising for an instrument rather than a customer. Instead, the measurement changed where we spend effort.
- We fixed the query set and the cadence: a stable list of identity, informational and transactional queries, recorded monthly, with the engine and date stamped on every figure.
- We stopped comparing proxy figures with real-monitoring figures, and we now label every number with its method.
- We moved the transactional-visibility work off-page, because the gap there is association and authority, not on-page content, and that is documented in our off-page authority plan rather than in another rewrite.
- We kept serving everything in server-rendered HTML, which is the precondition for being citable at all and the subject of our note on why Western assistants read raw HTML.
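The first two changes above can be enforced in code rather than by habit. This is a sketch under our own assumptions: the `Figure` record, its field names and the method labels `"api_proxy"` and `"real_output"` are hypothetical and not taken from any tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Figure:
    metric: str        # e.g. "brand_mention_rate"
    value: float       # a fraction between 0 and 1
    engine: str        # which assistant produced the answers
    measured_on: date  # date stamped on every figure
    method: str        # hypothetical labels: "api_proxy" or "real_output"


def change(earlier: Figure, later: Figure) -> float:
    """Difference between two figures, refusing unlike-for-like pairs."""
    if earlier.method != later.method:
        raise ValueError(
            f"cannot compare {earlier.method} with {later.method} figures"
        )
    if (earlier.metric, earlier.engine) != (later.metric, later.engine):
        raise ValueError("figures differ in metric or engine")
    return later.value - earlier.value
```

Because `method` is a required field, a figure cannot be recorded without it, and `change` raises instead of quietly subtracting a proxy number from a real-monitoring one.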
How to run this yourself
You do not need our budget to start. The minimum honest setup is a fixed list of ten to twenty queries that real customers would ask, run once a month, with three columns recorded every time: the engine, the date, and whether your brand was named or your URL linked. Add a fourth column for which other domains were cited, because that tells you who you are actually competing against in the answer, which is rarely who you think.
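A plain CSV file is enough for that log. The sketch below is one possible layout, not a prescribed format: the column names, the extra `query` column and the `;` separator for cited domains are our assumptions, and the sample row is illustrative, not a record from our snapshots.

```python
import csv
import io
from collections import Counter
from datetime import date

FIELDS = ["date", "engine", "query", "brand_named", "url_linked", "other_domains"]


def write_log(rows, out):
    """Write observations as CSV; other_domains is joined with ';'."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        record = dict(row)  # copy so the caller's row is not modified
        record["date"] = record["date"].isoformat()
        record["other_domains"] = ";".join(record["other_domains"])
        writer.writerow(record)


def domain_counts(rows):
    """Count how often each other domain was cited across all rows."""
    return Counter(d for row in rows for d in row["other_domains"])


# Illustrative row only; the query text is a placeholder.
rows = [
    {
        "date": date(2026, 4, 6),
        "engine": "chatgpt",
        "query": "placeholder query text",
        "brand_named": False,
        "url_linked": False,
        "other_domains": ["pracuj.pl", "clutch.co"],
    },
]
buffer = io.StringIO()
write_log(rows, buffer)
```

Swap `buffer` for an open file to keep the log on disk. Running `domain_counts` over a few months of rows gives the list of domains that appear in the answers in your place.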
If you use an automated tool, ask it one question before you trust a single chart: are you reading the API or the product? If it cannot answer, treat the output as directional only, the way our April and May runs were. And never let a tool turn an undetermined result into a confident zero.
The honest takeaway
A quarter of measuring our own AI citations produced one uncomfortable number (our transactional citation rate is low) and one genuinely valuable habit: never quote an AI-visibility figure without the method that produced it. Read naively, the May proxy run made us look worse than reality, and the real monitoring showed a defensible identity position the proxy had missed. Both readings were useful precisely because we wrote down how each was taken. This is report one. We will publish the next snapshot at the end of next quarter, on the same query set, so the series can be compared rather than admired. If you want to be cited by AI, start by measuring it honestly, and build that into the workflow we describe for GEO and LLMO.

