At CarbonCopies AI, we train agentic AI testers to navigate websites the way real users do: clicking, reading, and evaluating the UX as they go. But recently, we ran into a peculiar issue:
Our AI agents — powered by top-tier vision models like GPT-4V and Gemini — were making strange calls.
At first, we blamed it on LLM hallucinations. But the pattern was clear: the models weren’t hallucinating — they were misreading small, styled text.
Turns out, our agents didn’t need better training.
They needed glasses.
We benchmarked our agents and realized something fundamental:
LLMs are not built for pixel-perfect reading. They’re great at layout understanding, UX critique, and semantic patterns — but when it comes to recognizing “3” vs “5” in small font sizes or stylized containers? They struggle.
| GPT-4V / Gemini | Traditional OCR (e.g., Tesseract) |
| --- | --- |
| Struggles with small fonts | Tunable for small/fuzzy text |
| Guesses text from layout | Reads characters exactly |
| Great at context + reasoning | Great at precision + clarity |
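On the OCR side, most of the "tuning" is simple preprocessing. Here is a minimal sketch using Pillow and pytesseract (both assumed available; the file name and crop coordinates are hypothetical) that upscales a small price label before reading it as a single line of text:

```python
import pytesseract
from PIL import Image

# Hypothetical screenshot region containing a small, styled price label.
screenshot = Image.open("checkout.png").convert("L")   # grayscale
label = screenshot.crop((420, 310, 520, 335))          # (left, top, right, bottom)

# Upscale 3x so the glyphs are large enough for Tesseract,
# then read the crop as a single line of text (--psm 7).
label = label.resize((label.width * 3, label.height * 3), Image.LANCZOS)
text = pytesseract.image_to_string(label, config="--psm 7").strip()
print(text)  # e.g. "$1,090"
```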
To reduce false positives, we upgraded our architecture to a dual-agent design: a vision LLM agent handles layout understanding and UX critique, while an OCR agent re-reads the exact characters in any region the first agent flags (see the sketch below).
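In outline, the layout agent proposes findings with bounding boxes, and an OCR pass re-reads each box before anything reaches the report. A simplified sketch of that hand-off, where `ask_layout_agent` is a placeholder for whichever vision model you call and the finding fields are illustrative:

```python
import pytesseract
from PIL import Image

def ocr_read(screenshot: Image.Image, bbox: tuple[int, int, int, int]) -> str:
    """Character-level re-read of a region the layout agent flagged."""
    region = screenshot.convert("L").crop(bbox)
    region = region.resize((region.width * 3, region.height * 3), Image.LANCZOS)
    return pytesseract.image_to_string(region, config="--psm 7").strip()

def verify_findings(findings: list[dict], screenshot: Image.Image) -> list[dict]:
    """Keep the LLM's UX reasoning, but trust OCR for the literal characters."""
    for finding in findings:
        ocr_text = ocr_read(screenshot, finding["bbox"])
        if ocr_text and ocr_text != finding["text"]:
            finding["text"] = ocr_text           # e.g. correct "$4090" back to "$1090"
            finding["corrected_by_ocr"] = True   # flag the override for reviewers
    return findings

# findings = ask_layout_agent(screenshot)   # vision-model call, not shown here
# report = verify_findings(findings, screenshot)
```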
The result? Sharper insight, fewer hallucinations, and more confident UX reporting.
If your AI product is flagging non-existent bugs or reading “$1090” as “$4090,” it may not be hallucinating — it may just need better vision.
At CarbonCopies, combining LLM layout intelligence with OCR precision helped us deliver more reliable UX insights — and saved our clients from chasing false alarms.
Have you built a smarter, faster, or cheaper version of this hybrid approach?
We’d love to learn from you.