May 4, 2025
Our AI Agents Needed Glasses

How CarbonCopies AI fine-tuned its agentic UX testing system with traditional OCR to improve accuracy

👀 When Your AI Can't Read the Fine Print

At CarbonCopies AI, we train agentic AI testers to navigate websites the way real users do: clicking, reading, and evaluating UX along the way. But recently, we ran into a peculiar issue:

Our AI agents — powered by top-tier vision models like GPT-4V and Gemini — were making strange calls.

  • 💰 $1090 turned into $4090 — when verifying mortgage pricing calculations
  • 📦 Button labels got pluralized — flagging a non-existent spelling error
  • 📅 3-day trips showed up as 5-day getaways — a data mismatch that didn’t exist

At first, we blamed it on LLM hallucinations. But the pattern was clear: the models weren’t hallucinating — they were misreading small, styled text.

Turns out, our agents didn’t need better training.
They needed glasses.

🔍 Vision LLMs vs OCR: Who's Better at Reading?

We benchmarked our agents and realized something fundamental:
LLMs are not built for pixel-perfect reading. They’re great at layout understanding, UX critique, and semantic patterns — but when it comes to recognizing “3” vs “5” in small font sizes or stylized containers? They struggle.

GPT-4V / Gemini               | Traditional OCR (e.g., Tesseract)
------------------------------|----------------------------------
Struggles with small fonts    | Tunable for small/fuzzy text
Guesses text from layout      | Reads characters exactly
Great at context + reasoning  | Great at precision + clarity
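In practice, the OCR side of that comparison can be very small. Here is a minimal sketch, assuming pytesseract (a Python wrapper around the Tesseract CLI) and Pillow are installed; the file name price_crop.png is a hypothetical screenshot crop of a price widget, not part of our pipeline:

    # Minimal sketch: read a small, styled price string with Tesseract.
    from PIL import Image
    import pytesseract

    # Upscale the crop first: Tesseract reads small UI fonts far more reliably
    # when the effective resolution is closer to print-quality DPI.
    img = Image.open("price_crop.png")
    img = img.resize((img.width * 3, img.height * 3), Image.LANCZOS)

    # --psm 7 treats the crop as a single line of text, and the character
    # whitelist keeps stylized glyphs from being "read" as lookalike characters.
    config = "--psm 7 -c tessedit_char_whitelist=$0123456789,."
    print(pytesseract.image_to_string(img, config=config).strip())  # e.g. "$1090"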

🛠️ Our Solution: OCR Joins the AI Agent Workflow

To reduce false positives, we upgraded our architecture to include a dual-agent design:

  • 🧠 Agent 1: GPT-4V handles layout understanding, visual alignment, and UX critique
  • 🔍 Agent 2: OCR function (via Tesseract or AWS Textract) extracts exact text + coordinates
  • ⚖️ Reconciliation Agent compares the two outputs and removes inconsistencies before flagging bugs (see the sketch below)
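Here is a simplified sketch of that reconciliation step, assuming each agent returns the text it read for a given UI region. The names (reconcile, ReconciliationResult) are illustrative, not our production API:

    # Minimal sketch: only flag a bug when the vision LLM and OCR agree.
    from dataclasses import dataclass
    import re

    @dataclass
    class ReconciliationResult:
        region: str
        llm_text: str
        ocr_text: str
        agrees: bool

    def normalize(text: str) -> str:
        # Strip whitespace and thousands separators so "$1,090" == "$1090".
        return re.sub(r"[\s,]", "", text).lower()

    def reconcile(region: str, llm_text: str, ocr_text: str) -> ReconciliationResult:
        # If the readings disagree, trust OCR for literal strings and hold the
        # finding back instead of reporting it as a UX bug.
        agrees = normalize(llm_text) == normalize(ocr_text)
        return ReconciliationResult(region, llm_text, ocr_text, agrees)

    # Example: the mismatch behind our false "$4090" report.
    result = reconcile("mortgage_price", llm_text="$4090", ocr_text="$1,090")
    if not result.agrees:
        print(f"[{result.region}] LLM read {result.llm_text!r}, "
              f"OCR read {result.ocr_text!r}: deferring to OCR")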

The result? Sharper insight, fewer hallucinations, and more confident UX reporting.

🎯 Best Practices If You're Building Similar Systems

  1. Capture high-resolution screenshots — especially for text-heavy UI
  2. Avoid over-relying on LLMs for spelling and numbers
  3. Use OCR as a sanity-check layer before raising red flags
  4. Reconcile agent outputs with an arbitration step
  5. Log OCR confidence scores to tune sensitivity thresholds (see the sketch below)
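As an example of points 3 and 5, here is a minimal sketch, again assuming pytesseract, that logs word-level confidence so low-quality reads never turn into bug reports. The threshold and file name are placeholders to tune against your own screenshots:

    # Minimal sketch: use OCR confidence as a sanity-check layer.
    from PIL import Image
    import pytesseract

    MIN_CONFIDENCE = 80  # placeholder threshold; tune on your own UI captures

    img = Image.open("checkout_screenshot.png")  # hypothetical full-page capture
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)
        if not word.strip() or conf < 0:  # skip empty boxes and non-text blocks
            continue
        if conf < MIN_CONFIDENCE:
            # Low-confidence reads get logged, not flagged: a blurry "3" is not a bug.
            print(f"low-confidence OCR: {word!r} (conf={conf:.0f})")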

📣 Final Takeaway

If your AI product is flagging non-existent bugs or reading “$1090” as “$4090,” it may not be hallucinating — it may just need better vision.

At CarbonCopies, combining LLM layout intelligence with OCR precision helped us deliver more reliable UX insights — and saved our clients from chasing false alarms.

Have you built a smarter, faster, or cheaper version of this hybrid approach?
We’d love to learn from you.

CarbonCopies creates AI twins of your customer segments (loyalty members, first-time buyers, Gen Z shoppers using BNPL) and shows what confuses them across product specs, category pages, and checkout flows. You get friction maps and redesigned screens, fast.

Built for global consumer apparel, electronics, and hospitality teams who need persona-level conversion wins before the next sprint.
Book a 27-minute tailored analysis.
Zero deck. Zero setup. Just insights.
