May 4, 2025
Our AI Agents Needed Glasses

How CarbonCopies AI augmented its agentic UX testing system with traditional OCR to improve accuracy

👀 When Your AI Can't Read the Fine Print

At CarbonCopies AI, we train agentic AI testers to navigate websites the way real users do — clicking, reading, and evaluating UX as they go. But recently, we ran into a peculiar issue:

Our AI agents — powered by top-tier vision models like GPT-4V and Gemini — were making strange calls.

  • 💰 $1090 turned into $4090 — when verifying mortgage pricing calculations
  • 📦 Button labels got pluralized — flagging a non-existent spelling error
  • 📅 3-day trips showed up as 5-day getaways — a data mismatch that didn’t exist

At first, we blamed it on LLM hallucinations. But the pattern was clear: the models weren’t hallucinating — they were misreading small, styled text.

Turns out, our agents didn’t need better training.
They needed glasses.

🔍 Vision LLMs vs OCR: Who's Better at Reading?

We benchmarked our agents and realized something fundamental:
LLMs are not built for pixel-perfect reading. They’re great at layout understanding, UX critique, and semantic patterns — but when it comes to recognizing “3” vs “5” in small font sizes or stylized containers? They struggle.

GPT-4V / Gemini              | Traditional OCR (e.g., Tesseract)
-----------------------------|----------------------------------
Struggles with small fonts   | Tunable for small/fuzzy text
Guesses text from layout     | Reads characters exactly
Great at context + reasoning | Great at precision + clarity
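
That "tunable" column is the key difference. As a rough illustration, here's how you might point Tesseract at small, styled UI text: upscale the crop and hint the layout. The 3x scale factor and the `--psm 7` flag are starting points to experiment with, not settings we're prescribing:

```python
# Rough sketch: tuning Tesseract for small/stylized UI text by upscaling
# the crop and hinting the layout. Values here are starting points, not
# recommendations. Requires a local Tesseract install plus the
# pytesseract and Pillow packages.
import pytesseract
from PIL import Image

def read_small_text(crop_path: str, scale: int = 3) -> str:
    crop = Image.open(crop_path).convert("L")  # grayscale often helps OCR
    upscaled = crop.resize(
        (crop.width * scale, crop.height * scale), Image.LANCZOS
    )
    # --psm 7: treat the image as a single line of text, which suits
    # prices, labels, and button captions clipped from a screenshot.
    return pytesseract.image_to_string(upscaled, config="--psm 7").strip()
```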

🛠️ Our Solution: OCR Joins the AI Agent Workflow

To reduce false positives, we upgraded our architecture to a dual-agent design with an arbitration step, sketched in code after the list:

  • 🧠 Agent 1: GPT-4V handles layout understanding, visual alignment, and UX critique
  • 🔍 Agent 2: OCR function (via Tesseract or AWS Textract) extracts exact text + coordinates
  • ⚖️ Reconciliation Agent compares outputs and removes inconsistencies before flagging bugs
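
Here's a minimal sketch of the OCR and reconciliation steps in Python, assuming Tesseract via pytesseract. The `llm_findings` structure (with its `quoted_text` field) and the `reconcile` helper are illustrative names, not our production API:

```python
# Minimal sketch of the OCR extraction + reconciliation pass.
# Illustrative only: `llm_findings` stands in for whatever structured
# output your vision LLM returns about the same screenshot.
import pytesseract
from PIL import Image

def ocr_extract(screenshot_path: str) -> list[dict]:
    """Extract exact text plus bounding boxes from a screenshot."""
    image = Image.open(screenshot_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # skip empty detections
            words.append({
                "text": text,
                "conf": float(data["conf"][i]),
                "box": (data["left"][i], data["top"][i],
                        data["width"][i], data["height"][i]),
            })
    return words

def reconcile(llm_findings: list[dict], ocr_words: list[dict]) -> list[dict]:
    """Keep only LLM findings whose quoted text the OCR pass actually saw."""
    ocr_vocab = {w["text"] for w in ocr_words}
    confirmed = []
    for finding in llm_findings:
        quoted = finding.get("quoted_text", "")
        # If the LLM quotes text that OCR never read anywhere on the page,
        # treat it as a misread rather than a real UX bug, and drop it.
        if quoted and all(token in ocr_vocab for token in quoted.split()):
            confirmed.append(finding)
    return confirmed
```

A fuller version would also match on bounding-box proximity, since the same string can appear in several places on a page.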

The result? Sharper insight, fewer hallucinations, and more confident UX reporting.

🎯 Best Practices If You're Building Similar Systems

  1. Capture high-resolution screenshots — especially for text-heavy UI
  2. Avoid over-relying on LLMs for spelling and numbers
  3. Use OCR as a sanity-check layer before raising red flags
  4. Reconcile agent outputs with an arbitration step
  5. Log OCR confidence scores to tune sensitivity thresholds (see the sketch after this list)
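
For point 5, here's roughly what that logging can look like, again assuming pytesseract. The 60-point cutoff is a placeholder you'd tune against your own false-positive rate, not a recommended value:

```python
# Sketch of OCR confidence logging (best practice #5). Tesseract reports
# a confidence of -1 for non-text blocks and 0-100 for recognized words;
# the cutoff below is a placeholder to tune, not a recommendation.
import logging
import pytesseract
from PIL import Image

logger = logging.getLogger("ocr_sanity_check")
CONFIDENCE_CUTOFF = 60.0  # placeholder threshold; tune per product

def confident_words(screenshot_path: str) -> list[str]:
    data = pytesseract.image_to_data(
        Image.open(screenshot_path), output_type=pytesseract.Output.DICT
    )
    kept = []
    for text, conf in zip(data["text"], data["conf"]):
        score = float(conf)
        if not text.strip() or score < 0:
            continue  # skip empty or non-text detections
        logger.info("ocr word=%r conf=%.1f", text, score)
        if score >= CONFIDENCE_CUTOFF:
            kept.append(text)
    return kept
```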

📣 Final Takeaway

If your AI product is flagging non-existent bugs or reading “$1090” as “$4090,” it may not be hallucinating — it may just need better vision.

At CarbonCopies, combining LLM layout intelligence with OCR precision helped us deliver more reliable UX insights — and saved our clients from chasing false alarms.

Have you built a smarter, faster, or cheaper version of this hybrid approach?
We’d love to learn from you.