How LLMs Parse Your Content - Technical Breakdown of ChatGPT, Claude, and Perplexity Extraction | AiVIS Cite Ledger Blogs
By R. Mason · 12 min read · TECHNOLOGY
Most teams have no idea what LLMs actually evaluate when parsing their site.
Key Takeaways
- LLMs use structured data (JSON-LD, microdata) as the primary extraction signal.
- Answer engines differ in tokenizer and context window per model: ChatGPT ~4k tokens, Claude ~100k, Perplexity varies. These limits change between model versions, so treat the figures as approximate.
- Answer blocks (FAQPage, HowTo, QAPage) are often prioritized over unstructured body copy for extraction.
- URL normalization, canonicalization, and robots.txt directives are evaluated before content parsing.
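The takeaways above can be made concrete with a small sketch. It is only an illustration of the general order described (URL normalization and robots.txt checks first, structured-data extraction second), not a description of how ChatGPT, Claude, or Perplexity are actually implemented. The helper names (`normalize_url`, `is_allowed`, `JsonLdExtractor`, `faq_pairs`), the `ExampleBot` user agent, and the sample page are all hypothetical; only the Python standard library is used.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, default an empty path to '/'."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate robots.txt text that was already fetched (no network access here)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed JSON-LD blocks
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def faq_pairs(items):
    """Return (question, answer) tuples from any FAQPage objects in items."""
    pairs = []
    for item in items:
        if isinstance(item, dict) and item.get("@type") == "FAQPage":
            for question in item.get("mainEntity", []):
                answer = question.get("acceptedAnswer", {})
                pairs.append((question.get("name", ""), answer.get("text", "")))
    return pairs


# Hypothetical inputs for illustration only.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "FAQPage",
 "mainEntity": [{"@type": "Question", "name": "What is JSON-LD?",
   "acceptedAnswer": {"@type": "Answer", "text": "A JSON format for linked data."}}]}
</script></head><body><p>Unstructured body copy.</p></body></html>"""

url = normalize_url("HTTPS://Example.com/faq#top")
if is_allowed(ROBOTS, "ExampleBot", url):  # pre-parse gate
    extractor = JsonLdExtractor()
    extractor.feed(PAGE)
    print(url, faq_pairs(extractor.items))
```

In this sketch a page blocked by robots.txt never reaches the extraction step, which mirrors the claim that crawl directives are evaluated before content parsing. A real pipeline would also resolve the page's canonical link and handle microdata, both omitted here for brevity.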
Cited external sources
Introducing ChatGPT Search
OpenAI · 2025-11-20
Evidence that synthesized answers select and cite sources rather than exposing full SERPs.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
arXiv · Patrick Lewis et al. · 2025-11-20
Background for retrieval, chunking, and grounded answer assembly.
Search Central documentation on structured data
Google Search Central · 2026-04-16
Reference for JSON-LD and machine-readable content signals.