How LLMs Parse Your Content - Technical Breakdown of ChatGPT, Claude, and Perplexity Extraction | AiVIS Cite Ledger Blogs

12 min read · TECHNOLOGY

Most teams have no idea what LLMs actually evaluate when parsing their site.

Key Takeaways

  • LLMs use structured data (JSON-LD, microdata) as the primary extraction signal.
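  • Tokenizers and context windows differ per model and version: early ChatGPT models handled ~4k tokens and Claude ~100k, current releases accept considerably more, and Perplexity varies with the underlying model.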
  • Answer blocks (FAQPage, HowTo, and Question/Answer markup such as QAPage) are often prioritized over unstructured body copy for extraction.
  • URL normalization, canonicalization, and robots.txt directives are evaluated before content parsing.
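
The pre-parse checks in the last takeaway can be sketched with the Python standard library. This is a minimal illustration and not any answer engine's actual pipeline: the robots.txt body, the `normalize_url` and `is_allowed` helpers, and the specific normalization rules are assumptions chosen for the example, and `GPTBot` is used only as a sample crawler user agent.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop default ports and fragments,
    and strip a trailing slash from the path (assumed rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    default_port = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not default_port:
        host = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = normalize_url("HTTPS://Example.com:443/Blog/#top")
    print(url)
    print(is_allowed(ROBOTS_TXT, "GPTBot", url))
    print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/private/report"))
```

Running the normalization before the robots.txt check means duplicate URL variants resolve to one decision; whether a given crawler orders the steps this way is not something the sketch can confirm.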

Cited external sources

Introducing ChatGPT Search

OpenAI · 2025-11-20

Evidence that synthesized answers select and cite sources rather than exposing full SERPs.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

arXiv · Patrick Lewis et al. · 2025-11-20

Background for retrieval, chunking, and grounded answer assembly.

Search Central documentation on structured data

Google Search Central · 2026-04-16

Reference for JSON-LD and machine-readable content signals.
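
As a rough illustration of the machine-readable signals discussed in the takeaways, the sketch below pulls question and answer pairs out of a FAQPage block embedded as JSON-LD. It uses only the Python standard library. The sample HTML, the `JsonLdCollector` class, and the `extract_faq_pairs` helper are hypothetical names invented for this example, and nothing here claims to mirror how ChatGPT, Claude, or Perplexity actually extract content.

```python
import json
from html.parser import HTMLParser

# Hypothetical page fragment with a schema.org FAQPage block.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is JSON-LD?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "A JSON-based format for linked data."
    }
  }]
}
</script>
</head><body><p>Unstructured body copy.</p></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collects the raw text of each <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_faq_pairs(html: str) -> list[tuple[str, str]]:
    """Return (question, answer) pairs from every FAQPage JSON-LD block in html."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    pairs = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict) or item.get("@type") != "FAQPage":
                continue
            questions = item.get("mainEntity", [])
            if isinstance(questions, dict):
                questions = [questions]
            for question in questions:
                answer = question.get("acceptedAnswer", {})
                pairs.append((question.get("name", ""), answer.get("text", "")))
    return pairs


if __name__ == "__main__":
    for question, answer in extract_faq_pairs(SAMPLE_HTML):
        print(f"Q: {question}\nA: {answer}")
```

The point of the sketch is that a well-formed FAQPage block yields clean pairs with no heuristics, while the surrounding body copy would need separate, less reliable parsing.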