Research: How URLs in Prompts Can Influence LLM Outputs Toward Training Data

Key Takeaways

▸ URLs in prompts influence LLM outputs only for content that was included in the model's training data
▸ LLM providers lack transparency about training data sources, collection methods, and knowledge cutoff dates
▸ Content loaded dynamically via JavaScript is largely excluded from LLM training data due to crawler limitations

Source:

Hacker News: https://aifoc.us/influencing-model-output-with-urls/

Summary

Independent researcher Paul Kinlan ran a series of experiments to test whether URLs appearing in LLM prompts influence model outputs. Across multiple models, he found that URLs do steer model behavior, but only when the content at that URL is present in the model's training data. The research also examined how LLM training data is collected and exposed significant transparency gaps around knowledge cutoff dates and data sources.

Kinlan's investigation also found that LLM providers employ different crawling strategies, which affects what content ends up in their models. Anthropic's ClaudeBot and OpenAI's GPTBot do not execute JavaScript, so dynamically loaded content is unlikely to be included in training data. OpenAI's search-specific crawler, OAI-SearchBot, does execute JavaScript. As a result, a substantial amount of public web content, particularly JavaScript-dependent content, never reaches model training pipelines.
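To make the JavaScript point concrete, the sketch below approximates what a crawler that does not execute JavaScript can read from a page: only the text already present in the raw HTML. It is an illustration and not Kinlan's method. The names `StaticTextExtractor`, `static_text`, `visible_without_js`, and the sample page are all hypothetical, and finding a phrase in static HTML says nothing about whether a provider actually trained on it.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collects the text a non-JavaScript crawler could read from raw HTML."""

    # Content inside these tags is code or inert markup, not rendered text.
    SKIPPED_TAGS = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def static_text(html: str) -> str:
    """Return the text present in the HTML without running any scripts."""
    parser = StaticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def visible_without_js(html: str, phrase: str) -> bool:
    """True if the phrase appears in the static (pre-JavaScript) text."""
    return phrase.lower() in static_text(html).lower()


# Hypothetical page: one paragraph is static, one is injected by a script.
SAMPLE_PAGE = """
<html><body>
<h1>Release notes</h1>
<p>Version 2 ships today.</p>
<div id="app"></div>
<script>
document.getElementById("app").textContent = "Pricing starts at 10 dollars";
</script>
</body></html>
"""

if __name__ == "__main__":
    for phrase in ("Version 2 ships today", "Pricing starts at 10 dollars"):
        print(phrase, "->", visible_without_js(SAMPLE_PAGE, phrase))
```

In this sample, the static paragraph is found while the script-injected pricing text is not, which mirrors why JavaScript-dependent content is unlikely to reach crawlers such as ClaudeBot and GPTBot.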

Editorial Opinion

This research exposes a critical transparency gap in LLM development. As organizations rely on LLMs to reference external URLs, understanding whether and how those URLs influence model behavior has become essential. Kinlan's findings indicate that LLM providers need to be far more forthcoming about their training data collection methods, knowledge cutoff dates, and crawler capabilities. That transparency is vital for users to prompt these systems effectively and understand their limitations.

RESEARCH · Anthropic · 2026-07-03
