BullshitBench v2: New Benchmark Tests Whether AI Models Can Recognize and Reject Nonsense Questions

Key Takeaways

▸BullshitBench v2 expands to 100 nonsense questions across five professional domains to test whether AI models can recognize and reject invalid prompts
▸The benchmark evaluates major models including OpenAI's GPT-5.3-chat and Google's Gemini-3.1-flash-lite-preview on their ability to detect nonsense rather than confidently hallucinate answers
▸New visualizations explore detection rates by domain, performance trends over time, and whether increased computational effort improves nonsense detection

Source:

Hacker News: https://github.com/petergpt/bullshit-benchmark

Summary

A new version of BullshitBench has been released, expanding the benchmark to 100 nonsense questions designed to test whether large language models can detect and reject invalid prompts rather than confidently answering them. Created by Peter Gostev, the benchmark evaluates models across five domains: software, finance, legal, medical, and physics. The v2 release covers major models such as OpenAI's GPT-5.3-chat and Google's Gemini-3.1-flash-lite-preview, both evaluated as of March 4, 2026.

The benchmark specifically measures whether models can identify nonsensical questions, explicitly call them out, and avoid making confident assertions based on false premises. This addresses a critical challenge in AI safety: models that confidently hallucinate answers to impossible or nonsensical queries pose risks in real-world applications. The expanded v2 dataset provides domain-specific coverage, allowing researchers to assess whether models perform differently across technical, professional, and scientific contexts.
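As a rough illustration of how per-domain detection rates could be computed from graded responses, here is a minimal Python sketch. The three label names, the `detection_rate_by_domain` helper, and the sample records are all hypothetical; the benchmark's actual grading scheme and data format are defined in its GitHub repository and may differ.

```python
from collections import defaultdict

# Hypothetical grading labels (assumed for illustration, not the benchmark's own).
DETECTED = "detected"  # model explicitly calls out the nonsense
PARTIAL = "partial"    # model hedges but still engages with the false premise
ACCEPTED = "accepted"  # model answers confidently as if the question were valid


def detection_rate_by_domain(records):
    """Return {domain: fraction of responses labeled DETECTED}.

    `records` is an iterable of (domain, label) pairs.
    """
    totals = defaultdict(int)
    detected = defaultdict(int)
    for domain, label in records:
        totals[domain] += 1
        if label == DETECTED:
            detected[domain] += 1
    return {domain: detected[domain] / totals[domain] for domain in totals}


# Made-up sample records for illustration only; these are not benchmark results.
sample = [
    ("software", DETECTED),
    ("software", ACCEPTED),
    ("finance", DETECTED),
    ("finance", DETECTED),
    ("medical", PARTIAL),
]
rates = detection_rate_by_domain(sample)
print(rates)
```

Counting only explicit call-outs as detections is one possible design choice; a scheme that gives partial credit for hedged answers would produce different rates.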

New visualizations in the v2 viewer enable analysis of detection rates by model, performance across different domains, and trends over time. The benchmark also explores whether increased computational effort (measured by tokens and cost) correlates with better nonsense detection. Results are publicly accessible through an interactive viewer, with all data and methodology available on GitHub for reproducibility and community contribution.
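One simple way to examine the effort question is a correlation between tokens used and whether the nonsense was detected. The sketch below computes a Pearson coefficient in plain Python; the `pearson` helper and the token and detection values are invented for illustration and do not reflect how the v2 viewer performs its analysis or any actual benchmark results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up data: tokens spent per response, and 1 if the nonsense was detected.
tokens = [100, 200, 300, 400]
detected = [0, 1, 0, 1]
r = pearson(tokens, detected)
print(round(r, 3))
```

A coefficient near zero would suggest that spending more tokens does not reliably help, though a correlation on its own says nothing about cause.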

Editorial Opinion

BullshitBench v2 targets an underappreciated problem in AI deployment: a model that confidently answers a nonsensical question can be more dangerous than one that admits uncertainty. By testing this capability systematically across domains and model generations, the benchmark provides useful data for evaluating real-world AI reliability. If additional computational effort does not consistently improve nonsense detection, that raises the question of whether scaling alone can address fundamental reasoning limitations.
