BotBeat
Independent Research
RESEARCH · 2026-03-05

BullshitBench v2: New Benchmark Tests Whether AI Models Can Recognize and Reject Nonsense Questions

Key Takeaways

  • BullshitBench v2 expands to 100 nonsense questions across five professional domains to test whether AI models can recognize and reject invalid prompts
  • The benchmark evaluates major models, including OpenAI's GPT-5.3-chat and Google's Gemini-3.1-flash-lite-preview, on their ability to detect nonsense rather than confidently hallucinate answers
  • New visualizations explore detection rates by domain, performance trends over time, and whether increased computational effort improves nonsense detection
Source: Hacker News, https://github.com/petergpt/bullshit-benchmark

Summary

A new version of BullshitBench has been released, expanding the benchmark to 100 nonsense questions designed to test whether large language models can detect and reject invalid prompts rather than confidently answering them. Created by Peter Gostev, the benchmark evaluates models across five domains: software, finance, legal, medical, and physics. The v2 release covers major models, among them OpenAI's GPT-5.3-chat and Google's Gemini-3.1-flash-lite-preview, both evaluated as of March 4, 2026.

The benchmark specifically measures whether models can identify nonsensical questions, explicitly call them out, and avoid making confident assertions based on false premises. This addresses a critical challenge in AI safety: models that confidently hallucinate answers to impossible or nonsensical queries pose risks in real-world applications. The expanded v2 dataset provides domain-specific coverage, allowing researchers to assess whether models perform differently across technical, professional, and scientific contexts.
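
The three criteria above could be combined into a single verdict per response along the following lines. This is only an illustrative sketch: the `Judgement` class, its field names, and the verdict labels are hypothetical and are not taken from the BullshitBench repository, whose actual grading scheme may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """Hypothetical grading record for one model response."""
    identifies_nonsense: bool       # the response recognizes the question is invalid
    calls_it_out: bool              # the response says so explicitly
    avoids_confident_claims: bool   # the response does not assert facts built on the false premise

    def verdict(self) -> str:
        # Count how many of the three criteria the response satisfied.
        passed = sum(
            [self.identifies_nonsense, self.calls_it_out, self.avoids_confident_claims]
        )
        if passed == 3:
            return "rejected"   # nonsense fully detected and refused
        if passed == 0:
            return "accepted"   # nonsense answered as if it were valid
        return "partial"        # some pushback, but not on every criterion


if __name__ == "__main__":
    print(Judgement(True, True, True).verdict())
    print(Judgement(True, False, False).verdict())
    print(Judgement(False, False, False).verdict())
```

Requiring all three criteria for a "rejected" verdict is a deliberately strict assumption; a real grader might weight the criteria differently.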

New visualizations in the v2 viewer enable analysis of detection rates by model, performance across different domains, and trends over time. The benchmark also explores whether increased computational effort (measured by tokens and cost) correlates with better nonsense detection. Results are publicly accessible through an interactive viewer, with all data and methodology available on GitHub for reproducibility and community contribution.
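
As a rough illustration of the analyses described above, the sketch below computes detection rates per group and a Pearson correlation between token usage and detection. The records, field names, and numbers are invented for the example and do not come from the benchmark's published data.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical records, one per (model, question) pair. All values are made up.
RESULTS = [
    {"model": "model-a", "domain": "software", "detected": True,  "tokens": 120},
    {"model": "model-a", "domain": "finance",  "detected": False, "tokens": 480},
    {"model": "model-a", "domain": "medical",  "detected": True,  "tokens": 200},
    {"model": "model-b", "domain": "software", "detected": False, "tokens": 900},
    {"model": "model-b", "domain": "finance",  "detected": False, "tokens": 700},
    {"model": "model-b", "domain": "medical",  "detected": True,  "tokens": 300},
]


def detection_rate_by(records, key):
    """Return the share of detected nonsense questions for each value of `key`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["detected"])
    return {group: hits[group] / totals[group] for group in totals}


def pearson(xs, ys):
    """Pearson correlation coefficient, or None if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    print(detection_rate_by(RESULTS, "domain"))
    print(detection_rate_by(RESULTS, "model"))
    tokens = [r["tokens"] for r in RESULTS]
    detected = [int(r["detected"]) for r in RESULTS]
    print(pearson(tokens, detected))
```

In this toy data the correlation comes out negative, meaning longer responses detected nonsense less often; that reflects only the invented numbers and says nothing about the benchmark's real results.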

  • All results and methodology are publicly available through an interactive viewer and open-source GitHub repository

Editorial Opinion

BullshitBench v2 addresses one of the most underappreciated challenges in AI deployment: models that confidently answer nonsensical questions can be more dangerous than those that admit uncertainty. By systematically testing this capability across domains and model generations, this benchmark provides critical data for evaluating real-world AI reliability. The finding that computational effort may not consistently improve nonsense detection raises important questions about whether scaling alone can solve fundamental reasoning limitations.

Large Language Models (LLMs) · Machine Learning · Ethics & Bias · AI Safety & Alignment · Open Source
