BotBeat

RESEARCH | Independent Research | 2026-03-05

BullshitBench v2: New Benchmark Tests Whether AI Models Can Recognize and Reject Nonsense Questions

Key Takeaways

  • BullshitBench v2 expands to 100 nonsense questions across five professional domains to test whether AI models can recognize and reject invalid prompts
  • The benchmark evaluates major models, including OpenAI's GPT-5.3-chat and Google's Gemini-3.1-flash-lite-preview, on their ability to detect nonsense rather than confidently hallucinate answers
  • New visualizations explore detection rates by domain, performance trends over time, and whether increased computational effort improves nonsense detection
Source: Hacker News, https://github.com/petergpt/bullshit-benchmark

Summary

A new version of BullshitBench has been released, expanding the benchmark to 100 nonsense questions designed to test whether large language models detect and reject invalid prompts rather than confidently answering them. Created by Peter Gostev, the benchmark evaluates models across five domains: software, finance, legal, medical, and physics. The v2 release reports results for major models, including OpenAI's GPT-5.3-chat and Google's Gemini-3.1-flash-lite-preview, both evaluated as of March 4, 2026.

The benchmark specifically measures whether models can identify nonsensical questions, explicitly call them out, and avoid making confident assertions based on false premises. This addresses a critical challenge in AI safety: models that confidently hallucinate answers to impossible or nonsensical queries pose risks in real-world applications. The expanded v2 dataset provides domain-specific coverage, allowing researchers to assess whether models perform differently across technical, professional, and scientific contexts.
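As a rough illustration of what a per-domain detection rate means, the sketch below tallies graded responses by domain. The record format and the verdict labels ("rejected", "answered") are assumptions made for this example only; they are not taken from the BullshitBench repository, and the sample data is invented.

```python
from collections import defaultdict

# Hypothetical verdict labels; the benchmark's actual grading scheme may differ.
REJECTED = "rejected"   # model called the question out as nonsense
ANSWERED = "answered"   # model answered as if the premise were valid


def detection_rate_by_domain(records):
    """Return {domain: fraction of nonsense questions that were rejected}.

    `records` is an iterable of (domain, verdict) pairs.
    """
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for domain, verdict in records:
        totals[domain] += 1
        if verdict == REJECTED:
            rejected[domain] += 1
    return {domain: rejected[domain] / totals[domain] for domain in totals}


# Invented sample data, for illustration only.
sample = [
    ("software", REJECTED),
    ("software", ANSWERED),
    ("finance", REJECTED),
    ("finance", REJECTED),
    ("medical", ANSWERED),
]

rates = detection_rate_by_domain(sample)
print(rates)
```

A higher rate for one domain than another would suggest the model is better at spotting false premises in that field, which is the kind of comparison the domain split is meant to support.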

New visualizations in the v2 viewer enable analysis of detection rates by model, performance across different domains, and trends over time. The benchmark also explores whether increased computational effort (measured by tokens and cost) correlates with better nonsense detection. Results are publicly accessible through an interactive viewer, with all data and methodology available on GitHub for reproducibility and community contribution.
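One simple way to ask whether effort tracks detection is a Pearson correlation between tokens used and detection rate. The sketch below is only an assumption about how such a check could be done, not the method used by the benchmark's viewer; the `runs` figures are invented placeholders, not reported results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-run aggregates: (mean output tokens, detection rate).
runs = [(200, 0.40), (400, 0.50), (800, 0.45), (1600, 0.55)]

tokens = [t for t, _ in runs]
detection = [d for _, d in runs]
r = pearson(tokens, detection)
print(f"token/detection correlation: {r:.2f}")
```

A coefficient near zero would be consistent with extra effort not helping, while a value near 1 would suggest it does. With only a handful of models, any such number should be read as indicative rather than conclusive.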


Editorial Opinion

BullshitBench v2 addresses one of the most underappreciated challenges in AI deployment: models that confidently answer nonsensical questions can be more dangerous than those that admit uncertainty. By systematically testing this capability across domains and model generations, this benchmark provides critical data for evaluating real-world AI reliability. The finding that computational effort may not consistently improve nonsense detection raises important questions about whether scaling alone can solve fundamental reasoning limitations.

Large Language Models (LLMs) | Machine Learning | Ethics & Bias | AI Safety & Alignment | Open Source
