BotBeat
Independent Research
RESEARCH | 2026-03-05

BullshitBench v2: New Benchmark Tests Whether AI Models Can Recognize and Reject Nonsense Questions

Key Takeaways

  • BullshitBench v2 expands to 100 nonsense questions across five professional domains to test whether AI models can recognize and reject invalid prompts
  • The benchmark evaluates major models, including OpenAI's GPT-5.3-chat and Google's Gemini-3.1-flash-lite-preview, on their ability to detect nonsense rather than confidently hallucinate answers
  • New visualizations explore detection rates by domain, performance trends over time, and whether increased computational effort improves nonsense detection
Source: Hacker News (https://github.com/petergpt/bullshit-benchmark)

Summary

A new version of BullshitBench has been released, expanding the benchmark to 100 nonsense questions designed to test whether large language models can detect and reject invalid prompts rather than confidently answering them. Created by Peter Gostev, the benchmark evaluates models across five domains: software, finance, legal, medical, and physics. The v2 release includes comprehensive testing of major models including OpenAI's GPT-5.3-chat and Google's Gemini-3.1-flash-lite-preview, both evaluated as of March 4, 2026.

The benchmark specifically measures whether models can identify nonsensical questions, explicitly call them out, and avoid making confident assertions based on false premises. This addresses a critical challenge in AI safety: models that confidently hallucinate answers to impossible or nonsensical queries pose risks in real-world applications. The expanded v2 dataset provides domain-specific coverage, allowing researchers to assess whether models perform differently across technical, professional, and scientific contexts.
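The scoring described above, checking whether a model explicitly calls out a nonsense question rather than answering it, can be sketched as a simple rubric. This is a hypothetical illustration, not the benchmark's actual methodology; the marker phrases and function names below are assumptions, and the real rubric in the GitHub repository may differ (for example, by using an LLM judge rather than keyword matching).

```python
# Hypothetical sketch of a nonsense-detection rubric. BullshitBench's
# actual scoring may differ; see the GitHub repo for the real methodology.

REFUSAL_MARKERS = [
    "doesn't make sense", "does not make sense", "nonsensical",
    "false premise", "no such", "not a real", "invalid question",
]

def detects_nonsense(response: str) -> bool:
    """True if the response explicitly flags the question as invalid."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def detection_rate(responses: list[str]) -> float:
    """Fraction of responses that call out the prompt as nonsense."""
    if not responses:
        return 0.0
    return sum(detects_nonsense(r) for r in responses) / len(responses)

# Example: one response rejects the false premise, one hallucinates an answer.
rate = detection_rate([
    "This question rests on a false premise: no such API exists.",
    "The answer is 42 milliseconds, based on the standard protocol.",
])
print(rate)  # 0.5
```

A keyword rubric like this is brittle (a model can hedge without using any listed phrase), which is one reason domain coverage and explicit call-out criteria matter in the real benchmark.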

New visualizations in the v2 viewer enable analysis of detection rates by model, performance across different domains, and trends over time. The benchmark also explores whether increased computational effort (measured by tokens and cost) correlates with better nonsense detection. Results are publicly accessible through an interactive viewer, with all data and methodology available on GitHub for reproducibility and community contribution.


Editorial Opinion

BullshitBench v2 addresses one of the most underappreciated challenges in AI deployment: models that confidently answer nonsensical questions can be more dangerous than those that admit uncertainty. By systematically testing this capability across domains and model generations, this benchmark provides critical data for evaluating real-world AI reliability. The finding that computational effort may not consistently improve nonsense detection raises important questions about whether scaling alone can solve fundamental reasoning limitations.

Large Language Models (LLMs), Machine Learning, Ethics & Bias, AI Safety & Alignment, Open Source

