BotBeat

RESEARCH · University of Wisconsin-Madison / Max Planck Institute · 2026-03-06

Researchers Use LLMs to Automate Compiler Testing, Discover 88 Bugs in MLIR Dialects

Key Takeaways

  • Germinator uses LLMs to automatically generate test seeds for compiler fuzzing without requiring manual corpus construction or training data
  • The tool achieved a 10-120% improvement in line coverage over grammar-based baselines across 91 MLIR dialects
  • Discovered 88 previously unknown bugs (40 confirmed), including 23 in dialects that previously had no automated testing
Source: Hacker News · https://arxiv.org/abs/2512.05887

Summary

Researchers from the University of Wisconsin-Madison and the Max Planck Institute for Security and Privacy have developed Germinator, a novel tool that leverages large language models to automatically generate test cases for compiler fuzzing. The research addresses a critical challenge in testing extensible compiler frameworks like MLIR, which enable rapid creation of domain-specific language dialects but lack comprehensive testing infrastructure. Traditional fuzzing approaches require manual seed corpus construction for each dialect or fail to effectively target dialect-specific features.

Germinator combines grammar extraction from dialect specifications with pre-trained LLMs to automatically generate diverse, representative seed inputs without requiring manual intervention or training data. The tool then uses these seeds to bootstrap coverage-guided fuzzers that can effectively test low-resource language dialects. When evaluated across six MLIR projects spanning 91 dialects, Germinator improved line coverage by 10-120% compared to grammar-based baselines.
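The pipeline described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `query_llm` is a stub standing in for a real pre-trained model call, the TableGen-style spec parsing is heavily simplified, and none of the function names come from the Germinator paper itself.

```python
# Hypothetical sketch of LLM-driven seed generation for a compiler fuzzer.
# `query_llm`, the prompt shape, and the spec format are illustrative
# assumptions, not the tool's actual implementation.

def extract_op_names(dialect_spec: str) -> list[str]:
    """Pull operation names from a TableGen-style dialect specification
    (lines like 'def ArithAddIOp : ...'). Greatly simplified."""
    ops = []
    for line in dialect_spec.splitlines():
        line = line.strip()
        if line.startswith("def ") and "Op" in line:
            ops.append(line.split()[1])
    return ops

def build_prompt(dialect: str, op_names: list[str]) -> str:
    """Ask the model for a small, valid MLIR snippet exercising the ops."""
    return (
        f"Write a small, valid MLIR program using the '{dialect}' dialect "
        f"that exercises these operations: {', '.join(op_names)}."
    )

def query_llm(prompt: str) -> str:
    """Placeholder for a real pre-trained LLM call. Here it returns a
    fixed snippet so the sketch stays self-contained and runnable."""
    return '%0 = "arith.addi"(%a, %b) : (i32, i32) -> i32'

def generate_seeds(dialect: str, spec: str, n_seeds: int) -> list[str]:
    """Produce candidate inputs to bootstrap a coverage-guided fuzzer."""
    ops = extract_op_names(spec)
    return [query_llm(build_prompt(dialect, ops)) for _ in range(n_seeds)]

spec = """
def ArithAddIOp : Arith_Op<"addi"> { }
def ArithMulIOp : Arith_Op<"muli"> { }
"""
seeds = generate_seeds("arith", spec, n_seeds=3)
```

In a real setup, the generated seeds would be written into the fuzzer's corpus directory so a coverage-guided engine can mutate them from there; the key idea the sketch captures is that no hand-built corpus or model fine-tuning is needed per dialect.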

The practical impact is substantial: Germinator discovered 88 previously unknown bugs, with 40 already confirmed by maintainers. Notably, 23 of these bugs were found in dialects that had no prior automated test generators, demonstrating the tool's ability to bring automated testing to previously untested compiler components. The research shows how LLMs can be effectively applied to software engineering challenges beyond code generation, particularly in creating testing infrastructure for complex, heterogeneous systems where manual test creation is impractical.

  • Demonstrates dialect-agnostic approach that works across different language dialects while remaining effective at finding dialect-specific bugs
  • Shows practical application of LLMs in software testing infrastructure beyond traditional code generation use cases

Editorial Opinion

This research represents an important application of LLMs to software reliability—an area that could have more immediate practical impact than many generative AI applications. The ability to automatically bootstrap testing infrastructure for compiler dialects addresses a real pain point in compiler development, where testing has traditionally lagged behind implementation speed. The 88 bugs discovered, particularly in previously untested dialects, validate that this isn't just an academic exercise but a tool that can immediately improve software quality in production systems. As compiler frameworks become more extensible and domain-specific languages proliferate, automated testing approaches like Germinator may become essential infrastructure.

Large Language Models (LLMs) · Machine Learning · MLOps & Infrastructure · Science & Research · Open Source
