OpenZeppelin Finds Flaws in OpenAI's EVMbench Blockchain Security Benchmark

Blockchain security firm OpenZeppelin audited OpenAI's EVMbench and uncovered training data contamination and factual errors in the vulnerability dataset used to test AI agents.

CoinJP Editorial

Blockchain cybersecurity firm OpenZeppelin has conducted an independent audit of OpenAI's newly launched EVMbench benchmark, uncovering significant methodological flaws and data contamination issues that undermine the reliability of AI agent testing results.

«https://t.co/yW00RmRBZQ» — OpenZeppelin (@OpenZeppelin), original post

What Is EVMbench and Why Does It Matter

OpenAI launched EVMbench in mid-February in partnership with investment firm Paradigm. The tool is designed to evaluate how well AI agents can detect, fix, and exploit vulnerabilities in smart contracts. It is built on a curated set of 120 vulnerabilities identified during security audits conducted between 2024 and 2025.

OpenZeppelin welcomed the initiative but decided to subject EVMbench to the same rigorous standards it applies when auditing major DeFi protocols, including Aave, Lido, and Uniswap.

Why This Matters

Benchmarks are foundational to the development of AI-powered blockchain security tools. If the test dataset contains errors, or lets models recall answers from training data instead of discovering them independently, the results say little about what the models can actually do. In the DeFi sector, where smart contract exploits routinely cause losses in the hundreds of millions of dollars, the integrity of such evaluation tools carries enormous weight.

Training Data Contamination

OpenZeppelin's primary concern centers on data contamination. The leading AI models tested against EVMbench have a knowledge cutoff of August 2025. Since the benchmark's vulnerability set spans 2024–2025, these models likely encountered information about those specific bugs during their training phase.

Even with internet access disabled, the models could have been "remembering" known vulnerabilities rather than genuinely identifying them. This raises a fundamental question: can these AI agents actually detect novel threats, or are they merely retrieving memorized information? The benchmark's current design cannot answer that question.
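The contamination concern comes down to a date comparison: any finding published before a model's knowledge cutoff may have been part of its training data. A minimal sketch of that check follows. The `BenchmarkItem` fields, the item IDs, and the dates are all hypothetical and do not reflect EVMbench's actual schema or contents; "August 2025" is assumed to mean the end of that month.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkItem:
    """One benchmark entry; the fields are hypothetical, not EVMbench's schema."""
    item_id: str
    disclosed_on: date  # date the audit finding became public


def split_by_cutoff(items, knowledge_cutoff):
    """Partition items into (possibly_seen, unseen) relative to a model's cutoff."""
    possibly_seen = [i for i in items if i.disclosed_on <= knowledge_cutoff]
    unseen = [i for i in items if i.disclosed_on > knowledge_cutoff]
    return possibly_seen, unseen


# Illustrative entries only; the IDs and dates are made up.
items = [
    BenchmarkItem("finding-a", date(2024, 3, 12)),
    BenchmarkItem("finding-b", date(2025, 6, 1)),
    BenchmarkItem("finding-c", date(2025, 11, 20)),
]
cutoff = date(2025, 8, 31)  # "August 2025", assumed to mean end of month

possibly_seen, unseen = split_by_cutoff(items, cutoff)
print([i.item_id for i in possibly_seen])  # ['finding-a', 'finding-b']
print([i.item_id for i in unseen])         # ['finding-c']
```

Only the `unseen` group can show whether a model finds bugs it could not have memorized. A publication date before the cutoff does not prove a model saw the finding, only that it could have.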

Factual Errors in the Vulnerability Dataset

Beyond the contamination issue, OpenZeppelin's auditors found concrete factual errors within EVMbench's dataset. At least four vulnerabilities classified as "high risk" turned out to be non-functional: the described attack vectors simply do not work in practice.

Despite this, AI agents received credit for correctly identifying these nonexistent issues. OpenZeppelin emphasized that these are not subjective disagreements about severity ratings but clear cases where the described attack scenario is technically impossible.
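To see how invalid entries distort a score, consider a toy calculation. Every number below is invented for illustration; only the count of four invalid entries echoes the article, and `detection_rate` is a hypothetical helper rather than EVMbench's scoring method.

```python
def detection_rate(found, ground_truth):
    """Share of ground-truth entries that the agent reported."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return len(found & ground_truth) / len(ground_truth)


# Hypothetical data: ten labelled findings, four of which do not work in practice.
ground_truth = {f"H-{n}" for n in range(1, 11)}
invalid = {"H-7", "H-8", "H-9", "H-10"}
agent_found = {"H-1", "H-2", "H-7", "H-8", "H-9", "H-10"}

# Raw score: the agent gets credit for "finding" the four invalid entries.
raw = detection_rate(agent_found, ground_truth)

# Corrected score: invalid entries are dropped from both sets before scoring.
corrected = detection_rate(agent_found - invalid, ground_truth - invalid)

print(f"raw: {raw:.2f}")              # raw: 0.60
print(f"corrected: {corrected:.2f}")  # corrected: 0.33
```

In this made-up case, removing the invalid entries cuts the agent's score almost in half. A handful of bad labels can therefore change which model appears to perform best.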

OpenZeppelin's Verdict: Standards Must Match Ambitions

Despite the identified shortcomings, OpenZeppelin affirmed that artificial intelligence will play a pivotal role in the future of blockchain security. However, the firm cautioned against rushing deployment at the expense of data quality and testing rigor.

OpenZeppelin summed up its position this way: "The question is not whether AI will change smart contract security. It will. The question is whether the benchmarks and data we use to build these tools will meet the same standards as the contracts they are meant to protect."

This audit follows a broader trend of scrutinizing AI tools in the security space. In November, Microsoft researchers introduced their own testing environment for AI agents and identified vulnerabilities inherent to modern digital assistants. OpenZeppelin's review of EVMbench adds another layer of critical evaluation as the industry works to integrate AI into blockchain security workflows.


Frequently Asked Questions

What is OpenAI's EVMbench?

EVMbench is a benchmark launched by OpenAI in partnership with Paradigm in mid-February. It evaluates AI agents' ability to detect, fix, and exploit vulnerabilities in smart contracts using a dataset of 120 vulnerabilities from 2024-2025 audits.

What problems did OpenZeppelin find in EVMbench?

OpenZeppelin identified two main issues: training data contamination, where AI models may have memorized the vulnerabilities from their training data, and factual errors — at least four high-risk vulnerabilities in the dataset were found to be non-functional.

What is data contamination in AI benchmarks?

Data contamination occurs when AI models being tested have already encountered the test data during training. In EVMbench's case, the tested models had a knowledge cutoff of August 2025 and may have already known about the 2024-2025 vulnerabilities in the benchmark.

Who audited OpenAI's EVMbench benchmark?

The audit was conducted by OpenZeppelin, a blockchain cybersecurity firm that provides security services to major DeFi protocols including Aave, Lido, and Uniswap.

Does OpenZeppelin think AI can improve smart contract security?

Yes, OpenZeppelin confirmed that AI will play a key role in blockchain security's future. However, the firm stressed that benchmarks and datasets must meet the same high standards as the smart contracts they are designed to protect.
