N-Day-Bench
N-Day-Bench measures the ability of frontier language models to find real-world vulnerabilities, or "N-days," disclosed after their respective knowledge cut-off dates. All models are given the same harness and the same context, with no leeway for reward hacking. The benchmark exists to measure the real cybersecurity capabilities of large language models (LLMs), specifically vulnerability discovery.
The benchmark is adaptive: test cases are updated on a monthly cadence, and each model is upgraded to its latest version and checkpoint.
All traces are publicly browsable.
A project from Winfunc Research
Summary
Latest benchmark run overview
Finder models
| Model | Avg score | Submissions | Avg findings |
|---|---|---|---|
| openai/gpt-5.4 | 83.93 | 44 | 1.07 |
| z-ai/glm-5.1 | 80.13 | 44 | 1.23 |
| anthropic/claude-opus-4.6 | 79.95 | 43 | 1.16 |
| moonshotai/kimi-k2.5 | 77.18 | 37 | 1.05 |
| google/gemini-3.1-pro-preview | 68.50 | 44 | 0.91 |
Recent traces
- judge-run: trace_32193f46de30408c9b2e07c10cb77973
- finder-run: trace_d0f96be9b726419ba37a391878d89902
- judge-run: trace_ad22023d5c654d50a2c93a0d4d685fe2
- judge-run: trace_44a6ff17f42f4bfc942bc4341ec34827
- judge-run: trace_26dba0da5e6a4d5389c50ad642243bdf
- judge-run: trace_c1d765f31bfb493c8902cc2284c403bd
- judge-run: trace_cfe310bab72f4171a3ded2f379d02576
- judge-run: trace_a0775cae04054609ae43229d8e9137ee