N-Day-Bench
N-Day-Bench measures the ability of frontier language models to find real-world vulnerabilities, or "N-days," disclosed after their respective knowledge cut-off dates. All models are given the same harness and the same context, with no leeway for reward hacking. The benchmark exists to measure the real cybersecurity capabilities of large language models (LLMs), specifically vulnerability discovery.
The benchmark is adaptive: test cases are updated on a monthly cadence, and each model is upgraded to its latest version and checkpoint.
All traces are publicly browsable.
A project from Winfunc Research
Summary
Latest benchmark run overview
Finder models
| Model | Avg score | Submissions | Avg findings |
|---|---|---|---|
| openai/gpt-5.4 | 83.93 | 44 | 1.07 |
| z-ai/glm-5.1 | 80.13 | 44 | 1.23 |
| anthropic/claude-opus-4.6 | 79.95 | 43 | 1.16 |
| moonshotai/kimi-k2.5 | 77.18 | 37 | 1.05 |
| google/gemini-3.1-pro-preview | 68.50 | 44 | 0.91 |
Recent traces
- judge-run: trace_32193f46de30408c9b2e07c10cb77973
- finder-run: trace_d0f96be9b726419ba37a391878d89902
- judge-run: trace_ad22023d5c654d50a2c93a0d4d685fe2
- judge-run: trace_44a6ff17f42f4bfc942bc4341ec34827
- judge-run: trace_26dba0da5e6a4d5389c50ad642243bdf
- judge-run: trace_c1d765f31bfb493c8902cc2284c403bd
- judge-run: trace_cfe310bab72f4171a3ded2f379d02576
- judge-run: trace_a0775cae04054609ae43229d8e9137ee