Gradio

Mode

MCP Tools

Show

#	Agent	Organization	Mode	MCP	Score	Tasks	Zero-Score
📄	📄 Human Oracle baseline	Romero Lab	—	—	74.8	76/76	0
👨‍🔬	👨‍🔬 Human Expert baseline	Romero Lab	—	—	61.2	76/76	0
1	DeepSeek V3	DeepSeek	benchmark	reference	60.4	76/76	1
2	DeepSeek V3	DeepSeek	user	reference	58.5	76/76	7
3	GPT-5	OpenAI	benchmark	reference	55.6	76/76	2
4	GPT-5	OpenAI	user	reference	55.3	76/76	4
🔧	🔧 Hardcoded Pipeline baseline	Deterministic	—	—	54.2	76/76	0
5	Claude Sonnet 4.5	Anthropic	user	reference	50.2	76/76	16
6	Claude Sonnet 4.5	Anthropic	benchmark	reference	41.2	76/76	23
7	Gemini 2.5 Pro	Google	user	reference	8.8	76/76	74
8	Gemini 2.5 Pro	Google	benchmark	reference	8.1	76/76	75

Select Agent

Approach ↓ / Subject →	Antibody	Binder	Enzyme	Scaffold	Fluorescent Prot.	Mean
De Novo Design	79 n=4	72 n=19	76 n=2	76 n=21	79 n=1	76.2
Redesign	69 n=5	n/a	76 n=10	77 n=4	77 n=10	74.8

Agent 1

Agent 2

Mode semantics: Benchmark mode exposes atomic tools without pipeline hints (unguided); User mode packages them into composite workflows with explicit pipeline structure (guided). Guidance lifts the lowest-tier agents but does not consistently help capable ones, and never closes the depth gap (see Depth Gap tab).

DeepSeek V3

Benchmark60.4

User58.5

Delta+-2.0 (+-3%)

approach-0.3

orchestration-0.2

quality-0.3

feasibility-0.9

novelty-0.2

diversity-0.0

GPT-5

Benchmark55.6

User55.3

Delta+-0.4 (+-1%)

approach+0.7

orchestration+1.4

quality-2.1

feasibility-0.1

novelty-0.1

diversity-0.2

Claude Sonnet 4.5

Benchmark41.2

User50.2

Delta+9.1 (+22%)

approach+1.7

orchestration+1.6

quality+3.8

feasibility+0.8

novelty+0.4

diversity+0.7

Gemini 2.5 Pro

Benchmark8.1

User8.8

Delta+0.6 (+8%)

approach-0.2

orchestration+0.3

quality+0.2

feasibility+0.2

novelty+0.1

diversity+0.0

Causal interventions on the depth gap

Causal intervention experiments on the depth gap. 18 representative tasks rerun under three conditions: baseline (no intervention), forced_depth (mandate ≥3 evaluation passes per candidate), and low_diversity_control (constrain candidate count without forcing depth). Reruns are scored on a representative 18-task subset that spans all 9 occupied taxonomy cells.

Headline: Forced-depth lifts DeepSeek V3 by +9.3 and GPT-5 by +15.9 points without any change to the underlying model or tools, while the low-diversity control hurts DeepSeek V3 (−2.3). The dissociation is cleanest on the strongest agent, where it provides direct causal evidence that evaluation depth — not the mere act of process intervention — drives the gain. GPT-5's response is more uniform across both interventions; we report the raw deltas without smoothing.

Run	Condition	Score	Δ vs baseline	Approach / Orch.	Quality	Diversity
DeepSeek V3 — baseline	Baseline	58.7	—	13.4 / 11.2	16.1	3.6
GPT-5 — baseline	Baseline	46.8	—	8.3 / 6.2	15.4	3.9
Human Expert — baseline	Baseline	56.7	—	18.3 / 9.3	11.1	2.3
DeepSeek V3 — forced depth	Forced Depth	68.1	+9.3	18.4 / 12.3	16.1	3.9
GPT-5 — forced depth	Forced Depth	62.7	+15.9	18.3 / 11.7	15.0	3.1
DeepSeek V3 — low diversity	Low-Diversity Control	56.4	-2.3	13.1 / 11.1	16.0	3.2
GPT-5 — low diversity	Low-Diversity Control	61.5	+14.7	13.1 / 12.0	16.2	3.2
Human Expert — shallow	Low-Diversity Control	55.1	-1.6	18.2 / 9.3	11.2	0.6

Scoring uses the same 100-point hybrid rubric as the main leaderboard but is restricted to 18 representative tasks; absolute values therefore differ from the full-benchmark mean. The delta vs baseline compares each agent against its own untreated baseline run, isolating the intervention effect.

Submit your agent

BioDesignBench evaluates models inside Romero Lab infrastructure to keep the 76 task specifications contamination-clean. You provide an LLM API key and a model name, and we run the BioDesignBench agent loop against your model with the reference 17-tool MCP server. Task content never leaves Romero Lab except through your chosen LLM provider's API call.

How your credentials are handled:

Your API key is stored on the submission row only between submission and dispatch, then scrubbed automatically regardless of whether the run succeeded.
Each task carries a unique 16-character canary token (invisible HTML comment) so we can retrospectively detect leakage in published models.
The MCP server (reference or custom) sees only operational tool arguments, never the raw task description or evaluation criteria.

Reference vs Custom MCP

Reference (default): your agent uses our hosted protein-design-mcp endpoint. Eligible for the reference ranking.
Custom: provide your own public MCP URL implementing the same 17-tool schema. Useful for benchmarking new tool implementations against an identical model under identical task prompts. Tagged with a custom badge.

Rate limit: 1 submission per calendar month per organization. Your LLM-API and (if reference) MCP-GPU costs are billed to your account / paid by Romero Lab respectively; please be considerate.

Submission status

Check your submission status or manage the pipeline (admin only).

What is BioDesignBench?

BioDesignBench is a benchmark for evaluating LLM agents as orchestrators of multi-step stochastic protein-design pipelines. Unlike chemistry- or code-agent benchmarks, where tool chains are largely deterministic, protein design demands repeated sampling from generative tools (RFdiffusion, ProteinMPNN) and iterative cross-validation through several biophysical metrics. We test the full agentic loop — plan → sample → evaluate across multiple metrics → iterate — over 76 expert-curated tasks drawn from 2024–2026 literature, exposed through 17 MCP-integrated tools.

76

design tasks

9

taxonomy cells
(2 approaches × 5 subjects)

17

MCP tools

100

point rubric

Three principal findings

1. Top-tier agents now beat a deterministic pipeline

DeepSeek V3 and GPT-5 surpass a hand-engineered hardcoded pipeline (54.2) under both modes. Autonomous protein-design orchestration is no longer infeasible — but a substantial gap to the human expert (61.3) and oracle (74.9) remains.

2. Coverage–depth dissociation

Workflow guidance closes the coverage gap (Rescue Index up to +3.01) but leaves utilisation depth unchanged (Rescue Index ≈ 0). Better tool documentation can teach agents which tools to call, but cannot teach them to call those tools with the iterative depth that expert practice demands.

3. Evaluation depth, not tool knowledge, is the bottleneck

Across 836 task–condition observations, evaluation depth per candidate correlates with total score at ρ = 0.685 (p < 10^-117). LLM agents generate backbone candidates at expert-level rates but evaluate each one at only 14% of expert depth. Forced-depth interventions confirm this is causal — see the Depth Gap tab.

How to submit

Unlike most agent benchmarks, you do not host an HTTP endpoint. The 76 task descriptions never leave Romero Lab infrastructure. Instead you provide:

an LLM provider + API key (Anthropic / OpenAI / Google / DeepSeek). We run the BioDesignBench agent loop against your chosen model inside the leaderboard backend. Your key is scrubbed from our records immediately after the dispatch phase completes.
optionally, a custom MCP URL if you want to evaluate your own tool implementations. Otherwise, the agent calls our reference protein-design-mcp endpoint (in progress).

Data flow

Each task prompt is sent to your chosen LLM provider via their standard API (Anthropic, OpenAI, Google, DeepSeek) — that single channel is the only path by which task data leaves Romero Lab. The MCP server (reference or custom) only ever sees operational tool arguments (sequences, PDB paths, hotspot residues); it never sees the raw task prompt or evaluation criteria. Every task prompt also carries a unique 16-character canary token as an HTML comment, for retrospective leakage detection.

Bring your own tools (Custom MCP)

If you want to benchmark a new tool implementation (a faster structure predictor, a different diffusion backbone, your own stability model) against the same 76 tasks and rubric, stand up an HTTPS endpoint that satisfies the MCP contract and paste the URL into the submission form's Advanced: Custom MCP section:

Contract + hosting options: leaderboard README
Minimal FastAPI stub (~150 lines): example_mcp_server.py
Reference implementation to fork: RomeroLab/protein-design-mcp

Limits

Maximum 1 submission per calendar month per organization
73 hidden tasks are used for ranking; 3 public example tasks are available for development
LLM-judge API costs are paid by Romero Lab; your own agent LLM calls are billed to your provider

Scoring rubric (100 points, hybrid)

Scores combine 72 algorithmic points from deterministic biophysical metrics with 28 LLM-judge points assessed by a 3-judge panel (PoLL) with self-exclusion to mitigate self-preference bias. Each component is capped at its rubric maximum to prevent double counting.

Approach (20 pts) — strategic appropriateness of tool selection across 10 functional categories (backbone generation, inverse folding, structure prediction, etc.).

Orchestration (15 pts) — pipeline ordering, intermediate validation, and adaptive iteration.

Quality (35 pts) — 100% algorithmic. Continuous 4-band interpolation over Boltz-2 re-prediction metrics (pLDDT, pTM, ipTM, i_pAE), eliminating LLM judgement variance on biophysical quantities.

Feasibility (15 pts) — valid amino acids, length constraints, composition, and biophysical plausibility.

Novelty (5 pts) — sequence identity to reference (lower identity = more novel).

Diversity (10 pts) — number and pairwise diversity of generated designs.

Five-layer contamination defense

Every evaluated LLM may have read protein-design literature during pretraining, so we use a layered defense:

All 76 tasks derived from publications dated 2024–2026, post-dating model training cutoffs.
Task prompts paraphrased and restructured — no verbatim passages from source literature.
Targets specified by biological function and structural constraints, not by name or PDB identifier.
12 decoy tasks with deliberately fabricated targets to detect memorisation-based responses.
n-gram overlap analysis between agent outputs and source publications — no verbatim regurgitation above the 8-gram threshold across any condition.

Citation

@article{biodesignbench2026,
  title={Evaluating LLM-Driven Protein Design:
         Agents Lack Iterative Evaluation Depth},
  author={Kim, Jeonghyeon and Romero, Philip},
  year={2026}
}

🧬 BioDesignBench