Romero Lab · Duke University
🧬 BioDesignBench
Can LLM agents orchestrate stochastic protein-design pipelines?
Top-tier agents now surpass a deterministic pipeline — but invoke evaluation tools at only 14% of expert depth. Guidance rescues coverage, not depth.
| # | Agent | Organization | Mode | MCP | Score | Tasks | Zero-Score |
|---|---|---|---|---|---|---|---|
| – | Human Oracle baseline | Romero Lab | – | – | 74.8 | 76/76 | 0 |
| – | Human Expert baseline | Romero Lab | – | – | 61.2 | 76/76 | 0 |
| 1 | DeepSeek V3 | DeepSeek | benchmark | reference | 60.4 | 76/76 | 1 |
| 2 | DeepSeek V3 | DeepSeek | user | reference | 58.5 | 76/76 | 7 |
| 3 | GPT-5 | OpenAI | benchmark | reference | 55.6 | 76/76 | 2 |
| 4 | GPT-5 | OpenAI | user | reference | 55.3 | 76/76 | 4 |
| – | Hardcoded Pipeline baseline | Deterministic | – | – | 54.2 | 76/76 | 0 |
| 5 | Claude Sonnet 4.5 | Anthropic | user | reference | 50.2 | 76/76 | 16 |
| 6 | Claude Sonnet 4.5 | Anthropic | benchmark | reference | 41.2 | 76/76 | 23 |
| 7 | Gemini 2.5 Pro | Google | user | reference | 8.8 | 76/76 | 74 |
| 8 | Gemini 2.5 Pro | Google | benchmark | reference | 8.1 | 76/76 | 75 |
| Approach ↓ / Subject → | Antibody | Binder | Enzyme | Scaffold | Fluorescent Prot. | Mean |
|---|---|---|---|---|---|---|
| De Novo Design | 79 (n=4) | 72 (n=19) | 76 (n=2) | 76 (n=21) | 79 (n=1) | 76.2 |
| Redesign | 69 (n=5) | n/a | 76 (n=10) | 77 (n=4) | 77 (n=10) | 74.8 |
Causal interventions on the depth gap
Causal intervention experiments on the depth gap. A representative 18-task subset spanning all 9 occupied taxonomy cells is rerun under three conditions: baseline (no intervention), forced_depth (a mandate of ≥3 evaluation passes per candidate), and low_diversity_control (constrain candidate count without forcing depth).
| Run | Condition | Score | Δ vs baseline | Approach / Orch. | Quality | Diversity |
|---|---|---|---|---|---|---|
| DeepSeek V3 – baseline | Baseline | 58.7 | – | 13.4 / 11.2 | 16.1 | 3.6 |
| GPT-5 – baseline | Baseline | 46.8 | – | 8.3 / 6.2 | 15.4 | 3.9 |
| Human Expert – baseline | Baseline | 56.7 | – | 18.3 / 9.3 | 11.1 | 2.3 |
| DeepSeek V3 – forced depth | Forced Depth | 68.1 | +9.3 | 18.4 / 12.3 | 16.1 | 3.9 |
| GPT-5 – forced depth | Forced Depth | 62.7 | +15.9 | 18.3 / 11.7 | 15.0 | 3.1 |
| DeepSeek V3 – low diversity | Low-Diversity Control | 56.4 | -2.3 | 13.1 / 11.1 | 16.0 | 3.2 |
| GPT-5 – low diversity | Low-Diversity Control | 61.5 | +14.7 | 13.1 / 12.0 | 16.2 | 3.2 |
| Human Expert – shallow | Low-Diversity Control | 55.1 | -1.6 | 18.2 / 9.3 | 11.2 | 0.6 |
Scoring uses the same 100-point hybrid rubric as the main leaderboard but is restricted to 18 representative tasks; absolute values therefore differ from the full-benchmark mean. The delta vs baseline compares each agent against its own untreated baseline run, isolating the intervention effect.
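For concreteness, a minimal sketch of how a harness could implement the forced_depth mandate; the tool names, hook signatures, and bookkeeping below are assumptions, not the actual intervention code:

```python
# Hypothetical harness hooks for the forced_depth condition: every candidate
# must receive at least `min_passes` evaluation-tool calls before the agent
# may finalize. EVAL_TOOLS and the candidate_id argument are illustrative.

EVAL_TOOLS = {"predict_structure", "score_stability", "compute_interface_metrics"}

def record_eval_pass(tool_name, tool_args, eval_counts):
    """Count one evaluation pass per (candidate, evaluation tool) call."""
    candidate = tool_args.get("candidate_id")
    if tool_name in EVAL_TOOLS and candidate is not None:
        eval_counts[candidate] = eval_counts.get(candidate, 0) + 1

def may_finalize(candidate_ids, eval_counts, min_passes=3):
    """Block submission until every candidate has >= min_passes evaluations."""
    return all(eval_counts.get(c, 0) >= min_passes for c in candidate_ids)
```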
Submit your agent
BioDesignBench evaluates models inside Romero Lab infrastructure to keep the 76 task specifications contamination-clean. You provide an LLM API key and a model name, and we run the BioDesignBench agent loop against your model with the reference 17-tool MCP server. Task content never leaves Romero Lab except through your chosen LLM provider's API call.
- Your API key is stored on the submission row only between submission and dispatch, then scrubbed automatically regardless of whether the run succeeded.
- Each task carries a unique 16-character canary token (invisible HTML comment) so we can retrospectively detect leakage in published models.
- The MCP server (reference or custom) sees only operational tool arguments, never the raw task description or evaluation criteria.
- Reference (default): your agent uses our hosted protein-design-mcp endpoint. Eligible for the reference ranking.
- Custom: provide your own public MCP URL implementing the same 17-tool schema. Useful for benchmarking new tool implementations against an identical model under identical task prompts. Tagged with a custom badge.
Submission status
Check your submission status or manage the pipeline (admin only).
What is BioDesignBench?
BioDesignBench is a benchmark for evaluating LLM agents as orchestrators of multi-step stochastic protein-design pipelines. Unlike chemistry- or code-agent benchmarks, where tool chains are largely deterministic, protein design demands repeated sampling from generative tools (RFdiffusion, ProteinMPNN) and iterative cross-validation through several biophysical metrics. We test the full agentic loop — plan → sample → evaluate across multiple metrics → iterate — over 76 expert-curated tasks drawn from 2024–2026 literature, exposed through 17 MCP-integrated tools.
(2 approaches × 5 subjects)
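As a rough illustration of that plan → sample → evaluate → iterate loop, a minimal sketch in Python; the `llm`/`mcp` client interfaces, tool names, and acceptance thresholds are placeholders rather than the benchmark harness or the real 17-tool schema:

```python
# Minimal sketch of the plan -> sample -> evaluate -> iterate loop.
# `llm`, `mcp`, the tool names, and the pLDDT/pTM thresholds are placeholders.

def run_design_task(task_prompt, llm, mcp, max_rounds=5, n_candidates=8):
    plan = llm.plan(task_prompt)                      # plan
    accepted = []
    for _ in range(max_rounds):
        # stochastic sampling: repeated calls to a generative backbone tool
        backbones = [mcp.call("generate_backbone", spec=plan.spec)
                     for _ in range(n_candidates)]
        for backbone in backbones:
            sequence = mcp.call("design_sequence", backbone=backbone)
            # evaluate each candidate across several biophysical metrics
            metrics = mcp.call("predict_structure", sequence=sequence)
            if metrics["plddt"] > 80 and metrics["ptm"] > 0.7:
                accepted.append((sequence, metrics))
        plan = llm.revise(plan, accepted)             # iterate on survivors
        if len(accepted) >= plan.n_required:
            break
    return accepted
```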
Three principal findings
1. Top-tier agents now beat a deterministic pipeline
DeepSeek V3 and GPT-5 surpass a hand-engineered hardcoded pipeline (54.2) under both modes. Autonomous protein-design orchestration is no longer infeasible — but a substantial gap to the human expert (61.2) and oracle (74.8) remains.
2. Coverage–depth dissociation
Workflow guidance closes the coverage gap (Rescue Index up to +3.01) but leaves utilisation depth unchanged (Rescue Index ≈ 0). Better tool documentation can teach agents which tools to call, but cannot teach them to call those tools with the iterative depth that expert practice demands.
3. Evaluation depth, not tool knowledge, is the bottleneck
Across 836 task–condition observations, evaluation depth per candidate correlates with total score at ρ = 0.685 (p < 10⁻¹¹⁷). LLM agents generate backbone candidates at expert-level rates but evaluate each one at only 14% of expert depth. Forced-depth interventions confirm this is causal — see the Depth Gap tab.
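Assuming the reported ρ is a Spearman rank correlation over per-run logs, it could be recomputed with a sketch like the one below (the observation field names are hypothetical):

```python
from scipy.stats import spearmanr

def depth_score_correlation(observations):
    """observations: one record per task-condition run (836 in the paper),
    with hypothetical keys 'eval_calls_per_candidate' and 'total_score'."""
    depths = [o["eval_calls_per_candidate"] for o in observations]
    scores = [o["total_score"] for o in observations]
    return spearmanr(depths, scores)  # reported: rho = 0.685, p < 1e-117
```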
How to submit
Unlike most agent benchmarks, you do not host an HTTP endpoint. The 76 task descriptions never leave Romero Lab infrastructure. Instead you provide:
- an LLM provider + API key (Anthropic / OpenAI / Google / DeepSeek). We run the BioDesignBench agent loop against your chosen model inside the leaderboard backend. Your key is scrubbed from our records immediately after the dispatch phase completes.
- optionally, a custom MCP URL if you want to evaluate your own tool implementations. Otherwise, the agent calls our reference protein-design-mcp endpoint (in progress).
Data flow
Each task prompt is sent to your chosen LLM provider via their standard API (Anthropic, OpenAI, Google, DeepSeek) — that single channel is the only path by which task data leaves Romero Lab. The MCP server (reference or custom) only ever sees operational tool arguments (sequences, PDB paths, hotspot residues); it never sees the raw task prompt or evaluation criteria. Every task prompt also carries a unique 16-character canary token as an HTML comment, for retrospective leakage detection.
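A sketch of what this canary mechanism could look like; the token format, comment syntax, and leak check are assumptions rather than the lab's actual implementation:

```python
import secrets

def make_canary():
    """Generate a 16-character (8-byte hex) canary token."""
    return secrets.token_hex(8)

def embed_canary(task_prompt, canary):
    """Append the canary as an HTML comment, invisible in rendered views."""
    return f"{task_prompt}\n<!-- canary:{canary} -->"

def leaked_canaries(model_output, canaries):
    """Retrospective check: which canary tokens does a model reproduce?"""
    return {c for c in canaries if c in model_output}
```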
Bring your own tools (Custom MCP)
If you want to benchmark a new tool implementation (a faster structure predictor, a different diffusion backbone, your own stability model) against the same 76 tasks and rubric, stand up an HTTPS endpoint that satisfies the MCP contract and paste the URL into the submission form's Advanced: Custom MCP section:
- Contract + hosting options: leaderboard README
- Minimal FastAPI stub (~150 lines): example_mcp_server.py
- Reference implementation to fork: RomeroLab/protein-design-mcp
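For orientation, a far shorter sketch of such an endpoint than the ~150-line stub; route shapes, payload fields, and tool names are assumptions, and the authoritative contract lives in the leaderboard README:

```python
# Hypothetical minimal custom MCP endpoint. Route shapes, payload fields, and
# tool names are illustrative; see the leaderboard README for the real contract.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Map tool names to your own implementations (e.g. a faster structure predictor).
my_tool_registry = {
    "generate_backbone": lambda **kwargs: {"pdb_path": "/tmp/backbone.pdb"},
    # ... the remaining tools in the 17-tool schema
}

class ToolCall(BaseModel):
    tool: str        # one of the 17 tool names
    arguments: dict  # operational arguments only (sequences, PDB paths, ...)

@app.get("/tools")
def list_tools():
    """Advertise available tools so the agent loop can discover capabilities."""
    return {"tools": sorted(my_tool_registry)}

@app.post("/call")
def call_tool(req: ToolCall):
    """Dispatch a single tool call to the local implementation."""
    result = my_tool_registry[req.tool](**req.arguments)
    return {"result": result}
```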
Limits
- Maximum 1 submission per calendar month per organization
- 73 hidden tasks are used for ranking; 3 public example tasks are available for development
- LLM-judge API costs are paid by Romero Lab; your own agent LLM calls are billed to your provider
Scoring rubric (100 points, hybrid)
Scores combine 72 algorithmic points from deterministic biophysical metrics with 28 LLM-judge points assessed by a 3-judge panel (PoLL) with self-exclusion to mitigate self-preference bias. Each component is capped at its rubric maximum to prevent double counting.
Approach (20 pts) — strategic appropriateness of tool selection across 10 functional categories (backbone generation, inverse folding, structure prediction, etc.).
Orchestration (15 pts) — pipeline ordering, intermediate validation, and adaptive iteration.
Quality (35 pts) — 100% algorithmic. Continuous 4-band interpolation over Boltz-2 re-prediction metrics (pLDDT, pTM, ipTM, i_pAE), eliminating LLM judgement variance on biophysical quantities; a sketch of the interpolation follows this list.
Feasibility (15 pts) — valid amino acids, length constraints, composition, and biophysical plausibility.
Novelty (5 pts) — sequence identity to reference (lower identity = more novel).
Diversity (10 pts) — number and pairwise diversity of generated designs.
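To illustrate the Quality mechanism referenced above, a sketch of 4-band piecewise-linear interpolation over one re-prediction metric, with made-up band edges and point allocations; the published rubric constants and per-metric split of the 35 Quality points are not reproduced here:

```python
import numpy as np

# Made-up band edges for pLDDT and the points awarded at each edge; this only
# illustrates 4-band piecewise-linear interpolation, not the real rubric values.
PLDDT_EDGES = [50.0, 70.0, 80.0, 90.0, 100.0]
PLDDT_POINTS = [0.0, 3.0, 6.0, 8.0, 10.0]

def band_score(value, edges=PLDDT_EDGES, points=PLDDT_POINTS):
    """Piecewise-linear interpolation, clipped to the first/last band value."""
    return float(np.interp(value, edges, points))

# e.g. band_score(85.0) -> 7.0 (halfway through the 80-90 band)
```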
Five-layer contamination defense
Every evaluated LLM may have read protein-design literature during pretraining, so we use a layered defense:
- All 76 tasks derived from publications dated 2024–2026, post-dating model training cutoffs.
- Task prompts paraphrased and restructured — no verbatim passages from source literature.
- Targets specified by biological function and structural constraints, not by name or PDB identifier.
- 12 decoy tasks with deliberately fabricated targets to detect memorisation-based responses.
- n-gram overlap analysis between agent outputs and source publications — no verbatim regurgitation above the 8-gram threshold across any condition.
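A sketch of the kind of 8-gram overlap check described in the last item; the tokenisation and helper names are assumptions, and the benchmark's exact procedure may differ:

```python
import re

def word_ngrams(text, n=8):
    """Lowercased word n-grams; n=8 matches the leakage threshold above."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(agent_output, source_publication, n=8):
    """Shared n-grams between an agent output and a source publication;
    an empty set means no verbatim regurgitation at this threshold."""
    return word_ngrams(agent_output, n) & word_ngrams(source_publication, n)
```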
Citation
@article{biodesignbench2026,
  title={Evaluating LLM-Driven Protein Design: Agents Lack Iterative Evaluation Depth},
  author={Kim, Jeonghyeon and Romero, Philip},
  year={2026}
}