Romero Lab · Duke University
🧬 BioDesignBench
Can LLM agents orchestrate stochastic protein-design pipelines?
Top-tier agents now surpass a deterministic pipeline — but invoke evaluation tools at only 14% of expert depth. Guidance rescues coverage, not depth.
| # | Agent | Organization | Mode | MCP | Score | Tasks | Zero-Score |
|---|---|---|---|---|---|---|---|
| – | Human Oracle baseline | Romero Lab | – | – | 74.8 | 76/76 | 0 |
| – | Human Expert baseline | Romero Lab | – | – | 61.2 | 76/76 | 0 |
| 1 | DeepSeek V3 | DeepSeek | benchmark | reference | 60.4 | 76/76 | 1 |
| 2 | DeepSeek V3 | DeepSeek | user | reference | 58.5 | 76/76 | 7 |
| 3 | GPT-5 | OpenAI | benchmark | reference | 55.6 | 76/76 | 2 |
| 4 | GPT-5 | OpenAI | user | reference | 55.3 | 76/76 | 4 |
| – | Hardcoded Pipeline baseline | Deterministic | – | – | 54.2 | 76/76 | 0 |
| 5 | Claude Sonnet 4.5 | Anthropic | user | reference | 50.2 | 76/76 | 16 |
| 6 | Claude Sonnet 4.5 | Anthropic | benchmark | reference | 41.2 | 76/76 | 23 |
| 7 | Gemini 2.5 Pro | Google | user | reference | 8.8 | 76/76 | 74 |
| 8 | Gemini 2.5 Pro | Google | benchmark | reference | 8.1 | 76/76 | 75 |
| Approach ↓ / Subject → | Antibody | Binder | Enzyme | Scaffold | Fluorescent Prot. | Mean |
|---|---|---|---|---|---|---|
| De Novo Design | 79 (n=4) | 72 (n=19) | 76 (n=2) | 76 (n=21) | 79 (n=1) | 76.2 |
| Redesign | 69 (n=5) | n/a | 76 (n=10) | 77 (n=4) | 77 (n=10) | 74.8 |
Causal interventions on the depth gap
Causal intervention experiments on the depth gap. A representative 18-task subset spanning all 9 occupied taxonomy cells is rerun under three conditions: baseline (no intervention), forced_depth (a mandate of ≥3 evaluation passes per candidate), and low_diversity_control (constrain candidate count without forcing depth).
| Run | Condition | Score | Δ vs baseline | Approach / Orch. | Quality | Diversity |
|---|---|---|---|---|---|---|
| DeepSeek V3 – baseline | Baseline | 58.7 | – | 13.4 / 11.2 | 16.1 | 3.6 |
| GPT-5 – baseline | Baseline | 46.8 | – | 8.3 / 6.2 | 15.4 | 3.9 |
| Human Expert – baseline | Baseline | 56.7 | – | 18.3 / 9.3 | 11.1 | 2.3 |
| DeepSeek V3 – forced depth | Forced Depth | 68.1 | +9.3 | 18.4 / 12.3 | 16.1 | 3.9 |
| GPT-5 – forced depth | Forced Depth | 62.7 | +15.9 | 18.3 / 11.7 | 15.0 | 3.1 |
| DeepSeek V3 – low diversity | Low-Diversity Control | 56.4 | -2.3 | 13.1 / 11.1 | 16.0 | 3.2 |
| GPT-5 – low diversity | Low-Diversity Control | 61.5 | +14.7 | 13.1 / 12.0 | 16.2 | 3.2 |
| Human Expert – shallow | Low-Diversity Control | 55.1 | -1.6 | 18.2 / 9.3 | 11.2 | 0.6 |
Scoring uses the same 100-point hybrid rubric as the main leaderboard but is restricted to 18 representative tasks; absolute values therefore differ from the full-benchmark mean. The delta vs baseline compares each agent against its own untreated baseline run, isolating the intervention effect.
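For concreteness, a minimal sketch of how a harness could implement the forced_depth mandate; the tool names, hook signatures, and bookkeeping below are assumptions, not the actual intervention code:

```python
# Hypothetical harness hooks for the forced_depth condition: every candidate
# must receive at least `min_passes` evaluation-tool calls before the agent
# may finalize. EVAL_TOOLS and the candidate_id argument are illustrative.

EVAL_TOOLS = {"predict_structure", "score_stability", "compute_interface_metrics"}

def record_eval_pass(tool_name, tool_args, eval_counts):
    """Count one evaluation pass per (candidate, evaluation tool) call."""
    candidate = tool_args.get("candidate_id")
    if tool_name in EVAL_TOOLS and candidate is not None:
        eval_counts[candidate] = eval_counts.get(candidate, 0) + 1

def may_finalize(candidate_ids, eval_counts, min_passes=3):
    """Block submission until every candidate has >= min_passes evaluations."""
    return all(eval_counts.get(c, 0) >= min_passes for c in candidate_ids)
```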
Submit your agent
BioDesignBench evaluates models inside Romero Lab infrastructure to keep the 76 task specifications contamination-clean. You provide an LLM API key and a model name, and we run the BioDesignBench agent loop against your model with the reference 17-tool MCP server. Task content never leaves Romero Lab except through your chosen LLM provider's API call.
- Your API key is stored on the submission row only between submission and dispatch, then scrubbed automatically regardless of whether the run succeeded.
- Each task carries a unique 16-character canary token (invisible HTML comment) so we can retrospectively detect leakage in published models.
- The MCP server (reference or custom) sees only operational tool arguments, never the raw task description or evaluation criteria.
- Reference (default): your agent uses our hosted protein-design-mcp endpoint. Eligible for the reference ranking.
- Custom: provide your own public MCP URL implementing the same 17-tool schema. Useful for benchmarking new tool implementations against an identical model under identical task prompts. Tagged with a custom badge.
Submission status
Check your submission status or manage the pipeline (admin only).
What is BioDesignBench?
BioDesignBench is a benchmark for evaluating LLM agents as orchestrators of multi-step stochastic protein-design pipelines. Unlike chemistry- or code-agent benchmarks, where tool chains are largely deterministic, protein design demands repeated sampling from generative tools (RFdiffusion, ProteinMPNN) and iterative cross-validation through several biophysical metrics. We test the full agentic loop — plan → sample → evaluate across multiple metrics → iterate — over 76 expert-curated tasks drawn from 2024–2026 literature, exposed through 17 MCP-integrated tools.
(2 approaches × 5 subjects)
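As a rough illustration of that plan → sample → evaluate → iterate loop, a minimal sketch in Python; the `llm`/`mcp` client interfaces, tool names, and acceptance thresholds are placeholders rather than the benchmark harness or the real 17-tool schema:

```python
# Minimal sketch of the plan -> sample -> evaluate -> iterate loop.
# `llm`, `mcp`, the tool names, and the pLDDT/pTM thresholds are placeholders.

def run_design_task(task_prompt, llm, mcp, max_rounds=5, n_candidates=8):
    plan = llm.plan(task_prompt)                      # plan
    accepted = []
    for _ in range(max_rounds):
        # stochastic sampling: repeated calls to a generative backbone tool
        backbones = [mcp.call("generate_backbone", spec=plan.spec)
                     for _ in range(n_candidates)]
        for backbone in backbones:
            sequence = mcp.call("design_sequence", backbone=backbone)
            # evaluate each candidate across several biophysical metrics
            metrics = mcp.call("predict_structure", sequence=sequence)
            if metrics["plddt"] > 80 and metrics["ptm"] > 0.7:
                accepted.append((sequence, metrics))
        plan = llm.revise(plan, accepted)             # iterate on survivors
        if len(accepted) >= plan.n_required:
            break
    return accepted
```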
Three principal findings
1. Top-tier agents now beat a deterministic pipeline
DeepSeek V3 and GPT-5 surpass a hand-engineered hardcoded pipeline (54.2) under both modes. Autonomous protein-design orchestration is no longer infeasible — but a substantial gap to the human expert (61.2) and oracle (74.8) remains.
2. Coverage–depth dissociation
Workflow guidance closes the coverage gap (Rescue Index up to +3.01) but leaves utilisation depth unchanged (Rescue Index ≈ 0). Better tool documentation can teach agents which tools to call, but cannot teach them to call those tools with the iterative depth that expert practice demands.
3. Evaluation depth, not tool knowledge, is the bottleneck
Across 836 task–condition observations, evaluation depth per candidate correlates with total score at ρ = 0.685 (p < 10⁻¹¹⁷). LLM agents generate backbone candidates at expert-level rates but evaluate each one at only 14% of expert depth. Forced-depth interventions confirm this is causal — see the Depth Gap tab.
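Assuming the reported ρ is a Spearman rank correlation over per-run logs, it could be recomputed with a sketch like the one below (the observation field names are hypothetical):

```python
from scipy.stats import spearmanr

def depth_score_correlation(observations):
    """observations: one record per task-condition run (836 in the paper),
    with hypothetical keys 'eval_calls_per_candidate' and 'total_score'."""
    depths = [o["eval_calls_per_candidate"] for o in observations]
    scores = [o["total_score"] for o in observations]
    return spearmanr(depths, scores)  # reported: rho = 0.685, p < 1e-117
```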
How to submit
Unlike most agent benchmarks, you do not host an HTTP endpoint. The 76 task descriptions never leave Romero Lab infrastructure. Instead you provide:
- an LLM provider + API key (Anthropic / OpenAI / Google / DeepSeek). We run the BioDesignBench agent loop against your chosen model inside the leaderboard backend. Your key is scrubbed from our records immediately after the dispatch phase completes.
- optionally, a custom MCP URL if you want to evaluate your own tool implementations. Otherwise, the agent calls our reference protein-design-mcp endpoint (in progress).
Data flow
Each task prompt is sent to your chosen LLM provider via their standard API (Anthropic, OpenAI, Google, DeepSeek) — that single channel is the only path by which task data leaves Romero Lab. The MCP server (reference or custom) only ever sees operational tool arguments (sequences, PDB paths, hotspot residues); it never sees the raw task prompt or evaluation criteria. Every task prompt also carries a unique 16-character canary token as an HTML comment, for retrospective leakage detection.
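A sketch of what this canary mechanism could look like; the token format, comment syntax, and leak check are assumptions rather than the lab's actual implementation:

```python
import secrets

def make_canary():
    """Generate a 16-character (8-byte hex) canary token."""
    return secrets.token_hex(8)

def embed_canary(task_prompt, canary):
    """Append the canary as an HTML comment, invisible in rendered views."""
    return f"{task_prompt}\n<!-- canary:{canary} -->"

def leaked_canaries(model_output, canaries):
    """Retrospective check: which canary tokens does a model reproduce?"""
    return {c for c in canaries if c in model_output}
```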
Bring your own tools (Custom MCP)
If you want to benchmark a new tool implementation (a faster structure predictor, a different diffusion backbone, your own stability model) against the same 76 tasks and rubric, stand up an HTTPS endpoint that satisfies the MCP contract and paste the URL into the submission form's Advanced: Custom MCP section:
- Contract + hosting options: leaderboard README
- Minimal FastAPI stub (~150 lines): example_mcp_server.py
- Reference implementation to fork: RomeroLab/protein-design-mcp
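For orientation, a far shorter sketch of such an endpoint than the ~150-line stub; route shapes, payload fields, and tool names are assumptions, and the authoritative contract lives in the leaderboard README:

```python
# Hypothetical minimal custom MCP endpoint. Route shapes, payload fields, and
# tool names are illustrative; see the leaderboard README for the real contract.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Map tool names to your own implementations (e.g. a faster structure predictor).
my_tool_registry = {
    "generate_backbone": lambda **kwargs: {"pdb_path": "/tmp/backbone.pdb"},
    # ... the remaining tools in the 17-tool schema
}

class ToolCall(BaseModel):
    tool: str        # one of the 17 tool names
    arguments: dict  # operational arguments only (sequences, PDB paths, ...)

@app.get("/tools")
def list_tools():
    """Advertise available tools so the agent loop can discover capabilities."""
    return {"tools": sorted(my_tool_registry)}

@app.post("/call")
def call_tool(req: ToolCall):
    """Dispatch a single tool call to the local implementation."""
    result = my_tool_registry[req.tool](**req.arguments)
    return {"result": result}
```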
Limits
- Maximum 1 submission per calendar month per organization
- 73 hidden tasks are used for ranking; 3 public example tasks are available for development
- LLM-judge API costs are paid by Romero Lab; your own agent LLM calls are billed to your provider
Scoring rubric (100 points, hybrid)
Scores combine 72 algorithmic points from deterministic biophysical metrics with 28 LLM-judge points assessed by a 3-judge panel (PoLL) with self-exclusion to mitigate self-preference bias. Each component is capped at its rubric maximum to prevent double counting.
Approach (20 pts) — strategic appropriateness of tool selection across 10 functional categories (backbone generation, inverse folding, structure prediction, etc.).
Orchestration (15 pts) — pipeline ordering, intermediate validation, and adaptive iteration.
Quality (35 pts) — 100% algorithmic. Continuous 4-band interpolation over Boltz-2 re-prediction metrics (pLDDT, pTM, ipTM, i_pAE), eliminating LLM judgement variance on biophysical quantities; a sketch of the interpolation follows this list.
Feasibility (15 pts) — valid amino acids, length constraints, composition, and biophysical plausibility.
Novelty (5 pts) — sequence identity to reference (lower identity = more novel).
Diversity (10 pts) — number and pairwise diversity of generated designs.
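To illustrate the Quality mechanism referenced above, a sketch of 4-band piecewise-linear interpolation over one re-prediction metric, with made-up band edges and point allocations; the published rubric constants and per-metric split of the 35 Quality points are not reproduced here:

```python
import numpy as np

# Made-up band edges for pLDDT and the points awarded at each edge; this only
# illustrates 4-band piecewise-linear interpolation, not the real rubric values.
PLDDT_EDGES = [50.0, 70.0, 80.0, 90.0, 100.0]
PLDDT_POINTS = [0.0, 3.0, 6.0, 8.0, 10.0]

def band_score(value, edges=PLDDT_EDGES, points=PLDDT_POINTS):
    """Piecewise-linear interpolation, clipped to the first/last band value."""
    return float(np.interp(value, edges, points))

# e.g. band_score(85.0) -> 7.0 (halfway through the 80-90 band)
```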
Five-layer contamination defense
Every evaluated LLM may have read protein-design literature during pretraining, so we use a layered defense:
- All 76 tasks derived from publications dated 2024–2026, post-dating model training cutoffs.
- Task prompts paraphrased and restructured — no verbatim passages from source literature.
- Targets specified by biological function and structural constraints, not by name or PDB identifier.
- 12 decoy tasks with deliberately fabricated targets to detect memorisation-based responses.
- n-gram overlap analysis between agent outputs and source publications — no verbatim regurgitation above the 8-gram threshold across any condition.
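A sketch of the kind of 8-gram overlap check described in the last item; the tokenisation and helper names are assumptions, and the benchmark's exact procedure may differ:

```python
import re

def word_ngrams(text, n=8):
    """Lowercased word n-grams; n=8 matches the leakage threshold above."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(agent_output, source_publication, n=8):
    """Shared n-grams between an agent output and a source publication;
    an empty set means no verbatim regurgitation at this threshold."""
    return word_ngrams(agent_output, n) & word_ngrams(source_publication, n)
```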
Citation
@article{biodesignbench2026,
  title={Evaluating LLM-Driven Protein Design: Agents Lack Iterative Evaluation Depth},
  author={Kim, Jeonghyeon and Romero, Philip},
  year={2026}
}