Romero Lab · Duke University

🧬 BioDesignBench

Can LLM agents orchestrate stochastic protein-design pipelines?

Top-tier agents now surpass a deterministic hardcoded pipeline — but invoke evaluation tools at only 14% of expert intensity. Guidance closes the coverage gap, not the evaluation-depth gap.

76 tasks · 5 molecular families · 17 MCP tools · 11 conditions · Updated 2026-04-14
Finding 1
Top-tier LLM agents (DeepSeek V3, GPT-5) now surpass the deterministic hardcoded pipeline.
Finding 2
All agents show a critical evaluation-depth gap: they invoke evaluation tools at only ~14% of expert intensity.
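As a back-of-the-envelope illustration of how an intensity figure like this is derived, the ratio of per-candidate evaluation-tool calls can be computed directly. The call counts below are hypothetical placeholders, not benchmark data:

```python
# Hypothetical evaluation-tool call counts per designed candidate.
# Neither number comes from the benchmark; they only illustrate the ratio.
agent_calls_per_candidate = 1.2
expert_calls_per_candidate = 8.6

# Intensity = agent's evaluation effort relative to the expert's.
intensity = agent_calls_per_candidate / expert_calls_per_candidate
print(f"{intensity:.0%}")  # prints "14%"
```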
Finding 3
Workflow guidance closes the coverage gap but leaves the evaluation-depth gap unchanged.
Finding 4
Evaluation variety (distinct metric categories per candidate) predicts design quality (ρ = 0.68, p < 10⁻¹¹) beyond binary tool selection.
Finding 5
Forced-depth intervention lifts the strongest agent (DeepSeek V3) by +9.3 points on 18 tasks, while a compute-matched low-variety control hurts it (−2.3): evidence that variety, not raw compute, drives the gain.
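A rank correlation like the ρ reported in Finding 4 is a plain Spearman coefficient: rank both variables (averaging ties), then take the Pearson correlation of the ranks. A minimal pure-Python sketch, run on illustrative toy data rather than benchmark measurements:

```python
from statistics import mean

def ranks(values):
    # 1-based ranks, with tied values assigned their average rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    # Pearson correlation computed on the rank vectors.
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy data: per-candidate metric variety vs. a design-quality score.
rho = spearman_rho([1, 2, 3, 4, 5], [10, 20, 25, 40, 50])
print(round(rho, 2))  # prints 1.0
```

Any strictly monotone relationship yields ρ = 1.0 regardless of the raw scale, which is why a rank correlation suits ordinal metrics like "distinct metric categories per candidate".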
| # | Agent | Organization | Mode | MCP | Score | Tasks | Zero-Score |
|---|-------|--------------|------|-----|-------|-------|------------|
| — | 📄 Human Oracle (baseline) | RomeroLab | — | — | 74.8 | 76/76 | 0 |
| — | 👨‍🔬 Human Expert (baseline) | RomeroLab | — | — | 61.2 | 76/76 | 0 |
| 1 | DeepSeek V3 | RomeroLab | unguided | reference | 60.4 | 76/76 | 1 |
| 2 | DeepSeek V3 | RomeroLab | guided | reference | 58.5 | 76/76 | 7 |
| 3 | GPT-5 | RomeroLab | unguided | reference | 55.6 | 76/76 | 2 |
| 4 | GPT-5 | RomeroLab | guided | reference | 55.3 | 76/76 | 4 |
| — | 🔧 Hardcoded Pipeline (baseline) | RomeroLab | — | — | 54.2 | 76/76 | 0 |
| 5 | Claude Sonnet 4.5 | RomeroLab | guided | reference | 50.2 | 76/76 | 16 |
| 6 | Claude Sonnet 4.5 | RomeroLab | unguided | reference | 41.2 | 76/76 | 23 |
| 7 | Gemini 2.5 Pro | RomeroLab | guided | reference | 8.8 | 76/76 | 74 |
| 8 | Gemini 2.5 Pro | RomeroLab | unguided | reference | 8.1 | 76/76 | 75 |