Romero Lab · Duke University

🧬 BioDesignBench

Can LLM agents orchestrate stochastic protein-design pipelines?

Top-tier agents now surpass a deterministic pipeline — but invoke evaluation tools at only 14% of expert depth. Guidance rescues coverage, not depth.

76 tasks · 5 molecular families · 17 MCP tools · 11 conditions · Updated 2026-04-14
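Agents interact with the benchmark's design and evaluation tools over MCP. As a point of reference, here is a minimal sketch of an agent-side tool call using the official `mcp` Python SDK; the server module (`biodesign_mcp_server`) and tool name (`evaluate_stability`) are hypothetical placeholders, not the benchmark's actual identifiers.

```python
# Minimal sketch of an agent-side MCP tool call using the official `mcp`
# Python SDK. The server command and tool name below are hypothetical.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch a (hypothetical) benchmark tool server over stdio.
    params = StdioServerParameters(command="python", args=["-m", "biodesign_mcp_server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # the benchmark exposes 17 such tools
            print([t.name for t in tools.tools])
            # Invoke one evaluation tool; name and arguments are illustrative.
            result = await session.call_tool(
                "evaluate_stability", arguments={"sequence": "MKTAYIAKQR"}
            )
            print(result.content)


asyncio.run(main())
```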
Finding 1
Top-tier LLM agents (DeepSeek V3, GPT-5) now surpass a deterministic hardcoded pipeline.
Finding 2
All agents show a critical evaluation-depth gap: they invoke evaluation tools at only 14% of expert frequency (measurement sketched below the findings list).
Finding 3
Workflow guidance rescues tool coverage (Rescue Index up to +3.01) but not utilisation depth (Rescue Index ≈ 0); an illustrative sketch follows the findings list.
Finding 4
Evaluation depth predicts design quality (ρ = 0.685, p < 10⁻¹¹) beyond binary tool selection.
Finding 5
Forced-depth intervention lifts the strongest agent (DeepSeek V3) by +9.3 points on 18 tasks, while a low-diversity control hurts it (−2.3): evidence that depth, not process change alone, drives the gain.
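To make Findings 2 and 4 concrete, the sketch below shows how evaluation depth and its rank correlation with design quality could be computed. The trajectory schema and the evaluation-tool names (`evaluate_stability`, `score_binding`, `check_developability`) are assumptions for illustration, not the benchmark's actual format.

```python
# Sketch for Findings 2 and 4: evaluation depth and its rank correlation with
# design quality. The trajectory schema and tool names below are assumed.
from scipy.stats import spearmanr

# Hypothetical names standing in for the benchmark's evaluation tools.
EVAL_TOOLS = {"evaluate_stability", "score_binding", "check_developability"}


def eval_depth(trajectory: list[dict]) -> int:
    """Number of evaluation-tool invocations in one agent trajectory."""
    return sum(1 for call in trajectory if call["tool"] in EVAL_TOOLS)


def depth_ratio(agent_runs, expert_runs) -> float:
    """Mean agent evaluation depth as a fraction of mean expert depth.

    Finding 2 reports this ratio at roughly 0.14 across agents.
    """
    agent = sum(eval_depth(t) for t, _ in agent_runs) / len(agent_runs)
    expert = sum(eval_depth(t) for t, _ in expert_runs) / len(expert_runs)
    return agent / expert


def depth_quality_correlation(runs) -> tuple[float, float]:
    """Spearman rho and p-value between evaluation depth and design score.

    Finding 4 reports rho = 0.685 for this kind of pairing; `runs` is a list
    of (trajectory, score) pairs, one per task attempt.
    """
    depths = [eval_depth(traj) for traj, score in runs]
    scores = [score for traj, score in runs]
    return spearmanr(depths, scores)
```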
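Finding 3's Rescue Index is not defined on this page. The sketch below assumes a normalized-gain form (improvement under guidance divided by the unguided shortfall relative to the expert), purely for illustration and with made-up inputs; the benchmark's actual definition may differ.

```python
# Finding 3 sketch. The page does not define the Rescue Index; this assumes a
# normalized-gain form, which may differ from the benchmark's definition.
def rescue_index(unguided: float, guided: float, expert: float) -> float:
    """Fraction of the unguided-to-expert gap closed by workflow guidance."""
    shortfall = expert - unguided
    if shortfall <= 0:
        return 0.0
    return (guided - unguided) / shortfall


# Illustrative (made-up) numbers: coverage can be rescued past the expert
# level (index > 1), while utilisation depth barely moves (index near 0).
print(rescue_index(unguided=0.40, guided=0.90, expert=0.60))  # 2.5
print(rescue_index(unguided=0.14, guided=0.15, expert=1.00))  # ~0.01
```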
| # | Agent | Organization | Mode | MCP | Score | Tasks | Zero-Score |
|---|-------|--------------|------|-----|------:|-------|-----------:|
| – | 📄 Human Oracle (baseline) | Romero Lab | – | – | 74.8 | 76/76 | 0 |
| – | 👨‍🔬 Human Expert (baseline) | Romero Lab | – | – | 61.2 | 76/76 | 0 |
| 1 | DeepSeek V3 | DeepSeek | benchmark | reference | 60.4 | 76/76 | 1 |
| 2 | DeepSeek V3 | DeepSeek | user | reference | 58.5 | 76/76 | 7 |
| 3 | GPT-5 | OpenAI | benchmark | reference | 55.6 | 76/76 | 2 |
| 4 | GPT-5 | OpenAI | user | reference | 55.3 | 76/76 | 4 |
| – | 🔧 Hardcoded Pipeline (baseline) | Deterministic | – | – | 54.2 | 76/76 | 0 |
| 5 | Claude Sonnet 4.5 | Anthropic | user | reference | 50.2 | 76/76 | 16 |
| 6 | Claude Sonnet 4.5 | Anthropic | benchmark | reference | 41.2 | 76/76 | 23 |
| 7 | Gemini 2.5 Pro | Google | user | reference | 8.8 | 76/76 | 74 |
| 8 | Gemini 2.5 Pro | Google | benchmark | reference | 8.1 | 76/76 | 75 |