OpenAI unveils GeneBench-Pro to measure AI performance in computational biology

OpenAI has introduced GeneBench-Pro, a new benchmark designed to assess whether artificial intelligence models can perform the complex reasoning and judgement required in computational biology research.

The benchmark comprises 129 problems spanning genomics, quantitative biology and translational medicine. Rather than testing simple factual recall, each task presents AI models with a dataset, experimental context and a research question, requiring them to analyse the data, select an appropriate methodology and produce a final conclusion.

To validate the benchmark, OpenAI submitted 82 of the 129 problems to external specialists, including graduate students, postdoctoral researchers, industry scientists and university professors. The reviewers evaluated whether the problems reflected realistic biological research and whether the intended answers could be reliably identified.

Alexander Strudwick Young, an assistant professor of human genetics at UCLA, said the benchmark would challenge even well-trained researchers, noting that many of the questions would have been difficult for a graduate student to complete without guidance from an experienced supervisor.

Unlike many existing AI evaluations, every GeneBench-Pro problem is generated synthetically. OpenAI controls the entire data-generation process, allowing it to verify answers against known ground truth while accommodating reasonable differences in analytical methods that arrive at valid conclusions.

According to the company, its GPT-5.6 Sol model achieved a pass rate of 28.7% at the highest reasoning setting, rising to 31.5% when Pro mode was enabled. OpenAI said this represents a significant improvement over the original GeneBench, where GPT-5 scored below 5% during the benchmark’s development. At the lowest reasoning setting, GPT-5.6 Sol recorded only a single-digit pass rate.

Among competing models, Anthropic’s Opus 4.8 achieved a pass rate of 16.0%, while Google’s Gemini 3.5 Flash scored 8.1%. Gemini 3.1 Pro recorded 3.1%, DeepSeek V4 Pro achieved 2.4%, GLM 5.2 reached 4.6%, and xAI’s Grok 4.3 scored 1.5%.

OpenAI said reviewers estimated that a typical GeneBench-Pro task would require between 20 and 40 hours of work by a human expert. Based on an estimated labour cost of 200 US dollars per hour, completing a single problem could cost roughly 4,000 to 8,000 US dollars, compared with AI inference costs of only a few dollars per task.
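
The cost comparison can be reproduced with simple arithmetic. The short Python sketch below uses the hours and hourly rate quoted above; the inference cost of 5 US dollars per task is an assumed placeholder, since the article gives only "a few dollars", and the function name is illustrative rather than anything OpenAI has published.

```python
# Figures quoted in the article.
HOURS_LOW, HOURS_HIGH = 20, 40   # estimated expert hours per task
RATE_USD_PER_HOUR = 200          # estimated labour cost per hour

# Assumption for illustration only: the article says "a few dollars" per task.
ASSUMED_INFERENCE_COST_USD = 5.0


def expert_cost_range(hours_low, hours_high, rate):
    """Return the (low, high) human labour cost for a single task."""
    return hours_low * rate, hours_high * rate


low, high = expert_cost_range(HOURS_LOW, HOURS_HIGH, RATE_USD_PER_HOUR)
print(f"Human expert cost per task: ${low:,} to ${high:,}")

# Ratio of human cost to the assumed inference cost.
ratio_low = low / ASSUMED_INFERENCE_COST_USD
ratio_high = high / ASSUMED_INFERENCE_COST_USD
print(f"Roughly {ratio_low:,.0f}x to {ratio_high:,.0f}x the assumed inference cost")
```

The resulting ratio depends entirely on the assumed inference cost, so it should be read as an order-of-magnitude illustration rather than a reported figure.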

To encourage broader evaluation of advanced AI systems, OpenAI said it will open-source 10 representative GeneBench-Pro questions on Hugging Face and provide a separate 50-question subset to Artificial Analysis for independent benchmarking.
