Validating scientific hypotheses is a central challenge in biomedical research, and remains difficult for artificial intelligence (AI) agents due to the complexity of real-world data analysis and evidence interpretation. In this work, we present BioDSA-1K, a benchmark designed to evaluate AI agents on realistic, data-driven biomedical hypothesis validation tasks. BioDSA-1K consists of 1,029 hypothesis-centric tasks paired with 1,177 analysis plans, curated from over 300 published biomedical studies to reflect the structure and reasoning found in authentic research workflows. Each task includes a structured hypothesis derived from the original study's conclusions, expressed in the affirmative to reflect the language of scientific reporting, and one or more pieces of supporting evidence grounded in empirical data tables. While these hypotheses mirror published claims, they remain testable using standard statistical or machine learning methods. The benchmark enables evaluation along four axes: (1) hypothesis decision accuracy, (2) alignment between evidence and conclusion, (3) correctness of the reasoning process, and (4) executability of the AI-generated analysis code. Importantly, BioDSA-1K includes non-verifiable hypotheses: cases where the available data are insufficient to support or refute a claim, reflecting a common yet underexplored scenario in real-world science. We propose BioDSA-1K as a foundation for building and evaluating generalizable, trustworthy AI agents for biomedical discovery.
Below are examples of hypotheses and analyses in BioDSA-1K, extracted from published biomedical studies.
{
"PMID": 38995739,
"Title": "A Phase II Study Assessing Long-term Response to Ibrutinib Monotherapy in Recurrent or Refractory CNS Lymphoma.",
"Abstract": "Ibrutinib is a first-in-class inhibitor of Bruton tyrosine kinase. We previously reported the safety and short-term antitumor activity of ibrutinib in 20 patients with relapsed or refractory (r/r) primary central nervous system (CNS) lymphoma (PCNSL) or secondary CNS lymphoma (SCNSL). We enrolled 26 additional patients with r/r PCNSL/SCNSL into the dose-expansion cohort of the trial into a combined cohort of 46 patients (31 with PCNSL and 15 with SCNSL). Patients received ibrutinib at 560 or 840 mg daily in the dose-escalation cohort and ibrutinib at 840 mg daily in the expansion cohort. The median follow-up was 49.9 and 62.1 months for patients with PCNSL and SCNSL, respectively. We sequenced DNA from available tumor biopsies and cerebrospinal fluid collected before and during ibrutinib therapy. Tumor responses were observed in 23/31 (74%) patients with PCNSL and 9/15 (60%) patients with SCNSL, including 12 complete responses in PCNSL and 7 in SCNSL. The median progression-free survival (PFS) for PCNSL was 4.5 months [95% confidence interval (CI), 2.8-9.2] with 1-year PFS at 23.7% (95% CI, 12.4%-45.1%). The median duration of response in the 23 PCNSL responders was 5.5 months. The median PFS in SCNSL was 5.3 months (95% CI, 1.3-14.5) with a median duration of response of 8.7 months for the 9 responders. Exploratory biomarker analysis suggests that mutations in TBL1XR1 may be associated with a long-term response to ibrutinib in PCNSL (P = 0.0075). Clearance of ctDNA from cerebrospinal fluid was associated with complete and long-term ibrutinib responses. Our study confirms single-agent activity of ibrutinib in r/r CNS lymphoma and identifies molecular determinants of response based on long-term follow-up.",
"Results": "",
"dataset_ids": [
"pcnsl_msk_2024"
],
"hypotheses": [
{
"hypothesis": "Ibrutinib monotherapy is effective in achieving tumor response in patients with relapsed or refractory primary central nervous system lymphoma (PCNSL).",
"wrong_hypothesis": "Ibrutinib monotherapy is not effective in achieving tumor response in patients with relapsed or refractory primary central nervous system lymphoma (PCNSL).",
"supporting_evidences": [
{
"analysis_plan": "Calculate the proportion of patients with PCNSL who achieved a tumor response after ibrutinib treatment.",
"evidence": "Tumor responses were observed in 23 out of 31 patients with PCNSL.",
"analysis_variables": [
"Patient_ID",
"PCNSL_Status",
"Tumor_Response"
],
"result_variable": "Proportion of responders",
"result_variable_value": 0.74
},
{
"analysis_plan": "Perform a survival analysis to determine the median progression-free survival (PFS) for PCNSL patients.",
"evidence": "The median PFS for PCNSL was 4.5 months.",
"analysis_variables": [
"Patient_ID",
"PCNSL_Status",
"Progression_Free_Survival"
],
"result_variable": "Median PFS",
"result_variable_value": 4.5
}
]
},
{
"hypothesis": "Mutations in TBL1XR1 are associated with a long-term response to ibrutinib in PCNSL patients.",
"wrong_hypothesis": "Mutations in TBL1XR1 are not associated with a long-term response to ibrutinib in PCNSL patients.",
"supporting_evidences": [
{
"analysis_plan": "Perform a statistical test to assess the association between TBL1XR1 mutations and long-term response to ibrutinib.",
"evidence": "Mutations in TBL1XR1 may be associated with a long-term response to ibrutinib in PCNSL (P = 0.0075).",
"analysis_variables": [
"Patient_ID",
"TBL1XR1_Mutation_Status",
"Long_Term_Response"
],
"result_variable": "P-value",
"result_variable_value": 0.0075
}
]
},
{
"hypothesis": "Clearance of ctDNA from cerebrospinal fluid is associated with complete and long-term ibrutinib responses.",
"wrong_hypothesis": "Clearance of ctDNA from cerebrospinal fluid is not associated with complete and long-term ibrutinib responses.",
"supporting_evidences": [
{
"analysis_plan": "Analyze the correlation between ctDNA clearance and complete/long-term response to ibrutinib.",
"evidence": "Clearance of ctDNA from cerebrospinal fluid was associated with complete and long-term ibrutinib responses.",
"analysis_variables": [
"Patient_ID",
"ctDNA_Clearance_Status",
"Complete_Response",
"Long_Term_Response"
],
"result_variable": "Association",
"result_variable_value": "Positive"
}
]
}
]
}
{
"PMID": 39214094,
"Title": "Distinct clinical outcomes and biological features of specific KRAS mutants in human pancreatic cancer.",
"Abstract": "KRAS mutations in pancreatic ductal adenocarcinoma (PDAC) are suggested to vary in oncogenicity but the implications for human patients have not been explored in depth. We examined 1,360 consecutive PDAC patients undergoing surgical resection and find that KRASG12R mutations are enriched in early-stage (stage I) disease, owing not to smaller tumor size but increased node-negativity. KRASG12R tumors are associated with decreased distant recurrence and improved survival as compared to KRASG12D. To understand the biological underpinnings, we performed spatial profiling of 20 patients and bulk RNA-sequencing of 100 tumors, finding enhanced oncogenic signaling and epithelial-mesenchymal transition (EMT) in KRASG12D and increased nuclear factor \u03baB (NF-\u03baB) signaling in KRASG12R tumors. Orthogonal studies of mouse KrasG12R PDAC organoids show decreased migration and improved survival in orthotopic models. KRAS alterations in PDAC are thus associated with distinct presentation, clinical outcomes, and biological behavior, highlighting the prognostic value of mutational analysis and the importance of articulating mutation-specific PDAC biology.",
"Results": "",
"dataset_ids": [
"pancreas_msk_2024"
],
"hypotheses": [
{
"hypothesis": "KRASG12R mutations are associated with improved survival compared to KRASG12D mutations in PDAC patients.",
"wrong_hypothesis": "KRASG12R mutations are associated with worse survival compared to KRASG12D mutations in PDAC patients.",
"supporting_evidences": [
{
"analysis_plan": "Perform a survival analysis comparing overall survival between patients with KRASG12R and KRASG12D mutations.",
"evidence": "KRASG12R tumors are associated with improved survival compared to KRASG12D.",
"analysis_variables": [
"KRAS_mutation_type",
"overall_survival_time",
"survival_status"
],
"result_variable": "hazard_ratio",
"result_variable_value": "<1 (indicating improved survival for KRASG12R)"
}
]
},
{
"hypothesis": "KRASG12R mutations are enriched in early-stage (stage I) PDAC compared to other KRAS mutations.",
"wrong_hypothesis": "KRASG12R mutations are not enriched in early-stage (stage I) PDAC compared to other KRAS mutations.",
"supporting_evidences": [
{
"analysis_plan": "Calculate the proportion of KRASG12R mutations in stage I PDAC and compare it to the proportion of other KRAS mutations in the same stage.",
"evidence": "KRASG12R mutations are enriched in early-stage (stage I) disease.",
"analysis_variables": [
"KRAS_mutation_type",
"cancer_stage"
],
"result_variable": "proportion",
"result_variable_value": "Higher proportion of KRASG12R in stage I compared to other mutations"
}
]
},
{
"hypothesis": "KRASG12R tumors have increased node-negativity compared to KRASG12D tumors.",
"wrong_hypothesis": "KRASG12R tumors have decreased node-negativity compared to KRASG12D tumors.",
"supporting_evidences": [
{
"analysis_plan": "Compare the rate of node-negativity between KRASG12R and KRASG12D tumors.",
"evidence": "KRASG12R mutations are associated with increased node-negativity.",
"analysis_variables": [
"KRAS_mutation_type",
"node_status"
],
"result_variable": "node_negativity_rate",
"result_variable_value": "Higher node-negativity rate for KRASG12R"
}
]
},
{
"hypothesis": "KRASG12D tumors exhibit enhanced epithelial-mesenchymal transition (EMT) compared to KRASG12R tumors.",
"wrong_hypothesis": "KRASG12D tumors exhibit reduced epithelial-mesenchymal transition (EMT) compared to KRASG12R tumors.",
"supporting_evidences": [
{
"analysis_plan": "Perform RNA-sequencing analysis to compare EMT-related gene expression between KRASG12D and KRASG12R tumors.",
"evidence": "Enhanced EMT in KRASG12D tumors.",
"analysis_variables": [
"KRAS_mutation_type",
"EMT_gene_expression"
],
"result_variable": "EMT_score",
"result_variable_value": "Higher EMT score for KRASG12D"
}
]
},
{
"hypothesis": "KRASG12R tumors show increased nuclear factor \u03baB (NF-\u03baB) signaling compared to KRASG12D tumors.",
"wrong_hypothesis": "KRASG12R tumors show decreased nuclear factor \u03baB (NF-\u03baB) signaling compared to KRASG12D tumors.",
"supporting_evidences": [
{
"analysis_plan": "Analyze NF-\u03baB signaling pathway activity using RNA-sequencing data from KRASG12R and KRASG12D tumors.",
"evidence": "Increased NF-\u03baB signaling in KRASG12R tumors.",
"analysis_variables": [
"KRAS_mutation_type",
"NF-\u03baB_pathway_activity"
],
"result_variable": "NF-\u03baB_activity_score",
"result_variable_value": "Higher NF-\u03baB activity score for KRASG12R"
}
]
}
]
}
{
"PMID": 39506116,
"Title": "Automated real-world data integration improves cancer outcome prediction.",
"Abstract": "The digitization of health records and growing availability of tumour DNA sequencing provide an opportunity to study the determinants of cancer outcomes with unprecedented richness. Patient data are often stored in unstructured text and siloed datasets. Here we combine natural language processing annotations1,2 with structured medication, patient-reported demographic, tumour registry and tumour genomic data from 24,950 patients\u00a0at Memorial Sloan Kettering\u00a0Cancer Center to\u00a0generate a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD). MSK-CHORD includes data for non-small-cell lung (n\u2009=\u20097,809), breast (n\u2009=\u20095,368), colorectal (n\u2009=\u20095,543), prostate (n\u2009=\u20093,211) and pancreatic (n\u2009=\u20093,109) cancers and enables discovery of clinicogenomic relationships not apparent in smaller datasets. Leveraging MSK-CHORD to train machine learning models to predict overall survival, we find that models including features derived from natural language processing, such as sites of disease, outperform those based on genomic data or stage alone as tested by cross-validation and an external, multi-institution dataset. By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma corroborated in independent datasets. We demonstrate the feasibility of automated annotation from unstructured notes and its utility in predicting patient outcomes. The resulting data are provided as a public resource for real-world oncologic research.",
"Results": "",
"dataset_ids": [
"msk_chord_2024"
],
"hypotheses": [
{
"hypothesis": "Models including features derived from natural language processing outperform those based on genomic data or stage alone in predicting overall survival.",
"wrong_hypothesis": "Models based solely on genomic data outperform those including features derived from natural language processing in predicting overall survival.",
"supporting_evidences": [
{
"analysis_plan": "Train machine learning models using features derived from natural language processing and compare their performance to models using genomic data or stage alone through cross-validation.",
"evidence": "Models with NLP features showed higher predictive accuracy for overall survival compared to genomic data or stage alone.",
"analysis_variables": [
"NLP_features",
"genomic_data",
"cancer_stage"
],
"result_variable": "predictive_accuracy",
"result_variable_value": "Higher accuracy for NLP features model"
},
{
"analysis_plan": "Validate the performance of NLP-based models using an external, multi-institution dataset.",
"evidence": "NLP-based models maintained superior performance in external validation.",
"analysis_variables": [
"NLP_features",
"external_dataset"
],
"result_variable": "validation_accuracy",
"result_variable_value": "Superior performance in external dataset"
}
]
},
{
"hypothesis": "SETD2 mutation is associated with lower metastatic potential in immunotherapy-treated lung adenocarcinoma.",
"wrong_hypothesis": "SETD2 mutation is associated with higher metastatic potential in immunotherapy-treated lung adenocarcinoma.",
"supporting_evidences": [
{
"analysis_plan": "Analyze the correlation between SETD2 mutation status and metastatic potential in lung adenocarcinoma patients treated with immunotherapy.",
"evidence": "SETD2 mutation correlates with lower rates of metastasis.",
"analysis_variables": [
"SETD2_mutation_status",
"metastatic_potential"
],
"result_variable": "correlation_coefficient",
"result_variable_value": "Negative correlation"
},
{
"analysis_plan": "Corroborate findings using independent datasets to confirm the relationship between SETD2 mutation and metastasis.",
"evidence": "Independent datasets confirm lower metastatic potential in SETD2 mutated cases.",
"analysis_variables": [
"SETD2_mutation_status",
"independent_datasets"
],
"result_variable": "confirmation_rate",
"result_variable_value": "Consistent findings across datasets"
}
]
}
]
}
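To make the record layout concrete, here is a minimal sketch of how one such record could be flattened into individual (hypothesis, analysis plan) tasks. The field names are taken from the examples above; the helper name `iter_tasks`, the output keys, and the trimmed inline record are illustrative assumptions, not part of any released BioDSA-1K tooling.

```python
import json

# Trimmed copy of the first example record; only the fields used below are kept.
record_json = """
{
  "PMID": 38995739,
  "dataset_ids": ["pcnsl_msk_2024"],
  "hypotheses": [
    {
      "hypothesis": "Ibrutinib monotherapy is effective in achieving tumor response in patients with relapsed or refractory primary central nervous system lymphoma (PCNSL).",
      "supporting_evidences": [
        {
          "analysis_plan": "Calculate the proportion of patients with PCNSL who achieved a tumor response after ibrutinib treatment.",
          "result_variable": "Proportion of responders",
          "result_variable_value": 0.74
        },
        {
          "analysis_plan": "Perform a survival analysis to determine the median progression-free survival (PFS) for PCNSL patients.",
          "result_variable": "Median PFS",
          "result_variable_value": 4.5
        }
      ]
    }
  ]
}
"""


def iter_tasks(record):
    """Yield one flat dict per (hypothesis, supporting evidence) pair.

    Hypothetical helper: the output key names are chosen for illustration.
    """
    for hyp in record["hypotheses"]:
        for ev in hyp["supporting_evidences"]:
            yield {
                "pmid": record["PMID"],
                "dataset_ids": record["dataset_ids"],
                "hypothesis": hyp["hypothesis"],
                "analysis_plan": ev["analysis_plan"],
                "result_variable": ev["result_variable"],
                "expected_value": ev["result_variable_value"],
            }


record = json.loads(record_json)
tasks = list(iter_tasks(record))

# The first plan's expected value can be re-derived from its evidence text:
# 23 responders out of 31 PCNSL patients.
responder_proportion = round(23 / 31, 2)

for task in tasks:
    print(task["pmid"], "|", task["result_variable"], "=", task["expected_value"])
```

One hypothesis can carry several analysis plans, so flattening in this way yields more tasks than hypotheses, in line with the benchmark pairing 1,029 hypotheses with 1,177 analysis plans.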
Compared to existing benchmarks, which often involve simpler, smaller, or less diverse datasets, our benchmark presents a significantly more challenging and realistic setting for evaluating AI agents on biomedical data science tasks.
AI agents tend to be conservative in hypothesis validation, missing true findings (Type II errors) more often than they report false positives (Type I errors). On Biomarkers tasks, for example, CodeGen shows a Type II error rate of 0.164 compared with a Type I error rate of 0.090. Reasoning-enhanced versions of both ReAct and CodeGen consistently outperform their base counterparts, and ReAct-based methods generally achieve better results than CodeGen, particularly in complex domains such as Genomics, where reasoning augmentation reduced Type II errors from 0.191 to 0.107.
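The two error types can be illustrated with a small sketch. It assumes that a Type I error is an agent accepting a hypothesis whose ground truth is false, that a Type II error is an agent rejecting a hypothesis whose ground truth is true, and that each rate is normalized by the number of hypotheses in the corresponding ground-truth class. The exact normalization used for the reported numbers is not stated here, so treat this as one plausible definition. The function name `error_rates` and the toy labels are hypothetical and are not benchmark data.

```python
from collections import Counter


def error_rates(truths, decisions):
    """Return (type_i_rate, type_ii_rate) for paired boolean labels.

    truths[i]    -- True if hypothesis i is supported by the data.
    decisions[i] -- True if the agent judged hypothesis i as supported.
    Assumed normalization: false positives over truly-false hypotheses,
    missed findings over truly-true hypotheses.
    """
    if len(truths) != len(decisions):
        raise ValueError("truths and decisions must have the same length")
    counts = Counter(zip(truths, decisions))
    false_positives = counts[(False, True)]   # accepted a false hypothesis
    missed_findings = counts[(True, False)]   # rejected a true hypothesis
    n_false = sum(1 for t in truths if not t)
    n_true = sum(1 for t in truths if t)
    type_i = false_positives / n_false if n_false else 0.0
    type_ii = missed_findings / n_true if n_true else 0.0
    return type_i, type_ii


# Toy labels for illustration only (not taken from the benchmark).
truths = [True, True, True, True, False, False, False, False]
decisions = [True, True, False, False, False, False, False, True]

type_i, type_ii = error_rates(truths, decisions)
print(f"Type I rate: {type_i:.3f}, Type II rate: {type_ii:.3f}")
```

In this toy case the agent misses two of four true hypotheses and accepts one of four false ones, which mirrors the conservative pattern described above: the Type II rate exceeds the Type I rate.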
(a): When the CodeGen method produces non-executable code, it sometimes hallucinates a true or false decision for a hypothesis that should be marked non-verifiable.
(b): For non-verifiable hypotheses, all agent methods hallucinate true or false decisions, while the ReAct-based methods are more likely to make the correct decision.
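One way an evaluation harness could guard against the failure mode in (a) is to fall back to a non-verifiable decision whenever the generated analysis code does not run. The sketch below is an assumption about how such a guard might look, not the benchmark's actual harness: the function names, the three decision labels, and the convention that the generated script prints its decision on the last line of standard output are all hypothetical.

```python
import subprocess
import sys

# Hypothetical label set; the benchmark's actual label strings may differ.
VALID_DECISIONS = {"True", "False", "Not Verifiable"}


def run_analysis(code, timeout=30):
    """Run generated code in a separate Python process.

    Returns (ok, stdout), where ok is False on a non-zero exit code or timeout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return proc.returncode == 0, proc.stdout.strip()


def decide(code):
    """Map an analysis script to a decision, never guessing on failure."""
    ok, out = run_analysis(code)
    if not ok:
        # Non-executable code yields no evidence, so no true/false claim.
        return "Not Verifiable"
    lines = out.splitlines()
    last_line = lines[-1].strip() if lines else ""
    return last_line if last_line in VALID_DECISIONS else "Not Verifiable"


print(decide("print('True')"))            # script runs and reports a decision
print(decide("raise ValueError('bad')"))  # script crashes, falls back
```

Running the code in a subprocess only isolates crashes and timeouts; a real harness would also need sandboxing before executing untrusted agent output.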
@article{wang2025biodsa,
title = {BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research},
author = {Wang, Zifeng and Danek, Benjamin and Sun, Jimeng},
year = {2025},
}