📝 MedCalc-Bench: Evaluating Large Language Models for Medical Calculations
ft. Nikhil Khandekar, Soren Dunn, and the National Institutes of Health
We are excited to have collaborated on MedCalc-Bench with the National Library of Medicine at the National Institutes of Health.
“This was a highly interdisciplinary project, where we had not only individuals from computational backgrounds, but also MD students from Yale, UIC, UChicago, and Rosalind Franklin University who helped us ensure that the patient notes in the dataset were appropriate for each calculation task.”
- Nikhil Khandekar¹, student researcher
Paper: https://arxiv.org/pdf/2406.12036 (under review)
MedCalc-Bench is the first computation-based benchmark for evaluating LLMs in clinical settings.
“Problems can require anywhere from 2 to 31 steps, and are focused on either equation-based or rule-based calculators from MDCalc.com. Each instance in the dataset contains a patient note, a question asking for a specific medical value, the final ground-truth answer, and a step-by-step explanation of how the answer was obtained. Hence, our dataset can be seen as the equivalent of the GSM8K benchmark, but for medical settings. In total, there are 10,053 training instances and 1,047 test instances.”
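To make the calculation tasks concrete, here is a minimal sketch of one well-known equation-based calculator from MDCalc, the Cockcroft-Gault creatinine clearance. The formula itself is standard; its use here as an example of a benchmark task is our illustration, not taken from the paper.

```python
def cockcroft_gault_crcl(age: int, weight_kg: float, scr_mg_dl: float, female: bool) -> float:
    """Cockcroft-Gault creatinine clearance in mL/min.

    CrCl = ((140 - age) * weight) / (72 * serum creatinine), multiplied by 0.85 for females.
    """
    crcl = ((140 - age) * weight_kg) / (72 * scr_mg_dl)
    return crcl * 0.85 if female else crcl

# A 60-year-old, 70 kg female with a serum creatinine of 1.0 mg/dL:
print(round(cockcroft_gault_crcl(60, 70.0, 1.0, female=True), 1))  # 66.1
```

In the benchmark, the model must first extract the relevant values (here age, weight, and creatinine) from the patient note before plugging them into the formula, which is what makes the task harder than the arithmetic alone.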
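For readers who want to poke at the data, the sketch below loads the dataset with the Hugging Face `datasets` library. The dataset ID and column names are our assumptions about the public release, not something stated in this post; inspect `column_names` to confirm them.

```python
from datasets import load_dataset

# Assumed Hugging Face dataset ID for the public release; verify against the paper's repo.
ds = load_dataset("ncbi/MedCalc-Bench-v1.0")
print(ds)  # expected: a train split (10,053 rows) and a test split (1,047 rows)

example = ds["test"][0]
# Column names below are assumptions; check ds["test"].column_names to confirm.
for key in ("Patient Note", "Question", "Ground Truth Answer", "Ground Truth Explanation"):
    print(f"--- {key} ---")
    print(example.get(key, "<column name differs in the actual release>"))
```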
Compute resources for the OpenAI models were provided by the NIH. For inference-only evaluation of the open-source models, compute at the University of Virginia was used. Lastly, resources on the Delta cluster, provided to us by Lapis Labs, were used to evaluate the performance of Llama-2-7B and Mistral-7B fine-tuned on our training dataset.