📝 MedCalc-Bench: Evaluating Large Language Models for Medical Calculations
ft. Nikhil Khandekar, Soren Dunn, and the National Institutes of Health
We are excited to have collaborated on MedCalc-Bench with the National Library of Medicine at the National Institutes of Health.
“This was a highly interdisciplinary project, where we had not only individuals from computational backgrounds, but also MD students from Yale, UIC, UChicago, and Rosalind Franklin University who helped us ensure that the patient notes in the dataset were appropriate for each calculation task.”
- Nikhil Khandekar¹, student researcher
Paper: https://arxiv.org/pdf/2406.12036 (under review)
MedCalc-Bench is the first computation-based benchmark for evaluating LLMs in clinical settings.
“Problems can require anywhere from 2 to 31 steps, and are focused on either equation-based or rule-based calculators from MDCalc.com. Each instance in the dataset contains a patient note, a question asking for a specific medical value, the final ground-truth answer, and a step-by-step explanation of how the answer was obtained. Hence, our dataset can be seen as the equivalent of the GSM8K benchmark, but for medical settings. In total, there are 10,053 training instances and 1,047 test instances.”
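To make the calculation tasks concrete, here is a minimal sketch of one well-known equation-based calculator from MDCalc, the Cockcroft-Gault creatinine clearance. The formula itself is standard; its use here as an example of a benchmark task is our illustration, not taken from the paper.

```python
def cockcroft_gault_crcl(age: int, weight_kg: float, scr_mg_dl: float, female: bool) -> float:
    """Cockcroft-Gault creatinine clearance in mL/min.

    CrCl = ((140 - age) * weight) / (72 * serum creatinine), multiplied by 0.85 for females.
    """
    crcl = ((140 - age) * weight_kg) / (72 * scr_mg_dl)
    return crcl * 0.85 if female else crcl

# A 60-year-old, 70 kg female with a serum creatinine of 1.0 mg/dL:
print(round(cockcroft_gault_crcl(60, 70.0, 1.0, female=True), 1))  # 66.1
```

In the benchmark, the model must first extract the relevant values (here age, weight, and creatinine) from the patient note before plugging them into the formula, which is what makes the task harder than the arithmetic alone.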
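For readers who want to poke at the data, the sketch below loads the dataset with the Hugging Face `datasets` library. The dataset ID and column names are our assumptions about the public release, not something stated in this post; inspect `column_names` to confirm them.

```python
from datasets import load_dataset

# Assumed Hugging Face dataset ID for the public release; verify against the paper's repo.
ds = load_dataset("ncbi/MedCalc-Bench-v1.0")
print(ds)  # expected: a train split (10,053 rows) and a test split (1,047 rows)

example = ds["test"][0]
# Column names below are assumptions; check ds["test"].column_names to confirm.
for key in ("Patient Note", "Question", "Ground Truth Answer", "Ground Truth Explanation"):
    print(f"--- {key} ---")
    print(example.get(key, "<column name differs in the actual release>"))
```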
Compute resources for the OpenAI models were provided by the NIH. For inference-only evaluation of the open-source models, compute at the University of Virginia was used. Lastly, resources on the Delta cluster, provided to us by Lapis Labs, were used to evaluate the performance of Llama-2-7B and Mistral-7B fine-tuned on our training dataset.