Automatic extraction and validation of reported performances in ML papers

Assessing and comparing the performance of AI methods is difficult: the number of papers grows every year, and each paper claims an improvement over previous methods.

Tracking which datasets each paper uses is itself difficult, which in turn makes it hard to collect all the results reported on a particular dataset. Furthermore, the results from a paper are sometimes hard to reproduce because the code has not been released.

Machine learning algorithms are often evaluated using only a single performance metric, such as the AUC for classification tasks or the Dice score for segmentation tasks. However, it is unclear whether an improvement on these metrics makes a real difference from a clinical perspective, especially given the statistically significant but small differences sometimes reported in research papers.
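For reference, the two metrics named above can be written out in a few lines. This is a minimal sketch in plain Python using the standard textbook definitions; it is not taken from any of the papers listed below, and the function names `dice_score` and `auc` as well as the toy inputs are illustrative assumptions.

```python
def dice_score(pred, target):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks.

    Both masks are flat sequences of 0/1 values of equal length.
    Returning 1.0 for two empty masks is a convention assumed here.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def auc(scores, labels):
    """AUC as the probability that a random positive case receives a
    higher score than a random negative case (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy inputs (hypothetical, for illustration only).
example_dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Both functions return a single number between 0 and 1, which is exactly why a small gain in either says little by itself about clinical relevance.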

The goal of this project is therefore to (semi-)automatically extract the results reported on a dataset from research papers, study how performance has evolved, and assess whether improvements in performance metrics have an impact from a clinical perspective.

Related papers:

  • Sourget, T., Akkoç, A., Winther, S., Galsgaard, C.L., Jiménez-Sánchez, A., Juodelyte, D., Petitjean, C., Cheplygina, V.: [Citation needed] data usage and citation practices in medical imaging conferences. In: Medical Imaging with Deep Learning (MIDL), in press (2024)
  • Varoquaux, G., Cheplygina, V.: Machine learning for medical imaging: methodological failures and recommendations for the future. npj Digital Medicine 5(1), 1–8 (2022)
  • Christodoulou, E., Ma, J., Collins, G.S., Steyerberg, E.W., Verbakel, J.Y., Van Calster, B.: A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of clinical epidemiology 110, 12–22 (2019)
  • Christodoulou, E., Reinke, A., Andrè, P., Godau, P., Kalinowski, P., Houhou, R., Erkan, S., Sudre, C.H., Burgos, N., Boutaj, S., et al.: False promises in medical imaging AI? Assessing validity of outperformance claims. arXiv preprint arXiv:2505.04720 (2025)
  • Christodoulou, E., Reinke, A., Houhou, R., Kalinowski, P., Erkan, S., Sudre, C.H., Burgos, N., Boutaj, S., Loizillon, S., Solal, M., et al.: Confidence intervals uncovered: Are we ready for real-world medical imaging AI? In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 124–132. Springer (2024)
  • Kardas, M., Czapla, P., Stenetorp, P., Ruder, S., Riedel, S., Taylor, R., Stojnic, R.: AxCell: Automatic extraction of results from machine learning papers. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 8580–8594 (2020)
  • Simkó, A.: Automating reproducibility in medical imaging with deep learning. In: Medical Imaging with Deep Learning-Short Papers (2025)
  • Simkó, A., Garpebring, A., Jonsson, J., Nyholm, T., Löfstedt, T.: Reproducibility of the methods in medical imaging with deep learning. arXiv preprint arXiv:2210.11146 (2022)
  • Singh, M., Sarkar, R., Vyas, A., Goyal, P., Mukherjee, A., Chakrabarti, S.: Automated early leaderboard generation from comparative tables. In: Advances in Information Retrieval: 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14–18, 2019, Proceedings, Part I 41. pp. 244–257. Springer (2019)