Public Datasets

Studying public datasets in scientific literature

Description

Medical imaging papers often focus on methodology, but the quality of the algorithms and the validity of the conclusions depend heavily on the datasets used. Because creating a dataset requires considerable effort, researchers often rely on publicly available ones. In this project we analysed how these public medical datasets are used and documented in the scientific literature. Our findings showed that research papers concentrate on a few publicly available datasets and that there is no standard way to acknowledge dataset usage. We also created open-source tools to collect citations and mentions of datasets in scientific papers.
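As an illustration only, and not the project's actual tooling, the idea of collecting dataset mentions could be sketched as follows. The alias table, the function names, and the example dataset names are assumptions made for this sketch; a real pipeline would also need to handle citations, full-text extraction, and ambiguous names.

```python
import re
from collections import Counter

# Hypothetical alias table: canonical dataset name -> spellings to search for.
# The entries are examples chosen for this sketch, not taken from the project.
DATASET_ALIASES = {
    "ChestX-ray14": ["ChestX-ray14", "ChestXray14", "NIH Chest X-ray"],
    "CheXpert": ["CheXpert"],
    "BraTS": ["BraTS"],
}


def find_mentions(text, aliases=DATASET_ALIASES):
    """Return the set of canonical dataset names mentioned in one paper's text."""
    found = set()
    for name, spellings in aliases.items():
        pattern = "|".join(re.escape(s) for s in spellings)
        # Lookarounds keep a name from matching inside a longer token.
        if re.search(rf"(?<!\w)(?:{pattern})(?!\w)", text, flags=re.IGNORECASE):
            found.add(name)
    return found


def count_papers_per_dataset(papers):
    """Count, for each dataset, how many papers mention it at least once."""
    counts = Counter()
    for text in papers:
        counts.update(find_mentions(text))
    return counts
```

Calling `count_papers_per_dataset` on a list of paper texts gives a per-dataset paper count, from which the concentration on a few datasets can be read off. Plain string matching of this kind misses datasets that are only cited and not named, which is one reason mentions and citations are worth collecting separately.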

Several students have done work related to this project:

  • Ahmet Akkoç studied how datasets are cited and mentioned in PMLR venues
  • Stinna Winther did a systematic review to identify and study chest X-ray datasets
  • Christine Lyngbye Galsgaard carried out qualitative and quantitative analyses of datasets, affiliations, references and ethics. She showed, for example, an increase in the usage of public over private datasets
  • Caroline studied URLs in NeuroImage papers from 2022, focusing on their patterns, contexts, and usage to understand data use and reuse in neuroimaging. She found that most articles include URLs, but only a minority of those URLs link to specific datasets
  • Yasmin Sarkhosh conducted a systematic review of research presented at the 2023 MICCAI conference to analyze the datasets used and to understand the extent to which demographic data is incorporated during model training

People

Théo Sourget, Veronika Cheplygina.

Funding

Danish Data Science Academy (DDSA-V-2022-004) and DFF (Independent Research Fund Denmark) Inge Lehmann 1134-00017B.

References

  1. Théo Sourget, Ahmet Akkoç, Stinna Winther, Christine Lyngbye Galsgaard, Amelia Jiménez-Sánchez, and 3 more authors
    In Medical Imaging with Deep Learning, 2024