Public Datasets | PURRlab @ IT University of Copenhagen

Banner Image — Information is extracted from research papers to perform meta-studies on how public datasets are used and documented.

Description

Medical imaging papers often focus on methodology, but the quality of the algorithms and the validity of the conclusions depend heavily on the datasets used. Because creating datasets requires a lot of effort, researchers often rely on publicly available ones. In this project we analysed how these public medical datasets are used and documented in the scientific literature. Our findings showed that research papers concentrate on a few publicly available datasets and that there is no standard way to acknowledge dataset usage. We also created open-source tools to collect citations and mentions of datasets in scientific papers.

Several students have done work related to this project:

Ahmet Akkoç studied how datasets are cited and mentioned in PMLR venues
Stinna Winther did a systematic review to identify and study chest x-ray datasets
Christine Lyngbye Galsgaard carried out qualitative and quantitative analyses of datasets, affiliations, references and ethics. For example, she showed an increase in the usage of public over private datasets
Caroline Vang-Larsen studied URLs in NeuroImage papers from 2022, focusing on their patterns, contexts, and usage to understand data use and reuse dynamics in neuroimaging. She found that the articles include many URLs, but that only a minority of them link to specific datasets
Yasmin Sarkhosh conducted a systematic review of research presented at the 2023 MICCAI conference to analyze the datasets used and to understand the extent to which demographic data is incorporated during model training

People

Théo Sourget, Veronika Cheplygina.

Funding

Danish Data Science Academy DDSA-V-2022-004 and DFF (Independent Research Fund Denmark) Inge Lehmann 1134-00017B

References

[Citation needed] Data usage and citation practices in medical imaging conferences

Théo Sourget, Ahmet Akkoç, Stinna Winther, Christine Lyngbye Galsgaard, Amelia Jiménez-Sánchez, Dovile Juodelyte, Caroline Petitjean, and Veronika Cheplygina

In Medical Imaging with Deep Learning, 2024

@inproceedings{sourget2024citation,
  title = {[Citation needed] Data usage and citation practices in medical imaging conferences},
  author = {Sourget, Th{\'e}o and Akko{\c{c}}, Ahmet and Winther, Stinna and Galsgaard, Christine Lyngbye and Jim{\'e}nez-S{\'a}nchez, Amelia and Juodelyte, Dovile and Petitjean, Caroline and Cheplygina, Veronika},
  booktitle = {Medical Imaging with Deep Learning},
  year = {2024},
}