potential multi-center issue  

  By: Flute on Jan. 3, 2022, 2:58 a.m.

Hi, I am writing to ask whether it would be appropriate to request more information about how the training set and the hidden test1 + test2 sets were originally arranged when this contest was set up.

I am wondering: when assembling the given training and test sets, did you mix together, or balance out, the data collected from different sources across the two sets? I saw one discussion saying that the training data is X-ray data while the test data is generated from CT.

The reason I am asking is that, while working on this project, my model performs really well on my own validation split, but it has never worked well after being uploaded and evaluated on the test set. In fact, my results on the test set are quite poor.

Having seen this behavior from several of my models after uploading to GC, I am starting to suspect that the training and test data have a multi-center issue; specifically, that the training data differs so much from the test set (or that the test set contains information unseen in the training set) that no algorithm could generalize well to the test set, no matter how thoroughly it learns the training set.

This is just my personal opinion and guess; forgive me if I am wrong :)

Flute

Re: potential multi-center issue  

  By: ecemsogancioglu on Jan. 4, 2022, 11:40 p.m.

Hi Flute,

Thanks for your question; I think it is an important discussion and I am glad you asked.

I think there is some misunderstanding which was probably caused by my previous post regarding the test and training data. Let me try to clarify it.

Both the training and test data are chest X-ray data. The training data comes from multiple hospitals (at least four different ones), and the test data also comes from multiple hospitals, different from those where the training data was acquired (https://node21.grand-challenge.org/Data/). So the two sets are not different in kind; they simply come from different hospitals. While annotating the test data, we provided the radiologists with the corresponding CT scans of the same patients in addition to the CXR images, in order to increase annotation quality. So CT data was used only to verify the nodule annotations, as supplementary information. For the annotation of the training data, we provided only the CXR images, since corresponding CT data was not available.

So what we perform is an external validation of the models submitted to NODE21: how does the model behave when evaluated on an external set? When we develop a model, we aim to deploy the system in different hospitals; therefore, external validation is desirable to obtain a more realistic performance measure (and it is highly encouraged when writing research papers, too :)).
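
For anyone who wants to estimate this effect locally before submitting, a minimal sketch of a source-grouped split is below. It assumes you keep a hypothetical metadata.csv mapping each training image to the dataset/hospital it originally came from; the column names "img_name", "label", and "source" are placeholders, not part of the official NODE21 files:

```python
# Minimal sketch: hold out whole sources/hospitals from the training data to
# mimic the external-validation setup described above.
# Assumes a hypothetical metadata.csv with columns "img_name", "label",
# and "source" (originating dataset/hospital) -- adapt to your own files.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

meta = pd.read_csv("metadata.csv")

# Keep sources together: every image from a given hospital ends up entirely
# in either the internal-training or the internal-validation fold.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(meta, groups=meta["source"]))

train_df, val_df = meta.iloc[train_idx], meta.iloc[val_idx]
print("training sources:  ", sorted(train_df["source"].unique()))
print("validation sources:", sorted(val_df["source"].unique()))
```

Holding out whole sources this way usually gives a less optimistic estimate than a random image-level split, which is closer to how the hidden test set behaves.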

It is expected that performance on the test data will be lower than on the validation set. For example, the baseline method achieved an AUC above 90% on our validation set, but 84% AUC on the experimental test set. I am curious: does your model also achieve that high a performance on the validation set?
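
If you would like to sanity-check how unusual your own validation-to-test gap is, a rough sketch along these lines may help: it reports the validation AUC together with a simple bootstrap interval (y_true and y_score are placeholders for your own validation labels and model scores). If the external AUC falls far below that interval, the drop is larger than sampling noise alone would explain:

```python
# Rough sketch: validation AUC plus a simple bootstrap interval, to judge how
# far an external test AUC (e.g. 0.84) falls outside the expected spread.
# y_true / y_score below are synthetic placeholders -- use your own arrays.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                                 # replace with your labels
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.25, 500), 0, 1)    # replace with your scores

print("validation AUC:", roc_auc_score(y_true, y_score))

aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
    if len(np.unique(y_true[idx])) < 2:               # AUC needs both classes present
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

low, high = np.percentile(aucs, [2.5, 97.5])
print(f"bootstrap 95% interval: [{low:.3f}, {high:.3f}]")
```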

Regarding your comment on 'having multi-center issues', I think the baseline results already show that we can achieve high performance using this training data and that it generalizes well to the test set. Other participants have also been able to achieve performance similar to or higher than this using the NODE21 training data. In any application we develop with machine learning, one could always argue that more training examples would further improve the results, because it may not be possible to collect a large number of training images covering all the different nodule characteristics (shape, density, location of nodules, etc.). But that is a different question from the issue of generalizability, and it is what the generation track attempts to address :)

I hope this clarifies things a bit. Let us know if anything is unclear.

Best, Ecem
