Hi Flute,
Thanks for your question; I think it is an important discussion and I am glad you asked.
I think there is some misunderstanding, probably caused by my previous post, regarding the test and training data. Let me try to clarify it.
Both the training and test data are chest X-ray data. The training data comes from multiple hospitals (at least four different hospitals), and the test data also comes from multiple hospitals, different from the ones where the training data was acquired (https://node21.grand-challenge.org/Data/). So the two sets are not different in nature; they simply come from different hospitals. While annotating the test data, we also provided the radiologists with the corresponding CT scans of the same patients alongside the CXR images in order to increase the annotation quality, so the CT data was used only as supplementary information to verify the nodule annotations. For the annotation of the training data, we provided only the CXR images, since their corresponding CT data was not available.
So what we perform is an external validation of the models submitted to NODE21: we measure how a model behaves when evaluated on an external set. When we develop a model, we aim to deploy it in different hospitals; therefore, external validation is desirable to obtain a more realistic performance measure (and it is highly encouraged when writing research papers, too :)).
It is expected that the performance on the test data will be lower than on the validation set. For example, the baseline method achieved an AUC above 90% on our validation set, but 84% AUC on the experimental test set. I am curious: does your model also achieve that high a performance on the validation set?
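If it helps to make this comparison concrete, here is a minimal sketch in Python using scikit-learn's roc_auc_score. It is not the official NODE21 evaluation code; the variable names and synthetic numbers are purely illustrative stand-ins for your own model's predictions on an internal validation split and on an external test set from other hospitals.

```python
# Minimal sketch (not the official NODE21 evaluation code): comparing a model's
# AUC on an internal validation split vs. an external test set from other hospitals.
# y_val / val_scores and y_ext / ext_scores are hypothetical placeholders; in
# practice the scores would come from something like model.predict_proba(X)[:, 1].
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic internal-validation labels and scores (scores loosely track the labels).
y_val = rng.integers(0, 2, size=200)
val_scores = np.clip(y_val * 0.4 + rng.normal(0.5, 0.2, 200), 0, 1)

# Synthetic external-test labels and scores (weaker correlation, mimicking a domain shift).
y_ext = rng.integers(0, 2, size=200)
ext_scores = np.clip(y_ext * 0.25 + rng.normal(0.5, 0.25, 200), 0, 1)

print(f"Internal validation AUC: {roc_auc_score(y_val, val_scores):.3f}")
print(f"External test AUC:       {roc_auc_score(y_ext, ext_scores):.3f}")
# A drop from the first number to the second is the expected effect of
# evaluating on data from hospitals not seen during training.
```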
Regarding your comment about 'having multi-center issues': I think the baseline results already show that we can achieve high performance using this training data and that it generalizes well to the test set. Other participants were also able to achieve a performance level similar to or higher than this using the NODE21 training data. In any application we develop using machine learning, we could always argue that the results would further improve with more training examples, since it might not be possible to collect a number of training images large enough to cover all the different nodule characteristics (shape, density, location of nodules, etc.). But that is a different question from generalizability, and it is what the generation track attempts to address :)
I hope this clarifies things a bit. Let us know if anything is unclear.
Best, Ecem