Data Preparation and Training Resources  

  By: capparella.1746513 on July 12, 2022, 6:25 p.m.

Hello!

This is the first medical imaging challenge I have joined. I am participating for research, as part of my Master's thesis, so I am working solo and with limited resources. For these reasons I have some doubts about data preparation and training resources:

1) I saw on the MIC-DKFZ nnU-Net GitHub page that nnU-Net requires both a folder structure and a data format close to the ones used for the Medical Segmentation Decathlon (MSD). I see some similarities with yours (as shown in picai_baseline and picai_prep), but also some parts that do not match:

From the MIC-DKFZ dataset conversion instructions:

"...Imaging modalities are identified by nnU-Net by their suffix: a four-digit integer at the end of the filename. Imaging files must therefore follow the following naming convention: case_identifier_XXXX.nii.gz. Hereby, XXXX is the modality identifier..." and they refer explicitly to the MSD 'BrainTumor' task modalities, i.e.: FLAIR (0000), T1w (0001), T1gd (0002) and T2w (0003). Their codes (in particular 0000, 0001 and 0002) match the ones created for the files in 'imagesTr' (created with prepare_data.py), but do not match the actual modalities used in this challenge. So: did I set 'prepare_data' setting in a bad way? Am I missing something? Can these codes have arbitrary meaning according to the challenge?

2) I have very limited resources (a GTX 1050, 8 GB, mobile version) and I have not been guaranteed anything more, so I was wondering whether continuing with this challenge is feasible for me: do you have any benchmarks, expected minimum requirements, and so on?

3) When I ran 'prepare_data.py', it took nearly 2 hours to complete the archive generation: is that normal? Is it due to my resources or to some bad setting?

Thanks in advance for your help, hope to hear from you soon!

Mattia

Re: Data Preparation and Training Resources  

  By: joeran.bosma on July 13, 2022, 10:30 a.m.

Hi Mattia,

Welcome to the PI-CAI challenge!

1) We indeed follow the same structure as nnU-Net/MSD. The modality codes do not pertain to specific MR/CT modalities, but should be consistent throughout the dataset. In our case, we include the axial T2-weighted (T2W/0000), apparent diffusion coefficient (ADC/0001) and high b-value (HBV/0002) scans. This is described in the dataset.json, which comes from the generate_mha2nnunet_settings function within picai_prep:

    "modality": {
        "0": "T2W",
        "1": "CT",
        "2": "HBV"
    }

So indeed, the codes can have arbitrary meaning. Please note that we gave the ADC scans a different name, “CT”, to trigger nnU-Net’s dataset-wise normalization (see their paper for more details).
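If you want to verify the mapping for your own conversion, you can inspect the generated dataset.json. A minimal sketch (the task folder name below is an assumption, adjust it to your setup):

    import json
    from pathlib import Path

    # Path to the converted nnU-Net task folder (assumed name; adjust as needed).
    task_dir = Path("nnUNet_raw_data/Task2201_picai_baseline")

    with open(task_dir / "dataset.json") as f:
        dataset = json.load(f)

    # Maps the modality index (the _0000/_0001/_0002 filename suffix) to its name.
    print(dataset["modality"])  # expected: {"0": "T2W", "1": "CT", "2": "HBV"}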

2) Training the baseline models from scratch takes a long time, even on an RTX 2080 Ti. For us, this took about 1.5 weeks for all five nnU-Net folds, 1 week for all five nnDetection folds and 1 week for all five U-Net folds. Although your GTX 1050 would be able to train these models (8GB VRAM is sufficient), this will take multiple weeks, which is probably infeasible.

I think it is still possible to participate in this challenge by leveraging pretrained models. You can fine-tune the baseline models, which are provided in their associated GitHub repositories:

1. nnDetection (semi-supervised)
2. U-Net (semi-supervised)
3. nnU-Net (semi-supervised)
4. nnU-Net (supervised)
5. U-Net (supervised)
6. nnDetection (supervised)

See e.g. this GitHub issue on fine-tuning nnU-Net.

Another way to cut down on compute is to train a single fold only, which is often sufficient for method development (see the sketch below).
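For illustration, training a single fold starting from the downloaded baseline weights could look roughly like this with plain nnU-Net v1. This is a sketch, not the exact baseline command: the task name and checkpoint path are assumptions, and the -pretrained_weights option requires a recent nnU-Net v1 release:

    # Fine-tune fold 0 only, initializing from pretrained baseline weights.
    nnUNet_train 3d_fullres nnUNetTrainerV2 Task2201_picai_baseline 0 \
        -pretrained_weights /path/to/model_final_checkpoint.model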

Some adaptations you could try that require little computational resources:

- Crop to the prostate, using the provided automatic prostate segmentations (see the sketch below).
- Register the different sequences and fine-tune a baseline solution. A significant part of the PI-CAI Training and Development dataset contains cases with severe misalignment, and well-aligned studies may improve model convergence. When registering, we advise moving the annotation along with the ADC/HBV sequences. Note: we manually registered the validation and test cases.
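Regarding the cropping option, below is a minimal sketch using SimpleITK, assuming the automatic segmentation stores the whole gland as label 1; the file names are illustrative. The same crop should be applied to all three sequences and to the annotation.

    import SimpleITK as sitk

    # Illustrative file names; substitute your own case files.
    image = sitk.ReadImage("10000_1000000_0000.nii.gz")          # e.g. the T2W scan
    mask = sitk.ReadImage("10000_1000000_prostate_seg.nii.gz")   # whole-gland segmentation

    # Compute the bounding box of the prostate (label 1) in voxel coordinates.
    stats = sitk.LabelShapeStatisticsImageFilter()
    stats.Execute(sitk.Cast(mask, sitk.sitkUInt8))
    x, y, z, sx, sy, sz = stats.GetBoundingBox(1)

    # Expand the box by a margin (in voxels), clipped to the image extent.
    margin = 10
    start = [max(0, v - margin) for v in (x, y, z)]
    stop = [min(image.GetSize()[i], (x, y, z)[i] + (sx, sy, sz)[i] + margin)
            for i in range(3)]

    # Slicing a SimpleITK image crops it while preserving spacing and origin.
    cropped = image[start[0]:stop[0], start[1]:stop[1], start[2]:stop[2]]
    sitk.WriteImage(cropped, "10000_1000000_0000_cropped.nii.gz")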

3) This can vary depending on your hardware and setup (local HDD/SSD or network share). A dataset conversion time of 2 hours is not exceptionally long. You can monitor your CPU/IO usage to identify any bottleneck, if present.

Hope this helps, Joeran

Re: Data Preparation and Training Resources  

  By: capparella.1746513 on July 17, 2022, 11:39 a.m.

Thank you for the clarifications and tips! I'll try to do my best!

Re: Data Preparation and Training Resources  

  By: joeran.bosma on Aug. 12, 2022, 10:26 a.m.

Hi Mattia,

I came across the Lifelong-nnUNet repository, which may be interesting for Sequential Training, as this is essentially the same as fine-tuning the provided baseline models.

Hope this helps, Joeran