Public Training and Development Dataset: Updates and Fixes

  By: anindo on May 13, 2022, 12:05 a.m.

The PI-CAI: Public Training and Development Dataset, consisting of 1500 multi-center, multi-vendor cases, is now online! Learn more about all the considerations that we've made to curate and release the all-new largest public training dataset for prostate cancer detection in MRI: pi-cai.grand-challenge.org/DATA/. Please monitor this thread for all updates and fixes regarding this dataset.

Imaging data has been released via: zenodo.org/record/6624726 (DOI: 10.5281/zenodo.6624726). Annotations have been released and are maintained via: github.com/DIAGNijmegen/picai_labels.

Updates since v1.0:

  • Diffusion b-value for all high b-value DWI scans, as present in the DICOM attribute (0018,9087).
  • File 10121_1000121_t2w.mha (i.e. T2W imaging for study ID 1000121, under patient ID 10121) was corrupted.
  • Folder 10403 (including all imaging for study ID 1000409, under patient ID 10403) was missing.
  • Clinical outcome lesion_GS for study 1001040 under patient 11020, as stated in the overall clinical information marksheet, was missing.
  • High b-value DWI scan for study 1000715 under patient 10699 was incorrect.
  • Clinical variable patient_age, as stated in the metadata/header of each MRI scan and the overall clinical information marksheet, was inconsistent or incorrect for nearly half of all training cases.
  • Intensity values for Philips-based MRI scans were incorrectly rescaled during conversion from DICOM to MHA (see #1766 for more details).

Pending Updates (scheduled to be added in 2-4 weeks):

Pending Fixes: None at the moment. Please do not hesitate to let us know if you come across any other issues!

 Last edited by: anindo on Aug. 15, 2023, 12:56 p.m., edited 33 times in total.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: enslay on May 25, 2022, 12:08 a.m.

Hello, thanks for organizing this challenge! This is a massive data set! Your Philips Ingenia scanner has tricked your data preprocessor into scaling ADC in a strange way (off by a factor of 1000). There's a rescale/intercept tag in the original DICOM that is causing your DICOM reader to do something strange to the ADC images. I've seen this before (although I don't remember the scanners responsible for it!).

Examples: 10541 10142 10615 10173

If you're using ITK, you can query GetInternalComponentType() and have the ImageSeriesReader use that pixel type. If you're using SimpleITK, you can call SetOutputPixelType(sitk.sitkInt16) on the ImageSeriesReader to force it to ignore the rescale tags. I don't think you can directly access the GDCMImageIO object in SimpleITK.

This is briefly documented for ITK: https://itk.org/Doxygen/html/classitk_1_1GDCMImageIO.html#acad07a68b36324f1d8d2ac87fe957838

The GDCMImageIO is the only ImageIO that has this extra functionality! And it's there because the ordinary GetComponentType() otherwise honors rescale tags and reports that the pixel type is floating point when those tags are present.
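For readers less familiar with the rescale tags: DICOM defines the output value as stored value × Rescale Slope (0028,1053) + Rescale Intercept (0028,1052). The plain-Python sketch below uses made-up numbers to show how applying a slope shrinks stored values, and how the stored values can be recovered when the slope and intercept are known. The function names and example values are illustrative assumptions; this is not the conversion code used for the dataset.

```python
def apply_rescale(stored_values, slope, intercept=0.0):
    """Map stored pixel values to output values: stored * slope + intercept."""
    return [v * slope + intercept for v in stored_values]


def undo_rescale(output_values, slope, intercept=0.0):
    """Invert apply_rescale to recover the stored pixel values."""
    if slope == 0:
        raise ValueError("slope must be non-zero")
    return [(v - intercept) / slope for v in output_values]


# Made-up stored ADC values and a made-up slope of 1/2000 (assumptions).
stored_adc = [800, 1200, 1600]
example_slope = 1 / 2000

rescaled = apply_rescale(stored_adc, example_slope)   # much smaller values
recovered = undo_rescale(rescaled, example_slope)     # back to the stored range
```

Undoing the rescale this way only works if the slope and intercept are still known, which is not the case once only the converted MHA files are available.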

I would also appreciate the tag (0018,9087) for the hbv images. But I can live without it.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: joeran.bosma on May 25, 2022, 9:03 a.m.

Hi Nathan, thanks for bringing this to our attention! Great to have such a detailed description and solution. We will address this issue in the upcoming data update at the end of this month.

Also, we’ll look into the b-value tag 0018,9087, and see if we can incorporate this information.

Lack of some coronal and sagittal series  

  By: jakub.mitura14 on June 9, 2022, 7:27 a.m.

In some studies I cannot find the coronal and sagittal series (part of this is mentioned above). Is this also the case for other studies?

  • No coronal (cor) series: 1000792, 1000960, 1001240
  • No sagittal (sag) series: 1000116, 1000707, 1000792, 1000960

Additionally, is there access to the original DICOM series, or only MHA?

Re: Public Training and Development Dataset: Updates and Fixes  

  By: anindo on June 9, 2022, 9:49 a.m.

Hi Jakub Mitura,

Coronal and sagittal sequences are optional sequences that are not present for all cases in the training datasets. Please have a look at this forum thread for more details.
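To list which cases lack the optional sequences, a minimal sketch along the following lines may help. It assumes the files follow the [patient_id]_[study_id]_[sequence].mha naming seen elsewhere in this thread, and that coronal and sagittal scans use the suffixes cor and sag; the function name and the suffixes are assumptions, so adjust them to the actual download.

```python
from pathlib import Path

# Suffixes for the optional sequences (assumed, see the note above).
OPTIONAL = ("cor", "sag")


def find_missing_sequences(root, suffixes=OPTIONAL):
    """Return {case_id: [missing suffixes]} for cases lacking any of `suffixes`."""
    present = {}
    for path in Path(root).rglob("*.mha"):
        parts = path.stem.split("_")
        if len(parts) != 3:
            continue  # skip files that do not follow the naming pattern
        patient_id, study_id, suffix = parts
        present.setdefault(f"{patient_id}_{study_id}", set()).add(suffix)

    missing = {}
    for case_id, found in sorted(present.items()):
        lacking = [s for s in suffixes if s not in found]
        if lacking:
            missing[case_id] = lacking
    return missing
```

Calling find_missing_sequences on the folder that holds the extracted images returns a dictionary keyed by patient and study ID. A case only appears in the result if at least one of its files was found.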

Hope this helps.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: joeran.bosma on June 9, 2022, 10:30 a.m.

Hi Jakub Mitura,

The original DICOM series will not be released, only their MHA counterparts.

Kind regards, Joeran

Re: Public Training and Development Dataset: Updates and Fixes  

  By: jakub.mitura14 on June 10, 2022, 1:19 p.m.

All clear thanks!

Re: Public Training and Development Dataset: Updates and Fixes  

  By: joeran.bosma on June 12, 2022, 4:44 a.m.

We have released version 2.0 of the PI-CAI: Public Training and Development Dataset! The updated dataset can be downloaded from Zenodo. The updated annotations can be found here: github.com/DIAGNijmegen/picai_labels.

Changes with respect to version 1.0:

  • Philips Ingenia scans were incorrectly rescaled using their rescale slope and intercept. This caused their intensity values to be approx. 2 or 2000 times smaller than intended. This issue has been fixed now.
  • Patient age was off by some years for a subset of cases. This has been corrected now. See the clinical marksheet or DICOM tag 0010,1010 for the patient’s age.
  • All high b-value scans now carry the b-value at which the scan was calculated or acquired. See DICOM tag 0018,9087 in each respective [patient_id]_[study_id]_hbv.mha scan.
  • Cases 10147_1000149 and 10551_1000563 have been replaced. These cases were duplicates of each other and coincidentally had imaging artefacts induced by hip prostheses. Two new cases without any patient overlap with other cases have been added (at the same PI-CAI IDs as the replaced scans).
  • The high b-value scan of patient 10699, case 1000715 (10699_1000715_hbv.mha) has been corrected.
  • Metadata for registered axial T2-weighted scans was missing; this is fixed now.
  • The corrupted axial T2-weighted scan of patient 10121, case 1000121 (10121_1000121_t2w.mha) is now fixed.
  • Case 1000409 of patient 10403 was missing and has been included now.
  • The lesion_GS field was missing for patient 11020, case 1001040. This has been added to the marksheet.

This addresses all issues raised by the community or internally so far. Please do not hesitate to let us know if you come across any other issues!
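For anyone who wants to inspect the metadata mentioned in the change list above (such as the b-value or patient age) without extra dependencies, here is a minimal sketch that reads the text header of an .mha file. It relies on the MetaImage convention that the header consists of "key = value" lines ending with the ElementDataFile entry. The exact key under which each DICOM tag is stored is not stated in this thread, so the sketch simply returns every header entry; the function name is an assumption.

```python
def read_mha_header(path):
    """Return the text header of a MetaImage (.mha) file as a dict of strings."""
    header = {}
    with open(path, "rb") as f:
        for raw_line in f:
            line = raw_line.decode("ascii", errors="replace").strip()
            if "=" not in line:
                continue
            key, value = (part.strip() for part in line.split("=", 1))
            header[key] = value
            if key == "ElementDataFile":
                break  # pixel data (or a reference to it) follows this entry
    return header
```

Printing the returned dictionary for a file such as 10699_1000715_hbv.mha should show which header entry carries the b-value.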

 Last edited by: anindo on Aug. 15, 2023, 12:56 p.m., edited 1 time in total.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: joeran.bosma on June 14, 2022, 10:55 a.m.

The annotation for study 1000707 of patient 10691 was resampled to the wrong T2-weighted scan (as identified by Christoph). I have resampled this annotation to the correct scan now, please update picai_labels to fetch the correct annotation.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: enslay on June 15, 2022, 3:43 p.m.

Nice job! Thanks for b-value tags and ADC fix.

I've encountered another issue with ADC where the rescaling tags may have been used properly before. For example, compare V1 and V2 ADCs of 10847 and 11176.

Do the V2 ADC values look physically correct? They don't match up with otherwise reasonable-looking V1 ADC values.

If not, and with no effort required from PICAI organizers: One way to work around all ADC issues is to load the V1 ADC and do something like this:

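    import SimpleITK as sitk

    # adc is the V1 ADC image, already loaded as a SimpleITK image
    npAdc = sitk.GetArrayViewFromImage(adc)
    if npAdc.max() > 1000:
        pass  # values already look like ordinary ADC values
    elif npAdc.max() > 1:
        print("Weirdo ADC!")
        npAdc = 1000 * npAdc  # Guessing scalar factor is 1000 in weird cases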

Re: Public Training and Development Dataset: Updates and Fixes  

  By: joeran.bosma on June 15, 2022, 7:18 p.m.

Hi Nathan,

Thanks for checking the ADC values once more! I've checked the two cases (10847_1000863 and 11176_1001199), which are both from a Philips Ingenia scanner. The rescale slope for both cases was ~0.53 (as seen from the intensity ratios between the v1 and v2 scans), meaning that both versions are relatively close to each other.
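A rough way to estimate such a slope from two versions of the same scan is to take the median of the voxel-wise intensity ratios. The sketch below assumes v1 ≈ slope × v2 (consistent with v1 intensities being the smaller ones) and uses made-up voxel values; the function name and the numbers are illustrative, not measurements from these cases.

```python
from statistics import median


def estimate_scale_factor(values_a, values_b, min_value=1e-6):
    """Estimate k such that values_a is roughly k * values_b (median ratio)."""
    ratios = [a / b for a, b in zip(values_a, values_b) if abs(b) > min_value]
    if not ratios:
        raise ValueError("no voxels with a usable denominator")
    return median(ratios)


# Made-up, flattened voxel values from the same positions in both versions.
v2_voxels = [1000.0, 1500.0, 2000.0, 0.0]
v1_voxels = [530.0, 795.0, 1060.0, 0.0]

slope = estimate_scale_factor(v1_voxels, v2_voxels)
```

The median is used instead of the mean so that a few background or noisy voxels do not dominate the estimate.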

To me (not a radiologist), the ADC values in version 2 seem more typical. The ADC value (within the prostate) is typically above 850, while the ADC of 10847_1000863 (v1) is below this threshold for large parts of the prostate. For 11176_1001199 (v1) the ADC is also very low for some parts of the prostate (< 500). Meanwhile, neither case has csPCa.

Below I have plotted the mean intensity values of the ADC scans within the prostate gland (using these prostate segmentations). Based on this, the rescaled ADC intensities seem more reasonable in v2 than in v1 as well.

Therefore, I think it is safe to say the intensity values are correct in version 2 of the dataset (luckily).

P.S.: I did observe rescale slopes of both ~1/2 and ~1/2000, which may have motivated your choice for rescaling small intensity values with a factor of 1000?

Hope this helps, Joeran

Re: Public Training and Development Dataset: Updates and Fixes  

  By: enslay on June 15, 2022, 8 p.m.

Hi Joeran, Thanks for double checking! Those rescale values are even weirder than I imagined. I thought Philips was just doing something simple like dividing by 1000 for some of these weird ADC images. I don't know why some scanners modify the ADC like this, but I've seen it happen before.