Hi Z. Huang,
Great to hear most of the steps went smoothly!
Our baseline solutions use the three axial sequences, which are available for all cases:
1. Axial T2-weighted imaging (T2W): [patient_id]_[study_id]_t2w.mha
2. Axial computed high b-value (≥ 1000 s/mm²) diffusion-weighted imaging (DWI): [patient_id]_[study_id]_hbv.mha
3. Axial apparent diffusion coefficient maps (ADC): [patient_id]_[study_id]_adc.mha
After resampling the high b-value and apparent diffusion coefficient maps to the spatial resolution of the axial T2-weighted image, each voxel corresponds to the same physical part of the prostate/body. This means the images line up when they are concatenated and input to an architecture like the U-Net (up to registration errors). As such, each of the "matrix elements" that make up the network's input contains the intensity values of the three different sequences at the same physical location. This is comparable to how the red, green and blue channels of a photo produce a color image, as long as they are not shifted with respect to each other.
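For illustration, here is a minimal sketch of this alignment step using SimpleITK, with a hypothetical case ID 10000_1000000 (in practice, picai_prep performs this preprocessing for you):

```python
import SimpleITK as sitk
import numpy as np

# Hypothetical case ID for illustration; picai_prep handles this step for you.
t2w = sitk.ReadImage("10000_1000000_t2w.mha")
adc = sitk.ReadImage("10000_1000000_adc.mha")
hbv = sitk.ReadImage("10000_1000000_hbv.mha")

# Resample the ADC and high b-value scans onto the axial T2W grid
# (same size, spacing, origin and direction).
adc_on_t2w = sitk.Resample(adc, t2w, sitk.Transform(), sitk.sitkLinear)
hbv_on_t2w = sitk.Resample(hbv, t2w, sitk.Transform(), sitk.sitkLinear)

# After resampling, index (z, y, x) refers to the same physical location
# in all three arrays, so they can be stacked like RGB channels.
x = np.stack([
    sitk.GetArrayFromImage(t2w),
    sitk.GetArrayFromImage(adc_on_t2w),
    sitk.GetArrayFromImage(hbv_on_t2w),
], axis=0)  # shape: (3, depth, height, width)
```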
The remaining two sequences (which are available for most cases, but not all) have a different physical orientation from the three above, and are not used by our baseline solution:
4. Sagittal T2-weighted imaging: [patient_id]_[study_id]_sag.mha
5. Coronal T2-weighted imaging: [patient_id]_[study_id]_cor.mha
Simply concatenating these images with the axial images would mean that the intensity values in a given "matrix element" do not correspond to the same physical location in each of the five sequences. As such, including these differently oriented sequences most likely requires a custom architecture, which we did not use for the baseline algorithms. If you would like to learn more about this, you can check out, for example:
Taiping Qu, et al., "M3Net: A multi-scale multi-view framework for multi-phase pancreas segmentation based on cross-phase non-local attention", Medical Image Analysis
The choice to include the three axial sequences is made implicitly in prepare_data.py#L73: generate_mha2nnunet_settings(...). This function is defined within the picai_prep repository. Multi-view preprocessing (so axial, sagittal and coronal) is not supported by picai_prep, because there is no clear "best" way to do this. We leave it to the participants to implement this, in conjunction with the architectural choices that come with it.
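To sketch what this selection amounts to: each case entry in the generated settings points to the three axial scans. The structure below is simplified and uses a hypothetical case ID; see the picai_prep repository for the authoritative mha2nnunet settings schema:

```python
# Illustrative only: per-case entry selecting the three axial sequences
# (simplified; consult picai_prep for the exact settings format).
case_entry = {
    "patient_id": "10000",              # hypothetical case
    "study_id": "1000000",
    "scan_paths": [
        "10000/10000_1000000_t2w.mha",  # axial T2W       -> channel 0
        "10000/10000_1000000_adc.mha",  # axial ADC       -> channel 1
        "10000/10000_1000000_hbv.mha",  # axial high b-value -> channel 2
    ],
    "annotation_path": "10000_1000000.nii.gz",
}
```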
The choice for {'0': 'T2W', '1': 'CT', '2': 'HBV'} in the dataset.json does the following:
- For the first sequence (which is the axial T2-weighted scan), use instance-wise z-score normalisation
- For the second sequence (which is the axial ADC scan), use dataset-wise z-score normalisation
- For the third sequence (which is the axial high b-value scan), use instance-wise z-score normalisation
nnU-Net uses CT to indicate that a sequence should be normalised with dataset-wise mean and variance, rather than instance-wise mean and variance. We did this because the values in an ADC scan are diagnostically relevant. Please see the nnU-Net paper for more details, including their choice to use foreground pixels to determine dataset-wise normalisation statistics, and the 0.5% and 99.5% percentiles to clip values when performing dataset-wise z-score normalisation.
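As a schematic of the difference between the two schemes (this is not nnU-Net's actual implementation; nnU-Net computes the dataset-wide statistics and percentiles itself during preprocessing):

```python
import numpy as np

def instance_zscore(scan: np.ndarray) -> np.ndarray:
    # Instance-wise z-score (T2W and HBV): statistics from this scan only.
    return (scan - scan.mean()) / (scan.std() + 1e-8)

def dataset_zscore(scan: np.ndarray, mean: float, std: float,
                   p005: float, p995: float) -> np.ndarray:
    # Dataset-wise "CT" normalisation (ADC): clip to the dataset's 0.5% and
    # 99.5% foreground percentiles, then z-score with dataset-wide statistics,
    # so that absolute ADC values remain comparable across cases.
    return (np.clip(scan, p005, p995) - mean) / std
```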
We got the same error as you when training the baseline algorithms. The error is caused by a single case which has a peculiar resampling effect. The case in question, 11475_1001499, is resampled from a voxel spacing of (0.56, 0.56, 3) to (0.5, 0.5, 3). During this resampling, the annotation somehow breaks into two components, as indicated by the error message (from the conversion log):
AssertionError: Label has changed due to resampling/other errors for 11475_1001499! Have 1 -> 2 isolated ground truth lesions
When resampling, picai_prep checks the number of non-touching components, and throws an error when this number changes. We do not have a recommended way of handling this one case. Personally, we excluded this case when training the U-Net baseline.
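If you want to inspect a case yourself, counting non-touching components can be done with scipy, analogous to (though not identical to) the check in picai_prep:

```python
import numpy as np
from scipy import ndimage

def num_lesion_components(annotation: np.ndarray) -> int:
    # Count non-touching lesion components in an annotation volume.
    _, num_components = ndimage.label(annotation > 0)
    return num_components

# An error is raised when this count differs between the original and the
# resampled annotation, as for case 11475_1001499 (1 -> 2 components).
```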
Hope this helps,
Joeran
P.S.: we have now released the AI-derived annotations for all 1500 cases (see picai_labels). You can include those to leverage the remaining 205 cases with csPCa as well. Our cross-validation results were shaky, but we noticed substantial performance improvements over the supervised baseline models on the Open Development Phase - Validation and Tuning Leaderboard. Maybe these annotations can also solve the issue with case 11475_1001499.
You can use prepare_data_semi_supervised.py to prepare your data when including the AI-derived annotations. This script combines the human-expert and AI-derived annotations, and has an updated task name and paths. If you add if "11475_1001499" in fn: continue to prepare_data_semi_supervised.py#L79, you skip the human-expert annotation in favor of the AI-derived one.
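For context, the added line would sit inside the loop over annotation files, roughly like this (the surrounding loop and the names human_expert_annotation_filenames and convert_annotation are hypothetical; the actual code at prepare_data_semi_supervised.py#L79 may look different):

```python
# Hypothetical surrounding loop; only the added two lines matter.
for fn in human_expert_annotation_filenames:
    if "11475_1001499" in fn:
        continue  # skip the human-expert annotation; the AI-derived one is used instead
    convert_annotation(fn)  # placeholder for the script's actual processing
```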