Public Training and Development Dataset: Updates and Fixes

Re: Public Training and Development Dataset: Updates and Fixes  

  By: kreininy on March 18, 2025, 12:15 p.m.

I succeeded in unzipping 4069 .mha files. Thank you!

Re: Public Training and Development Dataset: Updates and Fixes  

  By: bogdanobreja on March 18, 2025, 12:36 p.m.

Great! Thanks for letting us know about these issues. Hopefully it will help other participants as well.

Kind Regards, Bogdan.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: sriramgs on April 2, 2025, 12:39 p.m.

Hi,

I have downloaded the dataset and tried to unzip it using 7z on Linux with the command below:

7z x /LUNA/luna25_images.zip.001

At the end of the extraction, I got 4096 files. I have not run similar commands for the other parts, 'zip.002' to 'zip.046'. Does 7z automatically detect the remaining parts and extract all the files?

Would you please confirm whether my approach for extracting zip files is correct?

Thanks, Sriram.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: drepeeters on April 2, 2025, 2:21 p.m.

Dear Sriram,

After extracting the luna25_images.zip.XXX files, the training dataset should contain 4069 CT scans in total. Since you don't report any error messages, I assume the extraction of the data was successful.

It's a bit difficult to tell whether there's a real mismatch in your case, or if there was a typo in the number you reported (4069 versus 4096). If you indeed have more CT scans than expected, I’d recommend checking for duplicate files and matching the filenames against the SeriesInstanceUID column in the provided .csv file to verify consistency.
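That check can be sketched in Python with the standard library; the image directory, CSV path, and column name below are assumptions to adapt to your local layout:

```python
import csv
from pathlib import Path

def check_scan_files(image_dir, csv_path, uid_column="SeriesInstanceUID"):
    """Compare extracted .mha filenames against the UIDs listed in the CSV.

    Returns (missing, extra): UIDs in the CSV with no matching file, and
    files whose name matches no UID.  Duplicate files would show up as
    'extra' entries (or as a file count above 4069).
    """
    with open(csv_path, newline="") as f:
        uids = {row[uid_column] for row in csv.DictReader(f)}
    # Filename stem (name without the .mha extension) is assumed to be the UID.
    files = {p.stem for p in Path(image_dir).glob("*.mha")}
    return sorted(uids - files), sorted(files - uids)
```

If both returned lists are empty and the file count is 4069, the extraction is consistent with the CSV.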

Please keep us updated on your issue and let me know if you need further assistance.

Kind regards,
Dre

Re: Public Training and Development Dataset: Updates and Fixes  

  By: sriramgs on April 3, 2025, 11:59 a.m.

Dear Dre,

It was indeed a typo, and there are 4069 images. Sorry for the confusion.

I can also see a separate folder, 'luna25_nodule_blocks', which contains 'image' and 'metadata' folders, each holding 6163 .npy files. Could you please tell me what these files represent?

Thanks, Sriram.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: bogdanobreja on April 3, 2025, 1:52 p.m.

Hi Sriram,

For this challenge, we offer two approaches to train your algorithms for the nodule malignancy risk estimation task:

  1. Full CT Approach (MHA format): Train your algorithm using the complete CT scans as input.
  2. Nodule Blocks Approach: Train your algorithm using pre-cropped nodule blocks. These blocks are specifically centered around nodules annotated by our radiologists, extracted directly from the full CT scans. This method is particularly beneficial when GPU resources are limited.

We provide baseline code (available at https://github.com/DIAGNijmegen/luna25-baseline-public ) as a helpful starting point. The baseline implementation uses the provided nodule blocks for training two AI algorithms.

Kind Regards, Bogdan Obreja.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: drepeeters on April 3, 2025, 2:04 p.m.

Dear Sriram,

Great that you were able to successfully extract the data. I can provide some additional details alongside Bogdan's response.

The luna25_nodule_blocks folder indeed contains the image and metadata folders.

  • The image folder contains .npy files, each representing a 3D volume with shape (64, 128, 128) in (z, y, x) order. These were extracted from the original CT scans and centered on a specific nodule.

  • The metadata folder contains .npy files, each holding a dictionary with the origin, spacing, and transform from the original CT scan (.mha) file.

You can use both files together with our baseline script to start training the baseline algorithm. If you want to see the scripts that were used to extract the files, take a look here: https://github.com/DIAGNijmegen/luna25-baseline-public/tree/main/preprocessing
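Loading one block and its metadata might look like the sketch below; the exact filenames follow the dataset's own naming, and the metadata keys are assumptions based on the description above:

```python
import numpy as np

def load_nodule_block(image_path, metadata_path):
    """Load one pre-cropped nodule block and its metadata dictionary.

    The image is a 3D array of shape (64, 128, 128) in (z, y, x) order.
    The metadata .npy stores a pickled dict; its keys (origin, spacing,
    transform of the source CT scan) are assumptions to verify against
    the files you extracted.
    """
    block = np.load(image_path)  # (z, y, x) volume
    # allow_pickle=True is required because the file stores a Python dict.
    meta = np.load(metadata_path, allow_pickle=True).item()
    return block, meta
```

The origin and spacing let you map voxel indices in the block back to world coordinates of the original scan.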

Hope this clarifies the utility of these files. Feel free to reach out with further questions.

Kind regards,
Dre

 Last edited by: drepeeters on April 3, 2025, 2:06 p.m., edited 1 time in total.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: sriramgs on April 3, 2025, 5:35 p.m.

Thank you very much, Dre and Bogdan.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: sriramgs on April 4, 2025, 12:57 p.m.

Dear Dre / Bogdan,

I went through the baseline GitHub repo. For inference, it is mentioned that the CT image, a nodule locations .json file, and a clinical information .json file are needed. We were provided with LUNA25_Public_Training_Development_Data.csv.

a) Please tell me whether we should create nodule locations.json and clinical information.json files for each study based on the training CSV file.

b) Also, there is no info for smoking_status, clinical_category, regional_nodes_category, and metastasis_category in the training CSV to be used in clinical information.json. Please tell me whether we should consider NULL for these categories by default.

c) I can see that the sample nodule locations.json provided in GitHub is of Multiple points type. Please tell me whether the single-point type too follows the same format.

Thanks, Sriram.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: drepeeters on April 4, 2025, 3:55 p.m.

Dear Sriram,

Indeed, the inference pipeline for the LUNA25 Challenge expects three types of input files:

  • The CT image: /input/images/chest-ct/<uid>.mha

  • A nodule coordinate file: /input/nodule-locations.json

  • A clinical information file: /input/clinical-information-lung-ct.json

You can find examples of each input type in the GitHub repository here: https://github.com/DIAGNijmegen/luna25-baseline-public/tree/main/test/input

These examples allow you to test your Docker container locally after training your algorithm. The Grand Challenge platform will provide inputs in exactly this structure during evaluation on the hidden testing data. Therefore, you do not need to generate these files from the training CSV.

The clinical-information-lung-ct.json does not include data for the other categories since we do not want to encourage the AI models to rely on such data.

The format of the nodule-locations.json will always contain "type": "Multiple points". This indicates that the file may contain coordinates for one or more nodules. If there is only one nodule in the scan, the format remains the same, but with only one set of coordinates in the list.
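Reading such a file could be sketched as follows; the exact schema is in the repo's test/input examples, and this sketch assumes a top-level "points" list where each entry holds a "point" [x, y, z] triple:

```python
import json

def read_nodule_points(path):
    """Read nodule coordinates from a 'Multiple points' JSON file.

    Works identically whether the list holds one nodule or several,
    since the type is always "Multiple points".
    """
    with open(path) as f:
        data = json.load(f)
    assert data["type"] == "Multiple points"
    return [entry["point"] for entry in data["points"]]
```

A file with a single nodule simply yields a one-element list of coordinates.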

To summarize, you do not need to create these files yourself. Your model only needs to read and process the inputs provided in our format during inference.

Kind regards,
Dre Peeters

 Last edited by: drepeeters on April 4, 2025, 5:59 p.m., edited 1 time in total.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: sriramgs on April 5, 2025, 8:20 a.m.

Thank you very much, Dre