Re: Public Training and Development Dataset: Updates and Fixes
By: kreininy on March 18, 2025, 12:15 p.m.
I succeeded in unzipping 4069 .mha files. Thank you.
By: bogdanobreja on March 18, 2025, 12:36 p.m.
Great! Thanks for letting us know about these issues. Hopefully it will help other participants as well.
Kind Regards, Bogdan.
By: sriramgs on April 2, 2025, 12:39 p.m.
Hi,
I have downloaded the dataset and tried to unzip it with 7z on Linux using the command below:
7z x /LUNA/luna25_images.zip.001
At the end of the extraction, I got 4096 files. I have not run similar commands for the other parts, 'zip.002' to 'zip.046'. Does 7z automatically identify the remaining parts and extract all the files?
Would you please confirm whether my approach for extracting zip files is correct?
Thanks, Sriram.
By: drepeeters on April 2, 2025, 2:21 p.m.
Dear Sriram,
After extracting the luna25_images.zip.XXX files, the training dataset should contain 4069 CT scans in total.
Since you don't report any error messages, I assume the extraction of the data was successful.
It's a bit difficult to tell whether there's a real mismatch in your case, or if there was a typo in the number you reported (4069 versus 4096). If you indeed have more CT scans than expected, I’d recommend checking for duplicate files and matching the filenames against the SeriesInstanceUID column in the provided .csv file to verify consistency.
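The check suggested above could be sketched in Python roughly as follows. This is only a sketch under assumptions: it assumes each scan is stored as <SeriesInstanceUID>.mha, and the function name compare_scans_to_csv and its arguments are hypothetical, not part of the challenge code.

```python
import csv
from collections import Counter
from pathlib import Path


def compare_scans_to_csv(image_dir, csv_path, column="SeriesInstanceUID"):
    """Compare .mha file stems under image_dir with the UIDs listed in the CSV.

    Assumption: every scan is named <SeriesInstanceUID>.mha.
    """
    # Path.stem drops only the final ".mha", so dotted UIDs stay intact.
    stems = [p.stem for p in Path(image_dir).rglob("*.mha")]

    # A stem that appears more than once (e.g. in different subfolders)
    # points to a duplicated scan.
    duplicates = sorted(s for s, n in Counter(stems).items() if n > 1)

    # The CSV may list the same series on several rows, so collect a set.
    with open(csv_path, newline="") as f:
        expected = {row[column] for row in csv.DictReader(f)}

    on_disk = set(stems)
    return {
        "n_files": len(stems),
        "duplicates": duplicates,
        "missing_from_disk": sorted(expected - on_disk),
        "not_in_csv": sorted(on_disk - expected),
    }
```

Calling compare_scans_to_csv with the extraction folder and the provided .csv file returns the file count plus three lists; with a clean extraction, n_files should be 4069 and all three lists should be empty.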
Please keep us updated on your issue and let me know if you need further assistance.
Kind regards,
Dre
By: sriramgs on April 3, 2025, 11:59 a.m.
Dear Dre,
It was indeed a typo, and there are 4069 images. Sorry for the confusion.
I can also see a separate folder, 'luna25_nodule_blocks', which contains image and metadata folders, each with 6163 .npy files. Would you please tell me what these files represent?
Thanks, Sriram.
By: bogdanobreja on April 3, 2025, 1:52 p.m.
Hi Sriram,
For this challenge, we offer two approaches to train your algorithms for the nodule malignancy risk estimation task:
We provide baseline code (available at https://github.com/DIAGNijmegen/luna25-baseline-public ) as a helpful starting point. The baseline implementation uses the provided nodule blocks for training two AI algorithms.
Kind Regards, Bogdan Obreja.
By: drepeeters on April 3, 2025, 2:04 p.m.
Dear Sriram,
Great that you were able to successfully extract the data. I can add some details to Bogdan's response.
The luna25_nodule_blocks folder indeed contains the image and metadata folders.
The image folder contains .npy files, each representing a 3D volume with shape (64, 128, 128) in (z, y, x) order. These were extracted from the original CT scans and centered on a specific nodule.
The metadata folder contains .npy files, each holding a dictionary with the origin, spacing, and transform from the original CT scan (.mha) file.
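As a rough illustration of reading one such pair of files with NumPy (a sketch, not the baseline's own loader): it assumes the image and metadata files for a nodule share the same file name, and the helper name load_nodule_block and the block_id argument are hypothetical. A dictionary saved in a .npy file is stored as a pickled object, so it needs allow_pickle=True, which should only be used on files from a trusted source.

```python
from pathlib import Path

import numpy as np


def load_nodule_block(block_dir, block_id):
    """Load one nodule block and its metadata dictionary.

    Assumption: image/<block_id>.npy and metadata/<block_id>.npy belong together.
    """
    block_dir = Path(block_dir)

    # The image is a plain numeric array, expected shape (64, 128, 128) in (z, y, x).
    image = np.load(block_dir / "image" / f"{block_id}.npy")

    # A dict saved with np.save comes back as a 0-d object array;
    # .item() unwraps it into the original dictionary.
    metadata = np.load(
        block_dir / "metadata" / f"{block_id}.npy", allow_pickle=True
    ).item()

    return image, metadata
```

The returned metadata dictionary should then expose the origin, spacing, and transform entries described above.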
You can use both files together with our baseline script to start training our baseline algorithm. If you want to see the scripts that were used to extract the files, take a look here: https://github.com/DIAGNijmegen/luna25-baseline-public/tree/main/preprocessing
Hope this clarifies the utility of these files. Feel free to reach out with further questions.
Kind regards,
Dre
By: sriramgs on April 4, 2025, 12:57 p.m.
Dear Dre / Bogdan,
I went through the baseline GitHub repo. For inference, it is mentioned that the CT image, nodule locations.json, and clinical information.json files are needed. We were provided with LUNA25_Public_Training_Development_Data.csv.
a) Please tell me whether we should create nodule locations.json and clinical information.json files for each study based on the training CSV file.
b) Also, there is no info for smoking_status, clinical_category, regional_nodes_category, and metastasis_category in the training CSV to be used in clinical information.json. Please tell me whether we should consider NULL for these categories by default.
c) I can see that the sample nodule locations.json provided on GitHub is of the Multiple points type. Please tell me whether the single-point type follows the same format too.
Thanks, Sriram.
By: drepeeters on April 4, 2025, 3:55 p.m.
Dear Sriram,
Indeed, the inference pipeline for the LUNA25 Challenge expects three types of input files:
The CT image: /input/images/chest-ct/<uid>.mha
A nodule coordinate file: /input/nodule-locations.json
A clinical information file: /input/clinical-information-lung-ct.json
You can find examples of each input type in the GitHub repository here: https://github.com/DIAGNijmegen/luna25-baseline-public/tree/main/test/input
These examples allow you to test your Docker container locally after training your algorithm. The Grand Challenge platform will provide inputs in exactly this structure during evaluation on the hidden testing data. Therefore, you do not need to generate these files from the training CSV.
The clinical-information-lung-ct.json does not include data for the other categories since we do not want to encourage the AI models to rely on such data.
The nodule-locations.json file will always contain "type": "Multiple points". This indicates that the file may contain coordinates for one or more nodules. If there is only one nodule in the scan, the format remains the same, but with only one set of coordinates in the list.
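A reader for this file could look roughly like the sketch below, which treats one nodule and several nodules identically. Only the "type": "Multiple points" field is confirmed above; the "points" list and the per-entry "point" key are assumptions about the layout and should be checked against the example in the repository's test/input folder. The function name read_nodule_points is hypothetical.

```python
import json


def read_nodule_points(path):
    """Return a list of coordinate tuples from a nodule-locations.json file.

    Assumed layout (not confirmed in this thread):
    {"type": "Multiple points", "points": [{"point": [x, y, z]}, ...]}
    """
    with open(path) as f:
        data = json.load(f)

    # The file is expected to use the same type for one or many nodules.
    if data.get("type") != "Multiple points":
        raise ValueError(f"unexpected type: {data.get('type')!r}")

    # One entry per nodule; a single-nodule scan yields a list of length 1.
    return [tuple(entry["point"]) for entry in data["points"]]
```

Because the result is always a list, the downstream code can loop over it without a special case for scans with a single nodule.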
To summarize, you do not need to create these files yourself. Your model only needs to read and process the inputs provided in our format during inference.
Kind regards,
Dre Peeters