Public Training and Development Dataset: Updates and Fixes

  By: drepeeters on Jan. 15, 2025, 10:50 a.m.

The LUNA25: Public Training and Development dataset, consisting of over 6000 cases, is now online! ... Please monitor this thread for all updates and fixes regarding this dataset.

The example below shows how to download all the MHA CT images and the nodule blocks from Zenodo using Python. To get a Zenodo token, you must create an account on their website. The full dataset consists of 46 luna25_images.zip files and 2 luna25_nodule_blocks.zip files.

import os
import requests
ACCESS_TOKEN = "YOUR ZENODO TOKEN"
record_id = "14223624" #LUNA25 record id

# Specify the output folder where files will be saved
output_folder = "YOUR-OUTPUT_PATH"
os.makedirs(output_folder, exist_ok=True)

# Get the metadata of the Zenodo record
r = requests.get(f"https://zenodo.org/api/records/{record_id}", params={'access_token': ACCESS_TOKEN})

if r.status_code != 200:
    print("Error retrieving record:", r.status_code, r.text)
    exit()

# Extract download URLs and filenames
download_urls = [f['links']['self'] for f in r.json()['files']]
filenames = [f['key'] for f in r.json()['files']]

print(f"Total files to download: {len(download_urls)}")

# Download each file (count from 1 so the progress message reads 1/N to N/N)
for index, (filename, url) in enumerate(zip(filenames, download_urls), start=1):
    file_path = os.path.join(output_folder, filename)

    print(f"Downloading file {index}/{len(download_urls)}: {filename} -> {file_path}")

    with requests.get(url, params={'access_token': ACCESS_TOKEN}, stream=True) as r:
        r.raise_for_status()  # Raise an error for failed requests
        with open(file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):  # Download in chunks
                f.write(chunk)

    print(f"Completed: {filename}")

print("All downloads completed successfully!")
 Last edited by: bogdanobreja on March 18, 2025, 10:26 a.m., edited 7 times in total.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: kreininy on March 13, 2025, 1:11 p.m.

I downloaded the zip file(s), but neither Arc nor Xarchiver recognizes them as zip files. Do you have any recommendation for Ubuntu?

Re: Public Training and Development Dataset: Updates and Fixes  

  By: drepeeters on March 14, 2025, 10:34 a.m.

The zip files were created as multipart zip files using 7-Zip. You can download the software here.

Open the 7-Zip application and navigate to the directory that contains the downloaded files. There, select the files (e.g. luna25_nodule_blocks.zip.001 and luna25_nodule_blocks.zip.002), right-click, and select "Extract files".
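
For setups without the 7-Zip GUI, the same idea can be sketched in Python. This is only a sketch, assuming the .001, .002, ... parts are plain byte-wise slices of one archive, so that concatenating them in numeric order gives back the original zip file. The function name `join_parts` is hypothetical, and the joined file needs as much free disk space as all parts together.

```python
import glob
import os
import shutil


def join_parts(folder, base_name, output_path):
    """Concatenate base_name.001, base_name.002, ... into a single file.

    Assumption: the parts are raw byte slices of one archive.
    """
    # Collect every file named <base_name>.<three digits> in the folder.
    pattern = os.path.join(glob.escape(folder), glob.escape(base_name) + ".[0-9][0-9][0-9]")
    parts = glob.glob(pattern)
    # Sort by the numeric suffix so the bytes are written in the right order.
    parts.sort(key=lambda p: int(p.rsplit(".", 1)[1]))
    if not parts:
        raise FileNotFoundError(f"No parts found for {base_name} in {folder}")

    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in 1 MiB chunks to keep memory use low.
                shutil.copyfileobj(src, out, 1024 * 1024)
    return parts
```

For example, `join_parts("YOUR-OUTPUT_PATH", "luna25_nodule_blocks.zip", "luna25_nodule_blocks.zip")` would write one joined file that a regular zip tool can then try to open. If a part is missing, the joined file will be truncated or corrupt, so it is worth checking that all parts are present first.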

Please let me know if you need additional help.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: kreininy on March 15, 2025, 6:54 p.m.

I am trying to unzip files on Linux (Ubuntu). That's what I get:

7zzs e luna25_images.zip.001 -o./LUNA-2025

7-Zip (z) 24.09 (x64) : Copyright (c) 1999-2024 Igor Pavlov : 2024-11-28
 64-bit locale=en_CA.UTF-8 Threads:32 OPEN_MAX:1024, ASM

Scanning the drive for archives:
1 file, 4697620480 bytes (4480 MiB)

Extracting archive: luna25_images.zip.001

luna25_images.zip
ERRORS:
Unexpected end of archive

--
Path = luna25_images.zip.001
Type = Split
Physical Size = 4697620480
Volumes = 1
Total Physical Size = 4697620480

Path = luna25_images.zip
Size = 4697620480
--
Path = luna25_images.zip
Type = zip
ERRORS:
Unexpected end of archive
Physical Size = 4697620480
Characteristics = Local

ERROR: CRC Failed : luna25_images/1.2.840.113654.2.55.111642424320700321661650655704628718201.mha

Sub items Errors: 1

Archives with Errors: 1

Open Errors: 1

Sub items Errors: 1

Re: Public Training and Development Dataset: Updates and Fixes  

  By: kreininy on March 15, 2025, 7:08 p.m.

For the rest of the files the situation is even worse:

7zzs e luna25_images.zip.003 -o./LUNA-2025

7-Zip (z) 24.09 (x64) : Copyright (c) 1999-2024 Igor Pavlov : 2024-11-28
 64-bit locale=en_CA.UTF-8 Threads:32 OPEN_MAX:1024, ASM

Scanning the drive for archives:
1 file, 4565820344 bytes (4355 MiB)

Extracting archive: luna25_images.zip.003
ERROR: luna25_images.zip.003
Cannot open the file as archive


Can't open as archive: 1
Files: 0
Size:       0
Compressed: 0

Perhaps something is going wrong with the download?

Re: Public Training and Development Dataset: Updates and Fixes  

  By: kreininy on March 16, 2025, 9:23 p.m.

Well, I tried to download and unzip the files on Windows instead of Linux. Unfortunately, the outcome was the same. Most of the files are not recognised by 7-Zip as archives. Some files (like luna25_images.zip.001) start decompressing but report an error afterwards.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: bogdanobreja on March 17, 2025, 11:25 a.m.

Hi kreininy,

Thanks for your comment.

On Windows, I was able to extract all 4069 MHA CTs. I tested this by opening 7-Zip and navigating to the folder containing the LUNA25 image files ('.zip.001', '.zip.002', etc.). I then selected all of the luna25 image files and clicked "Extract". Make sure to select all of them; selecting only a few may cause errors. Then, I specified the destination folder for extraction (see image below).
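
Before starting the extraction, a quick check for missing parts can save a few hours. The sketch below assumes the parts are numbered consecutively from .001 (as the file names in this thread suggest) and that there are 46 image parts in total. The function name `missing_parts` is hypothetical.

```python
import os
import re


def missing_parts(filenames, base_name, expected_count):
    """Return the part numbers of base_name.NNN that are absent from filenames.

    Assumption: parts are numbered 001..expected_count without gaps.
    """
    pattern = re.compile(re.escape(base_name) + r"\.(\d{3})")
    present = set()
    for name in filenames:
        match = pattern.fullmatch(name)
        if match:
            present.add(int(match.group(1)))
    return [n for n in range(1, expected_count + 1) if n not in present]


# Example usage (the folder path is a placeholder):
# print(missing_parts(os.listdir("YOUR-OUTPUT_PATH"), "luna25_images.zip", 46))
```

An empty list means every expected part number was found; any numbers returned are parts that still need to be downloaded.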

In the meantime, we are also downloading the Zenodo dataset to ensure there are no issues. We will keep you posted!

Kind Regards, Bogdan.

 Last edited by: bogdanobreja on March 17, 2025, 11:26 a.m., edited 1 time in total.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: kreininy on March 17, 2025, 5:41 p.m.

Thank you for your assistance. Hmm, it hasn't worked for me so far. :( I also tried to us and received an archive of size 63.4 GiB, which sounds a bit too small. Indeed, it contains only a few of the .zip.nnn files, not all of them:

I'll try to download all the files separately and then extract them in one go. Perhaps it will work, but I have my doubts.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: bogdanobreja on March 17, 2025, 5:59 p.m.

Hi kreininy,

Indeed, I believe that you are missing some of the image .zip files. We're also in the process of unzipping the full downloaded Zenodo dataset and will follow up with you tomorrow.

FYI, a good starting point for developing algorithms is the nodule blocks (luna25_nodule_blocks.zip.001, luna25_nodule_blocks.zip.002), since the baseline code we provide relies solely on the nodule blocks and the dataset CSV. You can find the baseline code here: https://github.com/DIAGNijmegen/luna25-baseline-public

Kind Regards, Bogdan.

 Last edited by: bogdanobreja on March 17, 2025, 6:05 p.m., edited 4 times in total.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: kreininy on March 18, 2025, 4:55 a.m.

I downloaded all the files and tried to decompress them as recommended. This is what I have after 3 hours; the process is not done yet.

luna25_images contains 3580 files so far. How many files are supposed to be in the folder?

 Last edited by: kreininy on March 18, 2025, 5:01 a.m., edited 1 time in total.
Reason: fix type error

Re: Public Training and Development Dataset: Updates and Fixes  

  By: bogdanobreja on March 18, 2025, 6:43 a.m.

Hi kreininy,

There are 4069 .MHA CTs in total.
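
To compare your folder against that number, a short sketch like the one below can count the extracted files. The function name `count_mha_files` and the folder path are placeholders; the count of 4069 is the only figure taken from this thread.

```python
from pathlib import Path


def count_mha_files(folder):
    """Count files ending in .mha (case-insensitive) under folder, including subfolders."""
    return sum(
        1
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() == ".mha"
    )


# Example usage (the folder path is a placeholder):
# print(count_mha_files("LUNA-2025/luna25_images"), "of 4069 expected")
```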

Kind Regards, Bogdan.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: drepeeters on March 18, 2025, 7:29 a.m.

Hi kreininy,

I have used this Python script to download all files from Zenodo:

import os
import requests
ACCESS_TOKEN = "YOUR ZENODO TOKEN"
record_id = "14223624" #LUNA25 record id

# Specify the output folder where files will be saved
output_folder = "YOUR-OUTPUT_PATH"
os.makedirs(output_folder, exist_ok=True)

# Get the metadata of the Zenodo record
r = requests.get(f"https://zenodo.org/api/records/{record_id}", params={'access_token': ACCESS_TOKEN})

if r.status_code != 200:
    print("Error retrieving record:", r.status_code, r.text)
    exit()

# Extract download URLs and filenames
download_urls = [f['links']['self'] for f in r.json()['files']]
filenames = [f['key'] for f in r.json()['files']]

print(f"Total files to download: {len(download_urls)}")

# Download each file (count from 1 so the progress message reads 1/N to N/N)
for index, (filename, url) in enumerate(zip(filenames, download_urls), start=1):
    file_path = os.path.join(output_folder, filename)

    print(f"Downloading file {index}/{len(download_urls)}: {filename} -> {file_path}")

    with requests.get(url, params={'access_token': ACCESS_TOKEN}, stream=True) as r:
        r.raise_for_status()  # Raise an error for failed requests
        with open(file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):  # Download in chunks
                f.write(chunk)

    print(f"Completed: {filename}")

print("All downloads completed successfully!")

The full dataset consists of 46 luna25_images.zip files and 2 luna25_nodule_blocks.zip files. However, from your screenshot, it appears that you only have 41 luna25_images.zip files.

This suggests that some files are missing or were not downloaded correctly, which could be causing the "unexpected end of data" error you are encountering. Could you verify your download and ensure that all required files are present?
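
One way to verify the download is to compare the local files against the record metadata. The sketch below is not an official tool; it assumes each entry in the record's `files` list carries a `key`, a `size` in bytes, and a `checksum` of the form `md5:<hex digest>`. The function names `md5_of` and `check_download` are hypothetical. The MD5 step is optional because hashing this much data takes a while.

```python
import hashlib
import os


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(folder, files_meta, verify_md5=False):
    """Return {filename: problem} for entries that are missing or differ locally.

    Assumption: each entry has 'key', 'size' and a 'checksum' like 'md5:<hex>'.
    """
    problems = {}
    for entry in files_meta:
        name = entry["key"]
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            problems[name] = "missing"
        elif os.path.getsize(path) != entry["size"]:
            problems[name] = "size mismatch"
        elif verify_md5 and entry.get("checksum", "").startswith("md5:"):
            if md5_of(path) != entry["checksum"][len("md5:"):]:
                problems[name] = "checksum mismatch"
    return problems
```

With the record metadata from the download script stored in a variable, for example `files_meta = requests.get(...).json()['files']`, calling `check_download("YOUR-OUTPUT_PATH", files_meta)` returns an empty dictionary when every file is present with the expected size.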

As Bogdan mentioned in a previous post, we are currently going through this process ourselves to ensure there are no issues on the data side. So far, we have not encountered any problems with data extraction, but the process is still ongoing. We will keep you updated as we complete our verification.

Kind regards,
Dre Peeters

 Last edited by: bogdanobreja on March 18, 2025, 10:26 a.m., edited 3 times in total.

Re: Public Training and Development Dataset: Updates and Fixes  

  By: drepeeters on March 18, 2025, 10:11 a.m.

Hi kreininy,

We were able to successfully download and extract our data from Zenodo, using the Python script for the download and 7-Zip for the extraction.

If you are still encountering any issues, please don’t hesitate to reach out. We are happy to help.

Kind regards,
Dre Peeters

Re: Public Training and Development Dataset: Updates and Fixes  

  By: kreininy on March 18, 2025, 10:13 a.m.

Thank you. That's the final screenshot. 3743 files were decompressed.

Yes, you are right, it looks like I missed 5 files. So, I'll try to download them and then repeat the decompression.

BTW, how do I get a Zenodo access token if I decide to use your code to automate the download?

Re: Public Training and Development Dataset: Updates and Fixes  

  By: bogdanobreja on March 18, 2025, 10:16 a.m.

Hi kreininy,

You must create an account on Zenodo. Afterwards, you will see that your account has a unique Zenodo token that you can use.

Kind Regards, Bogdan.

 Last edited by: bogdanobreja on March 18, 2025, 10:18 a.m., edited 1 time in total.