Concerns about the reliability of dose evaluation metrics
By: Fanstan on Sept. 27, 2023, 1:09 p.m.
Dear organizers,
Thank you for organizing this challenge!
While preparing my poster, I went through the evaluation results and have concerns about the reliability of the dose evaluation metrics. As I understand it, the naive results submitted by user 'evihuijben' (organisers) serve as a reference to check that the evaluation process works, which is why they have a very high MAE of 332.9 (Task 1) and 344.3 (Task 2). However, I was quite surprised to find that, on the dose evaluation metrics, other teams' results are worse than this naive reference for some cases.
For task 1 (MRI-CT), the number of cases per team that are worse than the naive reference is listed below for each evaluation metric (top ten teams, in rank order):
- MAE: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
- PSNR: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
- SSIM: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
- dvh_photon: 12, 15, 12, 14, 15, 17, 17, 21, 26, 18. (Example: 12 cases from the 1st team, 15 cases from the 2nd team, ..., and 18 cases from the 10th team are worse than the naive results.)
- dvh_proton: 2, 3, 1, 1, 2, 1, 2, 5, 28, 2.
- gamma_photon: 33, 40, 38, 41, 38, 40, 39, 48, 32, 39.
- gamma_proton: 2, 1, 3, 1, 3, 0, 1, 0, 12, 1.
- dose_mae_photon: 7, 6, 9, 7, 7, 12, 7, 10, 25, 11.
- dose_mae_proton: 3, 0, 1, 2, 2, 2, 4, 0, 2, 1.
For task 2 (CBCT-CT), the number of cases per team that are worse than the naive reference is listed below for each evaluation metric (top ten teams, in rank order):
- MAE: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
- PSNR: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
- SSIM: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
- dvh_photon: 6, 5, 9, 7, 8, 7, 9, 14, 10, 12.
- dvh_proton: 3, 3, 3, 2, 3, 13, 7, 8, 13, 11.
- gamma_photon: 21, 24, 35, 33, 22, 19, 33, 34, 29, 27.
- gamma_proton: 2, 1, 8, 3, 2, 0, 4, 11, 3, 0.
- dose_mae_photon: 3, 3, 6, 6, 5, 2, 6, 16, 6, 15.
- dose_mae_proton: 1, 0, 4, 3, 0, 0, 3, 6, 2, 2.
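For anyone who wants to reproduce these counts, the comparison can be sketched roughly as below. This is only a minimal sketch, not the challenge's evaluation code: the function name, the dict-based input format, and the metric directions (higher is better for PSNR, SSIM and gamma pass rate; lower is better for everything else) are my assumptions, and the values in the example are made up.

```python
# Metrics where a HIGHER value is assumed to be better (assumption, not
# confirmed by the organisers). All other metrics are treated as
# lower-is-better (MAE, dvh_*, dose_mae_*).
HIGHER_IS_BETTER = {"psnr", "ssim", "gamma_photon", "gamma_proton"}


def count_worse_than_naive(team_scores, naive_scores, metric):
    """Count cases where the team scores strictly worse than the naive reference.

    team_scores, naive_scores: dicts mapping case ID -> metric value
    (hypothetical input format). Only cases present in both are compared.
    """
    higher_is_better = metric.lower() in HIGHER_IS_BETTER
    worse = 0
    for case_id, team_value in team_scores.items():
        if case_id not in naive_scores:
            continue  # skip cases without a naive reference value
        naive_value = naive_scores[case_id]
        if higher_is_better:
            worse += team_value < naive_value
        else:
            worse += team_value > naive_value
    return worse


# Made-up values, for illustration only.
team = {"case_a": 0.95, "case_b": 0.80, "case_c": 0.99}
naive = {"case_a": 0.90, "case_b": 0.85, "case_c": 0.97}

print(count_worse_than_naive(team, naive, "gamma_photon"))     # higher is better
print(count_worse_than_naive(team, naive, "dose_mae_photon"))  # lower is better
```

Ties are not counted as worse here; whether they should be is a design choice that would slightly change the numbers.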
In particular, for the evaluation metric 'gamma_photon', more than 20% of the sCT images predicted by the top 10 groups perform worse than the naive results. If this is really the case, there is little point in improving the image quality of the sCT; one could simply generate homogeneous naive CT volumes from binary masks for dose planning.
In addition, 'abnormal' values occur for some cases in the evaluation results. For example, in task 1, cases '84' and '113' have a much higher 'dvh_proton' across teams. The naive predictions even have a 'dvh_proton' value of more than 66 in task 1, which is about 100 times higher than normal results. In task 2, case '110' has an abnormal 'dvh_photon' value for all groups. My assumption is that the matRad framework might not generalize well to all CT volumes. The presence of metal, or masks that cause discontinuities in the CT, are potential reasons for these case failures.
I hope my observations can help with the data analysis for your upcoming journal paper. I am also open to further discussion.
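One simple way to screen for such 'abnormal' cases is to flag any case whose metric value is far above the median over all cases. The sketch below is only an illustration under my own assumptions: the function name, the factor of 10, and the example values are hypothetical and not taken from the evaluation results, and it only makes sense for lower-is-better metrics with positive values.

```python
from statistics import median


def flag_abnormal_cases(values_by_case, factor=10.0):
    """Return sorted case IDs whose value exceeds `factor` times the median.

    values_by_case: dict mapping case ID -> metric value (hypothetical format).
    The threshold `factor` is an arbitrary choice for illustration.
    """
    med = median(values_by_case.values())
    if med <= 0:
        return []  # the ratio test is not meaningful for a non-positive median
    return sorted(c for c, v in values_by_case.items() if v > factor * med)


# Made-up values, for illustration only.
dvh_values = {"case_a": 0.4, "case_b": 0.5, "case_c": 0.6, "case_d": 50.0}
print(flag_abnormal_cases(dvh_values))
```

Using the median rather than the mean keeps the threshold from being pulled up by the very outliers it is meant to detect.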
Best regards,
FAYIU Group