Concerns about the reliability of dose evaluation metrics

  By: Fanstan on Sept. 27, 2023, 1:09 p.m.

Dear organizers,

Thank you for organizing this challenge!

While preparing my poster, I went through the evaluation results and have concerns about the reliability of the dose evaluation metrics. As I understand it, the naive results provided by user 'evihuijben' (organisers) serve as a reference to test whether the evaluation process runs correctly. Accordingly, their results have a very high MAE of 332.9 (Task 1) and 344.3 (Task 2). However, I was quite surprised to find that, on the dose evaluation metrics, other teams' results are worse than the naive reference in a number of cases.

For task 1 (MRI-CT), the number of cases in which each of the top ten teams scores worse than the naive reference is listed below for every evaluation metric:

  1. MAE: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
  2. PSNR: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
  3. SSIM: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
  4. dvh_photon: 12, 15, 12, 14, 15, 17, 17, 21, 26, 18. (Example: 12 cases from the 1st team, 15 cases from the 2nd team, ..., and 18 cases from the 10th team are worse than the naive results.)
  5. dvh_proton: 2, 3, 1, 1, 2, 1, 2, 5, 28, 2.
  6. gamma_photon: 33, 40, 38, 41, 38, 40, 39, 48, 32, 39.
  7. gamma_proton: 2, 1, 3, 1, 3, 0, 1, 0, 12, 1.
  8. dose_mae_photon: 7, 6, 9, 7, 7, 12, 7, 10, 25, 11.
  9. dose_mae_proton: 3, 0, 1, 2, 2, 2, 4, 0, 2, 1.

For task 2 (CBCT-CT), the number of cases in which each of the top ten teams scores worse than the naive reference is listed below for every evaluation metric:

  1. MAE: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
  2. PSNR: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
  3. SSIM: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
  4. dvh_photon: 6, 5, 9, 7, 8, 7, 9, 14, 10, 12.
  5. dvh_proton: 3, 3, 3, 2, 3, 13, 7, 8, 13, 11.
  6. gamma_photon: 21, 24, 35, 33, 22, 19, 33, 34, 29, 27.
  7. gamma_proton: 2, 1, 8, 3, 2, 0, 4, 11, 3, 0.
  8. dose_mae_photon: 3, 3, 6, 6, 5, 2, 6, 16, 6, 15.
  9. dose_mae_proton: 1, 0, 4, 3, 0, 0, 3, 6, 2, 2.

For the 'gamma_photon' metric in particular, more than 20% of the sCT images predicted by the top 10 groups perform worse than the naive results. If this holds, there would be little point in improving sCT image quality: one could simply use homogeneous naive CT volumes generated from binary masks for dose planning.
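For anyone who wants to reproduce this kind of count, here is a minimal sketch of the comparison. It assumes per-case results are available as plain dictionaries; the metric directions in `HIGHER_IS_BETTER`, the function name, and the toy numbers are illustrative assumptions, not taken from the challenge's evaluation code.

```python
# Assumed metric directions (True = higher is better). These are assumptions
# for illustration, not taken from the challenge's evaluation code.
HIGHER_IS_BETTER = {
    "mae": False, "psnr": True, "ssim": True,
    "dvh_photon": False, "dvh_proton": False,
    "gamma_photon": True, "gamma_proton": True,
    "dose_mae_photon": False, "dose_mae_proton": False,
}


def count_worse_than_reference(team_scores, reference_scores, metric):
    """Count the cases in which a team scores strictly worse than the reference.

    Both arguments map case ID -> {metric name -> value}. Cases or metrics
    missing from either side are skipped.
    """
    higher_is_better = HIGHER_IS_BETTER[metric]
    worse = 0
    for case_id, ref_metrics in reference_scores.items():
        team_metrics = team_scores.get(case_id)
        if team_metrics is None:
            continue
        if metric not in team_metrics or metric not in ref_metrics:
            continue
        team_value = team_metrics[metric]
        ref_value = ref_metrics[metric]
        if higher_is_better:
            worse += team_value < ref_value
        else:
            worse += team_value > ref_value
    return worse


# Toy values, invented purely for illustration.
reference = {
    "case_a": {"mae": 330.0, "gamma_photon": 98.0},
    "case_b": {"mae": 340.0, "gamma_photon": 90.0},
    "case_c": {"mae": 350.0, "gamma_photon": 95.0},
}
team = {
    "case_a": {"mae": 60.0, "gamma_photon": 97.0},
    "case_b": {"mae": 65.0, "gamma_photon": 99.0},
    "case_c": {"mae": 70.0, "gamma_photon": 94.0},
}

mae_worse = count_worse_than_reference(team, reference, "mae")
gamma_worse = count_worse_than_reference(team, reference, "gamma_photon")
```

Ties are not counted as worse here; whether that matches the counts listed above depends on how equal values were treated.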

In addition, 'abnormal' values occur for some cases in the evaluation results. For example, in task 1, cases '84' and '113' commonly show a much higher 'dvh_proton'. The naive predictions even reach a 'dvh_proton' value above 66 in task 1, about 100 times higher than normal results. In task 2, case '110' has an abnormal 'dvh_photon' value for all groups. My assumption is that the matRad framework might not generalize well to all CT volumes. The presence of metal, or masks that cause discontinuities in the CT, are potential reasons for these failures.
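One way to screen for such cases systematically is sketched below. It flags a case when every team's value exceeds a multiple of the pooled median. The function name, the `factor` threshold, and the toy values are assumptions for illustration only, and the sketch assumes a metric where lower is better.

```python
from statistics import median


def flag_abnormal_cases(values_by_team, factor=10.0):
    """Return case IDs whose value exceeds factor x the pooled median for every team.

    values_by_team maps team name -> {case ID -> metric value}, for a metric
    where lower is better. Only cases present for all teams are considered.
    """
    pooled = [v for cases in values_by_team.values() for v in cases.values()]
    if not pooled:
        return []
    threshold = factor * median(pooled)
    common_cases = set.intersection(*(set(c) for c in values_by_team.values()))
    return sorted(
        case
        for case in common_cases
        if all(values_by_team[team][case] > threshold for team in values_by_team)
    )


# Toy values, invented purely for illustration.
toy_values = {
    "team_1": {"case_1": 0.5, "case_2": 0.6, "case_3": 40.0},
    "team_2": {"case_1": 0.4, "case_2": 0.7, "case_3": 55.0},
}
flagged = flag_abnormal_cases(toy_values)
```

Requiring the value to be high for every team separates case-level problems (data or dose engine) from a single team's poor prediction.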

I hope these observations help with the data analysis for your upcoming journal paper. I am also open to further discussion.

Best regards,

FAYIU Group

Re: Concerns about the reliability of dose evaluation metrics  

  By: mmaspero on Sept. 28, 2023, 4:55 a.m.

Dear FAYIU group,

What a delight to see this detailed analysis shared. Thank you very much for raising your concerns and backing them up with numbers.

We are diving into the metrics right now and hope to have a first discussion with you in Vancouver. Note that, given the limited time available, we may not have a definitive answer on October 8; this aspect will certainly be assessed for the challenge paper and shared with the rest of the participants.

In the meantime, a couple of considerations, made before having touched the data for a detailed analysis. DVH metrics are locally dependent, so such differences may be caused by anatomical differences (air, anatomy) between the ground truth and the input data in the proximity of the structure/OAR. What is more concerning is the gamma. Here, the better compliance of the proton gamma may be due to the different geometry of the plans (for photons, an IMRT plan that almost looks like a VMAT plan, with a larger volume receiving D>10% of the prescribed dose; for protons, just a couple of beams), and hence to the size of the volume considered for the metric. The number of photon gamma cases in which our extremely simple baseline performs better certainly needs a valid explanation.

Let's wait for a deeper dive into the metrics. We will come back to you and the rest of the participants with an answer at a later stage, mentioning the point in the paper.

Best regards, and looking forward to meeting you in Vancouver,

Matteo

Re: Concerns about the reliability of dose evaluation metrics  

  By: Fanstan on Oct. 4, 2023, 12:43 p.m.

Hi Matteo,

Thank you for your reply!

Sure. We can have a discussion during the workshop. See you there on Sunday.

Best regards,

Fuxin

Re: Concerns about the reliability of dose evaluation metrics  

  By: silvain.beriault on Oct. 17, 2023, 12:51 p.m.

Just sharing my thoughts here... Could this be explained by some inaccuracies in the MR-CT registration? Was the skin mask of the naive baseline method computed directly on the CT? If so, the naive method may be free from registration errors, while all the AI methods (computed using a misaligned MRI) will suffer from them.

Could it help if the skin mask for the naive method was also computed on the registered (and possibly misaligned) MRI? This way, the same registration error would apply to the naive method as well as all the AI methods.

Anyway, this is just a thought I had...

Re: Concerns about the reliability of dose evaluation metrics  

  By: mmaspero on Nov. 29, 2023, 7:01 a.m.

Dear Fanstan and the FAYIU group,

We took some time to look into the matter. First, thanks once more for raising the point.

For the brain patients of institution B (1BB*** cases), the body mask was not applied to the CT. This was an oversight on our side: most likely, for this institution only, the mask was substituted with the defacing mask and the body mask was not taken into account. Maintaining a FOV larger than the mask resulted in dose well beyond the patient's body contour on the CT but not on the sCT, leading to the observed dose differences.

For more information on how matRad handles the extent of the dose calculation region, please take a look at https://github.com/e0404/matRad/wiki/Dose-influence-matrix-calculation. In short, the irradiated geometry is calculated via ray tracing based on the delineations provided, but in some cases it may extend beyond the body contour (the most crucial function should be https://github.com/e0404/matRad/blob/master/matRad_generateStf.m).

For our application, this means that in one-third of the brain cases (20), the sCT dose may be lower than the simple baseline provided. Why? The baseline uses a mask derived by enlarging the body contour, compensating for material outside the body contour. Twenty cases do not cover all the reported cases: we observed that a comparable issue occurs in some pelvis cases due to the presence of the table, which we did not exclude in the proximity of the body contour.
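To make the mechanism concrete, here is a toy 2D sketch of the two operations involved: applying a body mask to an image, and building a homogeneous volume from an enlarged mask. The HU values, the one-pixel margin, and all function names are assumptions for illustration; this is not the code used to prepare the challenge data or the baseline.

```python
AIR_HU = -1000   # assumed HU value for air outside the mask
WATER_HU = 0     # assumed HU value for the homogeneous fill


def dilate(mask, iterations=1):
    """Enlarge a 2D binary mask by one pixel per iteration (4-connected)."""
    rows, cols = len(mask), len(mask[0])
    current = [row[:] for row in mask]
    for _ in range(iterations):
        grown = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                if current[r][c]:
                    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < rows and 0 <= cc < cols:
                            grown[rr][cc] = 1
        current = grown
    return current


def apply_body_mask(image, mask, outside=AIR_HU):
    """Set every pixel outside the mask to `outside`."""
    return [
        [value if inside else outside for value, inside in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def naive_volume(mask, margin=1):
    """Build a homogeneous image from the mask enlarged by `margin` pixels."""
    enlarged = dilate(mask, margin)
    return [[WATER_HU if inside else AIR_HU for inside in row] for row in enlarged]


# Toy example: a one-pixel "body" with table-like material (200 HU) below it.
body_mask = [[0, 0, 0],
             [0, 1, 0],
             [0, 0, 0]]
ct = [[-1000, -1000, -1000],
      [-1000, 40, -1000],
      [200, 200, 200]]

masked_ct = apply_body_mask(ct, body_mask)   # table material removed
baseline = naive_volume(body_mask)           # enlarged mask covers part of it
```

In this toy setup, an unmasked CT keeps the material outside the contour, a masked sCT does not, and the enlarged-mask volume partly compensates for it, which mirrors the explanation given above.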

In this way, all the mentioned cases are explained; still, what about the DVH? Abnormal values in the DVH parameters are explained by patient-specific situations, e.g., residual misalignment of the body contours. We will report these cases in the paper but consider them a proper representation of clinical practice.

A couple of final considerations:

  • The mentioned issues should have a low impact on the ranking, given that they apply equally to all the participants. We are re-evaluating the cases for the upcoming publication.

  • The issue with the body contour is consistent among all the sets (train, validation, and test), ensuring the challenge's fairness.

  • No issue has been seen with the metric implementation.

We hope that this addresses your concern. We are also working to finalize the first draft of the paper. This draft will be shared with the participating teams that made a valid submission, and we are excited to hear your comments on these matters once it is shared.

Please don't hesitate to let us know if you have any other concerns.

Best regards,

Matteo

 Last edited by: mmaspero on Nov. 29, 2023, 7:02 a.m., edited 2 times in total.