Regarding the N_SSIM evaluation metric

Regarding the N_SSIM evaluation metric  

  By: capybara on March 17, 2025, 4:02 a.m.

Hi @dorian_kauffmann,

Sorry for asking at the end of the challenge, but I think something is wrong with the current N_SSIM evaluation. According to one discussion here, if we return the inputs as outputs, the N_SSIM should be 0.0, since prediction_ssim == reference_ssim.
I also submitted code that returns the inputs as outputs, but the N_SSIM scores were non-zero on the Test phase leaderboard.
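
For reference, this is how I understand the metric from that discussion (a minimal sketch, assuming N_SSIM = (prediction_ssim - reference_ssim) / (1 - reference_ssim); the organizers' actual implementation may differ):

```python
from skimage.metrics import structural_similarity as ssim

def n_ssim(prediction, reference_input, ground_truth):
    """Sketch of the normalized SSIM as I understand it: 0 when the
    prediction is no better than returning the input unchanged."""
    data_range = float(ground_truth.max() - ground_truth.min())
    prediction_ssim = ssim(ground_truth, prediction, data_range=data_range)
    reference_ssim = ssim(ground_truth, reference_input, data_range=data_range)
    # If prediction == reference_input, the two SSIM values are equal and N_SSIM is 0
    return (prediction_ssim - reference_ssim) / (1.0 - reference_ssim)
```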

Do you know the reason why?

Re: Regarding the N_SSIM evaluation metric  

  By: dorian_kauffmann on March 17, 2025, 9:49 a.m.

Hello,

Indeed, it should be 0 (for all 7 participants in the Test phase).

I downloaded your prediction and the input image it was supposed to be identical to. The input image is about 590 MB, whereas your prediction is about 295 MB.

I opened both images in Fiji, and it appears that our images are 16-bit while your predictions are 8-bit. I think that is the reason: the images are in fact not fully identical.
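
To illustrate (a rough sketch with synthetic data, not our evaluation code): once a 16-bit image has been rescaled to 8-bit, bringing it back to the original scale leaves quantization error, so it no longer matches the input exactly.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(0)
img_16bit = rng.integers(0, 5000, size=(256, 256), dtype=np.uint16)

# Rescale to [0-255] and cast to 8-bit, as the submitted predictions appear to be
img_8bit = (img_16bit.astype(np.float64) / img_16bit.max() * 255).astype(np.uint8)

# Bring the 8-bit copy back to the original scale: quantization error remains
img_back = img_8bit.astype(np.float64) / 255 * float(img_16bit.max())

print(np.array_equal(img_16bit, img_back))            # False
print(ssim(img_16bit.astype(np.float64), img_back,
           data_range=float(img_16bit.max())))        # close to, but below, 1.0
```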

Re: Regarding the N_SSIM evaluation metric  

  By: capybara on March 17, 2025, 11:31 a.m.

Hi @dorian_kauffmann

Thanks for your reply!

I scaled the input data to the range [0-255], so it was stored as an 8-bit integer type. I found that the percentile_normalization function provided here is sensitive to the value range (low_percentile and high_percentile differ when the data is in different value ranges).

In raw TIF files, the pixel values can be quite high (e.g., [0-5000]), whereas algorithms normally output data in the range [0-255]. If the output and the ground truth have different data ranges, percentile_normalization won't be robust.
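
For example (synthetic data, just to show what I mean): the same image expressed on two different scales yields very different low_percentile / high_percentile values.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.integers(0, 5000, size=(128, 128)).astype(np.float64)  # raw-TIF-like value range
scaled = raw / raw.max() * 255                                   # same image, rescaled to [0-255]

# The percentile values fed into percentile_normalization depend entirely on the value range
print(np.percentile(raw, (0.1, 99.9)))     # on the order of [~5, ~4990]
print(np.percentile(scaled, (0.1, 99.9)))  # on the order of [~0.3, ~255]
```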

I know it's quite late to propose another evaluation metric, but I'd recommend scaling both the output and the ground-truth data to the [0-1] range after percentile_normalization for a more robust evaluation. What do you think?
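
Concretely, I mean something like this (just a sketch of the idea, with hypothetical names):

```python
import numpy as np

def to_unit_range(img):
    """Min-max rescale an already percentile-normalized image to [0-1],
    so that output and ground truth are compared on the same scale."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-12)

# pred_norm = to_unit_range(percentile_normalization(prediction))
# gt_norm   = to_unit_range(percentile_normalization(ground_truth))
```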

Re: Regarding the N_SSIM evaluation metric  

  By: dorian_kauffmann on March 17, 2025, 12:41 p.m.

Thanks for your suggestion!

I understand your point about scaling and normalization, but we are dealing with microscopy images, which are very different from standard images. Pixel ranges can vary widely across imaging conditions and with depth (between two planes of the same image stack, the ranges are not always the same).

> In raw TIF files, the pixel values can be quite high (e.g., [0-5000]), whereas algorithms normally output data in the range [0-255].

The outputs don't always need to be on the [0-255] scale (i.e., 8-bit).
Since the beginning of the challenge, all images have been 16-bit ([0-65,535]).
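
If it helps, predictions can simply be kept in that 16-bit type when written out, for example with tifffile (a sketch, with hypothetical file names):

```python
import numpy as np
import tifffile

prediction = tifffile.imread("input_image.tif")   # hypothetical path; the challenge images are uint16
assert prediction.dtype == np.uint16              # keep the 16-bit type instead of casting to 8-bit
tifffile.imwrite("prediction.tif", prediction)    # written unchanged, still 16-bit
```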

Scaling both output and ground-truth to [0-1]

> I know it's quite late to propose another evaluation metric, but I'd recommend scaling both the output and the ground-truth data to the [0-1] range after percentile_normalization for a more robust evaluation. What do you think?

I think it's important to keep the percentile normalization approach as it is, and we will not make any changes 10 hours before the end of the Evaluation Phase.

Best