Clarification on Macro F1 Score Calculation for Nuclei

Clarification on Macro F1 Score Calculation for Nuclei  

  By: NiTo on Feb. 24, 2025, 9:13 a.m.

Dear PUMA Challenge Organizers,

I hope you are doing well. I have a question regarding the method used to estimate the Macro F1 score for nuclei.

From my understanding, the F1 score for each nuclei class is averaged over all frames rather than only over the frames in which that class is actually present. For example, for epithelium nuclei, the calculation appears to be:

[Sum of F1 over frames where the class is present] / [Total number of frames]

rather than:

[Sum of F1 over frames where the class is present] / [Number of frames where the class is present]

Could you confirm whether this is the case? If so, could you clarify the reasoning behind this approach? It seems that normalizing by the total number of frames could underestimate the F1 score for rarer nuclei types.

I appreciate your time and look forward to your response.

Re: Clarification on Macro F1 Score Calculation for Nuclei  

  By: mschuiveling on Feb. 24, 2025, 2:21 p.m.

Dear NiTo,

Thank you and your team for your enthusiastic participation in the challenge!

The number of samples is not considered in the calculation of the Macro F1 score for nuclei. We accumulate true positives, false positives, and false negatives per class across all samples and use these totals to compute the F1 score per class. The final Macro F1 score is then obtained by averaging across classes, without using the number of samples.
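
For illustration, here is a minimal sketch of that accumulation scheme (the data layout and names are made up for the example; the actual implementation is in the evaluation code linked below):

    from collections import defaultdict

    def macro_f1(matches_per_sample):
        """matches_per_sample: list of dicts mapping a class name to
        (TP, FP, FN) counts for one sample."""
        totals = defaultdict(lambda: [0, 0, 0])
        # First pool TP/FP/FN per class across ALL samples...
        for sample in matches_per_sample:
            for cls, (tp, fp, fn) in sample.items():
                totals[cls][0] += tp
                totals[cls][1] += fp
                totals[cls][2] += fn
        # ...then compute one F1 per class from the pooled counts,
        # F1 = 2*TP / (2*TP + FP + FN), and average over classes.
        f1_scores = []
        for tp, fp, fn in totals.values():
            denom = 2 * tp + fp + fn
            f1_scores.append(2 * tp / denom if denom else 0.0)
        return sum(f1_scores) / len(f1_scores)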

For more information on the metric calculation, see: PUMA-challenge-eval-track2/eval_nuclei.py

I hope this clarifies your question. Please let me know if you need further explanation.

Kind regards, Mark

Re: Clarification on Macro F1 Score Calculation for Nuclei  

  By: agaldran on March 5, 2025, 3:23 p.m.

Dear Mark,

I am struggling to understand the nuclei instance segmentation evaluation code. The first thing I tried was to use your code as-is to evaluate a GeoJSON against itself, expecting a perfect score, but I did not get one.

So I started digging in, and it seems that no features are found here.

Also, my plan, after making the above work (which I could not), was to predict a NumPy array and then convert it into a GeoJSON and call your evaluation code. So I went to your baseline_track2 code and tried to find some piece of code that would allow me to do that, something like the inverse of the process you kindly shared with me here. Unfortunately, I am unable to find anything useful; the code is too complex (for me). So my second question is: assuming my model outputs a (1024,1024) image with integers in [0,...,10], where could I find some code to convert this into what you expect (a GeoJSON, I assume)?


Related to this, if I have a "confidence map" of size, for example, (11,1024,1024), where each pixel holds the probability of belonging to the background or to each nuclei class, can I use this? I am asking because in the evaluation section you mention something about a "confidence score".

Re: Clarification on Macro F1 Score Calculation for Nuclei  

  By: mschuiveling on March 6, 2025, 7:18 a.m.

Dear Adrián,

Thank you for your participation in the challenge! I will go through your questions one by one—please let me know if I have understood them correctly.

  1. Why doesn't evaluating a GeoJSON against itself result in a perfect score?
    The Grand Challenge (GC) platform does not use standard GeoJSON but instead relies on a custom JSON format. Since GeoJSON has a different structural design for storing information, it is not directly compatible with the platform, which is why your evaluation did not yield a perfect score.

    You can find more details on the expected JSON format and the encoding of polygons here.

  2. How can I convert a NumPy array (output of my model) into the expected JSON format for evaluation?
    The conversion should follow the JSON format required by both the GC platform and our evaluation code. We implemented this process here.

    However, I would recommend modifying our approach to avoid creating a binary mask followed by contour detection. We later found that this method can cause overlapping nuclei of the same class to merge, leading to inaccurate segmentation.

  3. Can I use a confidence map (e.g., a (11,1024,1024) array with probabilities) for evaluation?
    The confidence score is optional; if not provided, it defaults to 1. It is used during metric calculation, where the prediction with the highest confidence score within the distance threshold is selected; if all confidence scores are set to 1, the closest annotation is used instead. You can include a confidence score for each annotation as a value between 0 and 1. In our code it is called score, and it replaces the original probability in the JSON.
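
As an illustration of that selection rule (a sketch of the behaviour described above, not the actual evaluation code; all names are made up):

    def select_match(gt_centroid, predictions, dist_thresh):
        """predictions: list of ((x, y), score) tuples. Among the
        predictions within dist_thresh of the ground-truth centroid,
        pick the one with the highest confidence score; distance
        breaks ties, so with all scores at 1 the closest one wins."""
        def dist(p):
            return ((p[0][0] - gt_centroid[0]) ** 2
                    + (p[0][1] - gt_centroid[1]) ** 2) ** 0.5
        candidates = [p for p in predictions if dist(p) <= dist_thresh]
        if not candidates:
            return None
        return max(candidates, key=lambda p: (p[1], -dist(p)))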

Suggested Approach:

  1. Create code to convert a GeoJSON ground truth into the expected JSON ground truth format (using ChatGPT can speed up this process).
  2. Test the inference code to verify that the structure is correct.
  3. Write code (based on our implementation) to convert a NumPy array into the JSON format required by the evaluation code. Ensure that you perform contour detection separately for each instance rather than on a binary mask (this is slower but more accurate); see the sketch after this list.
  4. Test your output with the evaluation code.
  5. Integrate the process into your container.
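
As a starting point for step 3, here is a minimal sketch of per-instance contour detection with OpenCV. It assumes an instance map in which each nucleus carries a unique integer label; it is untested against our pipeline, so treat it as a sketch rather than a drop-in solution:

    import cv2
    import numpy as np

    def instance_polygons(inst_map):
        """inst_map: (H, W) integer array, 0 = background, each
        nucleus labelled with its own integer. Returns a dict
        mapping label -> (N, 2) contour array. Running findContours
        once per instance keeps touching nuclei separate, unlike
        contour detection on a single binary mask."""
        polygons = {}
        for label in np.unique(inst_map):
            if label == 0:
                continue
            mask = (inst_map == label).astype(np.uint8)
            contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                           cv2.CHAIN_APPROX_SIMPLE)
            if contours:
                # Keep the largest contour of this instance.
                polygons[label] = max(contours,
                                      key=cv2.contourArea).squeeze(1)
        return polygons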

Let me know if you need further help!

Best, Mark

Re: Clarification on Macro F1 Score Calculation for Nuclei  

  By: agaldran on March 6, 2025, 12:05 p.m.

Hello Mark,

Thanks for the quick reply. I am still trying to make sense of it all, but it looks like a good starting point. Since I already know how to go from GeoJSON to NumPy, I think I will skip the GeoJSON side entirely and convert both the ground truth and my predictions from NumPy to GC-style JSONs, so that I can use your code to evaluate. To be clear, I am not trying to run your evaluation container; I am just trying to evaluate my models internally, perform model selection, etc.

I am a bit confused about the use of zarr, which I had never used before, but I hope I will be able to manage. Most likely I will be back here sooner rather than later, though. Regards, Adrian

** By the way, why do you have this: json_filename = os.path.join('/output/melanoma-3-class-nuclei-segmentation.json')? I mean, the output is supposed to be 10-class nuclei segmentation, right? Or am I wrong?

Re: Clarification on Macro F1 Score Calculation for Nuclei  

  By: agaldran on March 6, 2025, 2:42 p.m.

Hello again,

Wow, your code is way more sophisticated than I can handle! I have spent an hour or two trying to understand what is going on in your inference code, but to be honest I seem unable to understand it well enough to modify it safely.

  • My first question is: it appears (from here) that in both track 1 and track 2 you are dealing with 3-class nuclei instance classification, correct? I thought I read somewhere on the challenge website that for track 1 it was 3-class, but for track 2 it was 10-class? I cannot find it now.
  • My second question is about how to use your code in my scenario to generate a valid JSON for instance segmentation/classification. I understand from here that your model outputs an array with 4 channels for classification (I guess background/tumor/lymphocyte/other, so out_channels_cls = 4), but I do not understand what inst_channels is, and why inst_channels = 5.

    Anyway, you then do lots of things, and end up calling create_polygon_json here, which appears to take pinst_out ("instance segmentation results") and pcls_out ("class map containing instance-to-class mapping") and produce the JSON expected by GC. That, I guess, is the magic function I would also like to use.

So far my model directly produces an array of spatial size n_categories x 1024 x 1024 (no tile stitching), with one channel per category, so I get a "hard" segmentation by arg-maxing over categories and a "soft" probability map of the same size by softmaxing. How can I map these onto your pinst_out array and pcls_out dictionary? I think the use of zarr is what is confusing me most, since I just have plain NumPy arrays, and it is extremely hard to understand what functions like this one are doing. Any help is greatly appreciated.

Regards,

Adrian

Re: Clarification on Macro F1 Score Calculation for Nuclei  

  By: danieleek on March 7, 2025, 3:32 p.m.

Hi Adrian,

About your first question: there is a naming error in the container code on GitHub; however, the container uploaded to the Grand Challenge platform functions normally and accepts 10-class data. Our apologies for the inconvenience. To be clear: Track 2 submissions should include 10-class predictions, while Track 1 submissions should produce a 3-class JSON file for nuclei segmentation.

About your second question: you can ignore the variable inst_channels; it is simply a fixed parameter for the encoder used in our baseline algorithm. The parameter out_channels_cls indeed captures the number of output classes; for Track 1 this is 4 (3 nuclei classes + 1 background class).

Regarding your approach, I would advise against trying to map your output to the pinst_out and pcls_out format; instead, write a function that produces the final JSON file directly from your NumPy arrays. Examples of what the output files should look like can be found in the output interfaces list (check "View example"): https://grand-challenge.org/components/interfaces/outputs/

For a nuclei segmentation JSON file, the following format is expected:

{ "name": "Areas of interest", "type": "Multiple polygons", "polygons": [ { "name": "Area 1", "seed_point": [ 55.82, 90.46, 0.5 ], "path_points": [ [ 55.82, 90.46, 0.5 ], [ 55.93, 90.88, 0.5 ], [ 56.24, 91.19, 0.5 ], [ 56.66, 91.3, 0.5 ] ], "sub_type": "brush", "groups": [ "manual" ], "probability": 0.67 }, { "name": "Area 2", "seed_point": [ 90.22, 96.06, 0.5 ], "path_points": [ [ 90.22, 96.06, 0.5 ], [ 90.33, 96.48, 0.5 ], [ 90.64, 96.79, 0.5 ] ], "sub_type": "brush", "groups": [], "probability": 0.92 } ], "version": { "major": 1, "minor": 0 } }

After doing this, you should be good to go. If you have any further questions, feel free to reach out.

Kind regards,

Daniel