Automated Evaluation

Every challenge has its own way of objectively evaluating incoming submissions. More often than not, the evaluation scripts come with dependencies and computational environments that are difficult to replicate on the host server. Therefore, every challenge organizer has to provide a Docker container that packages the evaluation scripts. This container runs on our servers to execute the evaluation scripts for each incoming submission.

Building your evaluation container

To make the process easier, we created evalutils. Evalutils helps challenge administrators create evaluation containers for the platform. It helps you create a project structure, load and validate incoming submissions, and package the evaluation scripts in a Docker container that is compatible with the platform's requirements.

Note that you do not have to use evalutils: any Docker container that correctly generates /output/metrics.json will do.
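
As a rough sketch of what a container without evalutils could do at the end of its run, the snippet below writes a dictionary of scores to metrics.json. The function name write_metrics and the example score are hypothetical; only the output location /output/metrics.json comes from the note above.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict, output_dir: Path = Path("/output")) -> Path:
    """Write the metrics dictionary to <output_dir>/metrics.json."""
    # Create the output directory if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)
    metrics_file = output_dir / "metrics.json"
    with open(metrics_file, "w") as f:
        json.dump(metrics, f, indent=2)
    return metrics_file
```

Inside the container, calling write_metrics({"accuracy": 0.75}) with the default output_dir would produce /output/metrics.json. How the scores themselves are computed is up to your evaluation script.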


You can use your favorite Python environment to install evalutils.

$ pip install evalutils

Once evalutils is installed, you can follow the getting-started instructions to build your evaluation container, or watch the videos in which James takes you through an example.

Evaluation containers for algorithm submissions

Evaluation containers for leaderboards that rank algorithms are similar to the containers that rank prediction files, but there is one important difference. Since the platform automatically runs submitted algorithms on a private test set, it assigns random but unique filenames to the outputs of the algorithms. The platform also supplies a JSON file that tells you how to map these random output filenames to the original input filenames, and where to read the output files from.

As a challenge organizer, you must therefore read /input/predictions.json to map the output filenames to the input filenames. This is necessary to evaluate the predictions correctly. The example below defines a load_predictions_json function that loads the JSON, loops through the inputs and outputs of each entry, and builds a dictionary that maps each output filename to the name of the corresponding input image.

import json
from pathlib import Path

def load_predictions_json(fname: Path):

    cases = {}

    with open(fname, "r") as f:
        entries = json.load(f)

    if isinstance(entries, float):
        raise TypeError(f"entries of type float for file: {fname}")

    for e in entries:
        # Find case name through input file name
        inputs = e["inputs"]
        name = None
        for input in inputs:
            if input["interface"]["slug"] == "generic-medical-image":
                name = str(input["image"]["name"])
                break  # expecting only a single input
        if name is None:
            raise ValueError(f"No filename found for entry: {e}")

        # Find output value for this case
        outputs = e["outputs"]

        for output in outputs:
            if output["interface"]["slug"] == "generic-medical-image":
                pk = output["image"]["pk"]
                if ".mha" not in pk:
                    pk += ".mha"
                cases[pk] = name

    return cases

We then store the returned dictionary as mapping_dict and use it to map the outputs to the actual filenames when computing the metrics in the evaluation script. This is done by updating self._predictions_cases["ground_truth_path"] with the contents of mapping_dict.

self.mapping_dict = load_predictions_json(Path("/input/predictions.json"))
self._predictions_cases["ground_truth_path"] = [
    self._ground_truth_path / self.mapping_dict[Path(path).name]
    for path in self._predictions_cases.path
]
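
To make the structure that load_predictions_json relies on more concrete, here is a self-contained sketch with a single hand-written entry. The primary keys and filenames are made up, and only the fields that the function reads are included; a real predictions.json contains more. build_mapping is a hypothetical, condensed variant of the loop shown earlier, not a replacement for it.

```python
# Hypothetical predictions.json content, reduced to the fields read above.
entries = [
    {
        "pk": "job-0001",
        "inputs": [
            {
                "interface": {"slug": "generic-medical-image"},
                "image": {"name": "case_01.mha"},
            }
        ],
        "outputs": [
            {
                "interface": {"slug": "generic-medical-image"},
                "image": {"pk": "a1b2c3"},
            }
        ],
    }
]


def build_mapping(entries: list) -> dict:
    """Map each output image filename to the name of the input image."""
    slug = "generic-medical-image"
    cases = {}
    for entry in entries:
        # The first matching input gives the original case name.
        name = next(
            i["image"]["name"]
            for i in entry["inputs"]
            if i["interface"]["slug"] == slug
        )
        for output in entry["outputs"]:
            if output["interface"]["slug"] == slug:
                pk = output["image"]["pk"]
                if not pk.endswith(".mha"):
                    pk += ".mha"
                cases[pk] = name
    return cases


mapping = build_mapping(entries)
print(mapping)  # {'a1b2c3.mha': 'case_01.mha'}
```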

In addition to the predictions.json file, the evaluation container also has access to the algorithm's outputs. You may want to read the output files directly, for example, if the outputs are large JSON files, heatmaps, or segmentations. The outputs of the algorithm containers are provided to the evaluation container at the following path:



  • job_pk corresponds to the primary key (pk) of each algorithm job, i.e., the top-level "pk" entry for each JSON object in the predictions.json file
  • interface_relative_path corresponds to the relative_path of each output of a job. For the first output of the first algorithm job, this is [0]["outputs"][0]["interface"]["relative_path"]. If your algorithm produces more than one output, you need to loop over the outputs to get their respective relative paths.
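
The two lookups described in the bullets above could be sketched as follows. The function name output_locations and the example values are hypothetical; the sketch only collects the (job_pk, interface_relative_path) pairs and leaves it to you to combine them into the path given above.

```python
def output_locations(entries: list) -> list:
    """Return (job_pk, interface_relative_path) pairs for every output of every job."""
    locations = []
    for entry in entries:
        job_pk = entry["pk"]  # top-level "pk" of the algorithm job
        for output in entry["outputs"]:
            locations.append((job_pk, output["interface"]["relative_path"]))
    return locations


# Hypothetical entry with two outputs; the values are made up for illustration.
example = [
    {
        "pk": "job-0001",
        "outputs": [
            {"interface": {"relative_path": "images/segmentation"}},
            {"interface": {"relative_path": "results.json"}},
        ],
    }
]
pairs = output_locations(example)
```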

Configuring evaluation settings

To configure the challenge leaderboard ranking, presentation of the leaderboard to participants, and submission details, navigate to Admin → [name of phase] Evaluation Settings:

Under Phase, the title of the selected phase can be set. This defines the leaderboard name presented to the participants:

Uploading your evaluation container

After building, testing, and exporting the Docker container with the tutorial above, you should have a .tar.gz file containing the evaluation Docker container. You can upload this to your challenge by navigating to Admin → Methods → + Add a new method:

Then, select the intended phase of the challenge and upload the Docker container (.tar.gz file). Once the upload has finished, select Save:

Subsequently, Grand Challenge will verify your new evaluation Docker container. Once verification has succeeded, you will see Ready: True for your method:


Under the Submission tab, the submission mechanisms for the selected phase can be configured. You can:

  • Indicate opening and closing dates for submissions
  • Allow submissions from only verified participants
  • Provide instructions to the participants on how to make submissions
  • Limit the number of submissions and the time period within which a participant can make submissions
  • Request that participants provide supplementary files when they make submissions (like an ArXiv link)

More instructions on how to configure this mechanism are available under Admin → Evaluation Settings → Submission.


Under the Scoring tab, the exact specification of the leaderboard ordering can be configured. This is where you connect the outputs of your evaluation container to the automated leaderboard mechanism of the platform.

Assuming that your evaluation container writes the following scores as the output in /output/metrics.json:

{
    "malignancy_risk_auc": 0.85,
    "nodule_type_accuracy": 0.75
}

Then you can configure the Scoring mechanisms as shown in the figure below.

  • Make sure to give an appropriate Score title - this will be displayed at the head of the leaderboard.
  • Specify the JSON path of the main metric in Score jsonpath (in this case it is malignancy_risk_auc).

You can also configure a more complex scoring mechanism. Take the following metrics.json, which is a nested dictionary:

{
  "case": {},
  "aggregates": {
    "dice": {
      "mean": 0.6,
      "std": 0.089
    },
    "accuracy": {
      "mean": 0.5,
      "std": 0.00235
    }
  }
}

To use both the dice and accuracy metrics in the scoring mechanism, and to display both scores with their errors on the leaderboard, you would need to:

  • Enter aggregates.dice.mean in Score jsonpath.
  • Enter aggregates.dice.std in Score error jsonpath.
  • Use Extra results columns to add the accuracy score and error to the leaderboard
  • Set Scoring method choice to determine the way the scoring is calculated.
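
Before entering these paths, it can help to check locally that they actually exist in your metrics.json. The sketch below follows a dot-separated path into a nested dictionary. It is only an illustration of how a path such as aggregates.dice.mean lines up with the nesting, not the lookup code used by the platform, and the helper name resolve is hypothetical.

```python
from functools import reduce

# The nested metrics from the example above.
metrics = {
    "case": {},
    "aggregates": {
        "dice": {"mean": 0.6, "std": 0.089},
        "accuracy": {"mean": 0.5, "std": 0.00235},
    },
}


def resolve(data: dict, dotted_path: str):
    """Follow a dot-separated path of keys into a nested dictionary."""
    # Raises KeyError if any key along the path is missing.
    return reduce(lambda node, key: node[key], dotted_path.split("."), data)


print(resolve(metrics, "aggregates.dice.mean"))  # 0.6
print(resolve(metrics, "aggregates.dice.std"))   # 0.089
```

A KeyError from resolve indicates that the path does not match the structure your evaluation container writes.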

Additional settings

Under Leaderboard, the look and feel of the leaderboard can be configured. See the comments under each field for additional information.

Under Result Details, you can configure whether the /output/metrics.json from your evaluation container should be accessible to the participants. If you have more information in the metrics.json than you want to share with participants, you should make sure this option is turned off!