Automated Evaluation


Every challenge has its own way of objectively evaluating incoming submissions. More often than not, the evaluation scripts come with a set of dependencies and a computational environment that are difficult to replicate on the host server. Therefore, every challenge organizer has to provide a Docker container that packages the evaluation scripts. This container will run on our servers to execute the evaluation scripts for each incoming submission.

Building your evaluation container

To make the process easier, we created a challenge pack for you with an example evaluation (as well as an example algorithm and upload scripts for archives). When your challenge gets accepted, our support team will create a customized version of this GitHub repo, tailored to the needs of your challenge. You can then use the provided challenge pack as a starting point for creating your custom evaluation container.

Configuring evaluation settings

To configure the challenge leaderboard ranking, presentation of the leaderboard to participants, and submission details, navigate to Admin → [name of phase] Evaluation Settings:

Under Phase, the title of the selected phase can be set. This defines the leaderboard name presented to the participants:



Uploading your evaluation container

After building, testing, and exporting the Docker container with the tutorial above, you should have a .tar.gz file containing the evaluation Docker container. You can upload this to your challenge by navigating to Admin → Methods → + Add a new method:


Then select the intended phase of the challenge and upload the Docker container (.tar.gz file). Once the upload has completed, select Save:


Grand Challenge will then verify your new evaluation Docker container. Once verification has succeeded, you will see Ready: True for your method:


Testing your new evaluation container

After uploading a new evaluation container, you can re-evaluate existing submissions to create new evaluation results. Re-evaluation is a manual step, since re-evaluating all submissions can be quite expensive. Navigate to Admin → [name of phase] Submission & Evaluations:


Clicking on one of the submission ID links will bring up the submission detail page. On this page, shown below, you will find a button to re-evaluate the submission with the new (active) evaluation method. Doing this reuses the existing results of the algorithm submission, which saves compute time and avoids having to upload a new algorithm image every time the evaluation method changes.


Submission

Under the Submission tab, the submission mechanisms for the selected phase can be configured. You can:

  • Indicate opening and closing dates for submissions
  • Allow submissions from only verified participants
  • Provide instructions to the participants on how to make submissions
  • Limit the number of submissions and the time period within which a participant can make submissions
  • Request that participants provide supplementary files or information when they make submissions (like an arXiv link)

More instructions on how to configure this mechanism are available under Admin → Evaluation Settings → Submission.


Scoring

Under the Scoring tab, the exact specification of the leaderboard ordering can be configured. This is where you connect the outputs of your evaluation container to the automated leaderboard mechanism on grand-challenge.org.

Assuming that your evaluation container writes the following scores as the output in /output/metrics.json:

{
    "malignancy_risk_auc": 0.85,
    "nodule_type_accuracy": 0.75
}


Then you can configure the Scoring mechanisms as shown in the figure below.

  • Make sure to give an appropriate Score title - this will be displayed at the head of the leaderboard.
  • Specify the JSON path of the main metric in Score jsonpath (in this case it is malignancy_risk_auc).
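
As a minimal sketch of how an evaluation script might produce this file, assuming a Python-based evaluation container: the helper name write_metrics and its output_dir parameter are hypothetical and not part of Grand Challenge. Only the /output/metrics.json location and the two metric names come from the example above.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict, output_dir: str = "/output") -> Path:
    """Serialize the metrics dictionary to <output_dir>/metrics.json.

    The default directory mirrors the /output/metrics.json location used in
    this section; the function itself is an illustrative helper.
    """
    output_path = Path(output_dir) / "metrics.json"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w") as f:
        json.dump(metrics, f, indent=4)
    return output_path
```

Inside the container, calling write_metrics({"malignancy_risk_auc": 0.85, "nodule_type_accuracy": 0.75}) would write the flat dictionary shown above, so that the key malignancy_risk_auc can be entered directly as the Score jsonpath. A real script would compute these values from the submission instead of hard-coding them.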

You can also configure a more complex scoring mechanism. Take the following metrics.json, which is a nested dictionary:

{
  "case": {},
  "aggregates": {
    "dice": {
      "mean": 0.6,
      "std": 0.089
    },
    "accuracy": {
      "mean": 0.5,
      "std": 0.00235
    }
  }
}


To use both the dice and accuracy metrics in the scoring mechanism, and to display both scores with their errors on the leaderboard, you would need to:

  • Enter aggregates.dice.mean in Score jsonpath.
  • Enter aggregates.dice.std in Score error jsonpath.
  • Use Extra results columns to add the accuracy score and error to the leaderboard.
  • Set Scoring method choice to determine how the final score, and therefore the ranking, is calculated from these metrics.
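
To make the dotted paths concrete, the sketch below builds an aggregates structure from per-case scores and looks up values by path. It assumes a Python evaluation script. The helper names aggregate and resolve_path and the per-case values are hypothetical, and the lookup only illustrates which value a path such as aggregates.dice.mean points to, not how grand-challenge.org resolves paths internally.

```python
from statistics import mean, pstdev


def aggregate(case_scores: dict) -> dict:
    """Build {"metric": {"mean": ..., "std": ...}} from per-case score lists.

    Uses the population standard deviation; this is a choice made for the
    sketch, and a challenge may prefer the sample standard deviation.
    """
    return {
        name: {"mean": mean(values), "std": pstdev(values)}
        for name, values in case_scores.items()
    }


def resolve_path(metrics: dict, path: str):
    """Follow a dotted path such as 'aggregates.dice.mean' through nested dicts."""
    value = metrics
    for key in path.split("."):
        value = value[key]  # raises KeyError if the path does not exist
    return value


# Hypothetical per-case scores, for illustration only.
case_scores = {"dice": [0.5, 0.7], "accuracy": [0.4, 0.6]}
metrics = {"case": {}, "aggregates": aggregate(case_scores)}

# These are the values the Score jsonpath and Score error jsonpath would refer to.
dice_mean = resolve_path(metrics, "aggregates.dice.mean")
dice_std = resolve_path(metrics, "aggregates.dice.std")
```

Checking paths this way before uploading a new evaluation container can catch a mistyped key early, since a path that does not exist in metrics.json raises a KeyError here.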

Additional settings

Under Leaderboard, the look and feel of the leaderboard can be configured. See the comments under each field for additional information.

Under Result Details, you can configure whether the /output/metrics.json from your evaluation container should be accessible to the participants. If you have more information in the metrics.json than you want to share with participants, you should make sure this option is turned off!