Automated Evaluation

Every challenge has a unique way of objectively evaluating incoming submissions. More often than not, the evaluation scripts come with a set of dependencies and computational environments that are difficult to replicate on the host server. Therefore, every challenge organizer has to provide a (Docker) container image that packages the evaluation scripts. This container runs on our servers to execute the evaluation scripts for each incoming submission.

Building your evaluation container

To make the process easier, we created a challenge pack with an example evaluation (as well as an example algorithm and upload scripts for archives) for you. When your challenge gets accepted, our support team will create a customized version of this GitHub repo, tailored to the needs of your challenge. You can then use the provided challenge pack as a starting point for creating your custom evaluation container.

Uploading your evaluation container

After building, testing, and exporting the Docker container using the scripts provided in the challenge pack (see above), you should have a .tar.gz file containing the evaluation Docker image. You can upload this to your challenge by navigating to Admin ⟶ [Phase name] ⟶ Evaluation Methods and clicking on Add a new method:

Subsequently, Grand Challenge will verify your new evaluation Docker image. Once verification has succeeded, you will see Active Method for this Phase, which indicates that this is now the active evaluation method for the phase.

Testing your new evaluation container

After uploading a new evaluation container, you can re-evaluate existing submissions to create new evaluation results. Re-evaluation is a manual step, since re-evaluating all submissions can be quite expensive. Navigate to Admin ⟶ [name of phase] ⟶ Submission & Evaluations:

Clicking on one of the submission ID links will bring up the submission detail page. On this page, shown below, you will find a button to re-evaluate the submission with the new (active) evaluation method. Doing this reuses the existing results of the algorithm submission, which saves compute time and avoids having to upload a new algorithm image every time the evaluation method changes.

Configuring evaluation settings

To configure the leaderboard and submission details, navigate to Admin ⟶ [name of phase] ⟶ Settings:

Under Settings and Phase, the title of the selected phase can be set. This defines the leaderboard name presented to the participants:

Submission

Under the Submission tab, the submission mechanism for the selected phase can be configured. You can:

  • Indicate opening and closing dates for submissions
  • Allow submissions from only verified participants
  • Provide instructions to the participants on how to make submissions
  • Limit the number of submissions and the time period within which a participant can make submissions
  • Request that participants provide supplementary files when they make submissions (like an ArXiv link)

More instructions on how to configure this mechanism are available under Admin ⟶ [name of phase] ⟶ Settings ⟶ Submission.

Scoring

Under the Scoring tab, you can configure the leaderboard. This is where you connect the outputs of your evaluation container with the automated leaderboard mechanism present in grand-challenge.org.

Assume that your evaluation container writes the following scores to /output/metrics.json:

{
    "malignancy_risk_auc": 0.85,
    "nodule_type_accuracy": 0.75
}

You can configure the Scoring mechanisms as shown in the figure below.

  • Make sure to give an appropriate Score title - this will be displayed as the column name in the leaderboard.
  • Specify the JSON path of the main metric in Score jsonpath (in this case it is malignancy_risk_auc).
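
As a minimal sketch of the producing side, an evaluation script could write these scores as follows. The function name `write_metrics` is hypothetical and not part of the challenge pack; only the /output/metrics.json location and the two metric names come from the example above.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict, output_dir: Path = Path("/output")) -> Path:
    """Serialize the metrics dictionary to <output_dir>/metrics.json.

    Hypothetical helper: the name and signature are illustrative only.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    metrics_path = output_dir / "metrics.json"
    metrics_path.write_text(json.dumps(metrics, indent=4))
    return metrics_path


if __name__ == "__main__":
    # The values are the ones from the example, not real results.
    write_metrics(
        {
            "malignancy_risk_auc": 0.85,
            "nodule_type_accuracy": 0.75,
        }
    )
```

With this file in place, entering malignancy_risk_auc in Score jsonpath selects the first value as the main metric.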

You can also configure a more complex scoring mechanism. Take the following metrics.json, which is a nested dictionary:

{
  "case": {},
  "aggregates": {
    "dice": {
      "mean": 0.6,
      "std": 0.089
    },
    "accuracy": {
      "mean": 0.5,
      "std": 0.00235
    }
  }
}

Please note that the metrics are intended to be concise and to focus on the information required for scoring. Some additional evaluation metadata can be included, but excessive information should be stored in extra evaluation outputs. If you need assistance with setting these up, please don’t hesitate to contact support.
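
A nested structure of this shape could be assembled from per-case scores along the following lines. This is a sketch under assumptions: the function name `summarize` is hypothetical, the population standard deviation is an arbitrary choice for the "std" field, and "case" is left empty to keep the file concise.

```python
import json
from statistics import fmean, pstdev


def summarize(per_case_scores: dict[str, list[float]]) -> dict:
    """Collapse per-case scores into mean/std aggregates per metric.

    Hypothetical helper. pstdev (population standard deviation) is an
    assumption; use statistics.stdev if the sample estimate is wanted.
    """
    aggregates = {
        name: {"mean": fmean(values), "std": pstdev(values)}
        for name, values in per_case_scores.items()
    }
    # Per-case details are omitted here so that metrics.json stays small.
    return {"case": {}, "aggregates": aggregates}


if __name__ == "__main__":
    # Made-up per-case scores, for illustration only.
    metrics = summarize({"dice": [0.5, 0.7], "accuracy": [0.4, 0.6]})
    print(json.dumps(metrics, indent=2))
```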

To use both the dice and accuracy metrics in the scoring mechanism, and to display both scores with their errors on the leaderboard, you would need to:

  • Enter aggregates.dice.mean in Score jsonpath.
  • Enter aggregates.dice.std in Score error jsonpath.
  • Use Extra results columns to add the accuracy score and error to the leaderboard.
  • Set Scoring method choice to determine the way the scoring is calculated.
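
Before entering paths such as aggregates.dice.mean in the settings, it can help to check locally that they point at the intended values. The sketch below is a simplified illustration of dotted-path lookup in a nested dictionary; it is not the implementation used by Grand Challenge, and `resolve_path` is a hypothetical name.

```python
from functools import reduce


def resolve_path(metrics: dict, dotted_path: str):
    """Follow a dotted path (e.g. "aggregates.dice.mean") through nested dicts.

    Raises KeyError if any segment of the path is missing.
    """
    return reduce(lambda node, key: node[key], dotted_path.split("."), metrics)


# The nested metrics from the example above.
example_metrics = {
    "case": {},
    "aggregates": {
        "dice": {"mean": 0.6, "std": 0.089},
        "accuracy": {"mean": 0.5, "std": 0.00235},
    },
}

if __name__ == "__main__":
    for path in ("aggregates.dice.mean", "aggregates.dice.std"):
        print(path, "=", resolve_path(example_metrics, path))
```

A KeyError from this check suggests that the path entered in the settings would not match the structure of metrics.json.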

Additional settings

Two settings under Scoring allow you to control the runtime environment for your evaluation method. Specifically, you can request the type of GPU and the amount of memory that your evaluation code needs to run. This controls the selection of the virtual machine instance, just as it does for the runtime environment for algorithms. The support team controls the options that you can select here, so contact support if you require a different GPU or more memory than you can set here.

Note that these settings do not affect the GPU type or memory that participants can request for their algorithm inference jobs. Those limits are also controlled by the support team and are set up together with you when you request the challenge.

Under Leaderboard, the look and feel of the leaderboard can be configured. See the comments under each field for additional information.

Under Result Details, you can configure whether the complete /output/metrics.json from your evaluation container should be displayed on the results detail page, or whether only the metrics used for ranking should be displayed in a more readable list.