Automated Evaluation


Every challenge has a unique way of objectively evaluating incoming submissions. More often than not, the evaluation scripts come with dependencies and computational environments that are difficult to replicate on the host server. We therefore require every challenge organizer to provide a Docker container that packages the evaluation scripts. This container runs on our servers to evaluate each incoming submission.

Building your evaluation container

To make this process easier, we created evalutils. Evalutils helps challenge administrators easily create evaluation containers for grand-challenge.org: it sets up a project structure, loads and validates submissions, and packages the evaluation scripts in a Docker container that is compatible with the requirements of grand-challenge.org. Note that you do not have to use evalutils.

Requirements

You can use your favorite Python environment to install evalutils.

$ pip install evalutils


Once you've installed evalutils, you can follow the instructions for getting started and building your evaluation container here, and watch the accompanying videos to see James take you through an example.
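For reference, the evaluation entry point that evalutils generates is a small Python script built around one of its Evaluation classes. The sketch below is loosely based on the classic evalutils classification template; the class name MyChallengeEvaluation, the expected column names, and the number of cases are placeholders, and the exact classes and options available depend on your evalutils version, so treat this as an illustration rather than a drop-in script.

from evalutils import ClassificationEvaluation
from evalutils.io import CSVLoader
from evalutils.validators import (
    ExpectedColumnNamesValidator,
    NumberOfCasesValidator,
)


class MyChallengeEvaluation(ClassificationEvaluation):
    def __init__(self):
        super().__init__(
            # How the submission and ground truth files are read from disk
            file_loader=CSVLoader(),
            # Sanity checks applied to every submission before scoring
            validators=(
                ExpectedColumnNamesValidator(expected=("case", "class")),
                NumberOfCasesValidator(num_cases=10),
            ),
            # Column used to join predictions to the ground truth
            join_key="case",
        )


if __name__ == "__main__":
    MyChallengeEvaluation().evaluate()

When you generate a project with evalutils, it also includes the Dockerfile and helper scripts used to build, test, and export the container described in the next sections.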


Configuring Evaluation Settings

To configure the challenge leaderboard ranking, presentation of the leaderboard to participants, and submission details, navigate to Admin → [name of phase] Evaluation Settings:

Under Phase, the title of the selected phase can be set. This defines the leaderboard name presented to the participants:


Uploading your evaluation container

After building, testing, and exporting the Docker container as described in the tutorial above, you should have a .tar.gz file containing the evaluation Docker container. You can upload this to your challenge by navigating to Admin → Methods → + Add a new method:


Then select the intended phase of the challenge, upload the Docker container (.tar.gz file), and select Save once the upload has completed:


Subsequently, Grand Challenge will verify your new evaluation Docker container. Once verification has completed successfully, you will see Ready: True for your method:



Submission

Under Submission, the submission mechanisms for the selected phase can be configured. You can:

  • Indicate opening and closing dates for submissions
  • Allow submissions from only verified participants
  • Provide instructions to the participants on how to make submissions
  • Limit the number of submissions and the time period within which a participant can make submissions
  • Request that participants provide supplementary files when they make submissions (such as an arXiv link)

More instructions on how to configure these settings are available under Admin → Evaluation Settings → Submission.

Scoring

Under Scoring, the exact specification of the leaderboard ordering can be configured. This is where you connect the outputs of your evaluation container to the automated leaderboard mechanism of grand-challenge.org.

Assuming that your evaluation container writes the following aggregated scores as its output to /output/metrics.json:

{
    "malignancy_risk_auc": 0.85,
    "nodule_type_accuracy": 0.75
}
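For illustration, the end of an evaluation script that produces exactly this file could look like the following minimal sketch. The dummy labels, the choice of scikit-learn for the metric computation, and the variable names are all placeholders; only the output location /output/metrics.json (inside the running container) is fixed by the platform.

import json

from sklearn.metrics import accuracy_score, roc_auc_score

# Dummy labels and predictions; in a real evaluation these are built from
# the ground truth and the participant's submission.
y_true = [0, 1, 1, 0, 1]
y_score = [0.2, 0.9, 0.7, 0.4, 0.8]  # predicted malignancy risk
true_types = ["solid", "ground_glass", "solid", "solid", "part_solid"]
pred_types = ["solid", "ground_glass", "part_solid", "solid", "part_solid"]

metrics = {
    "malignancy_risk_auc": roc_auc_score(y_true, y_score),
    "nodule_type_accuracy": accuracy_score(true_types, pred_types),
}

# Grand Challenge reads the aggregated scores from this fixed location.
with open("/output/metrics.json", "w") as f:
    json.dump(metrics, f)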


Then you can configure the Scoring mechanisms as shown in the figure below.

  • Make sure to give an appropriate Score title - this will be displayed at the head of the leaderboard.
  • Specify the JSON path of the main metric in Score jsonpath (in this case it is malignancy_risk_auc). If your metrics.json is a nested dictionary like the following:
{
    "case": {},
    "aggregates": {
        "accuracy_score": 0.5
    }
}


then you should enter aggregates.accuracy_score in Score jsonpath (see the sketch after this list for how the dot notation is resolved).

  • If you want to rank on more than one metric, you can use Extra results columns to add additional metrics and then use the options available under Scoring method choice to combine the metrics and rank the participants on the leaderboard.
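To make the dot notation concrete, a Score jsonpath like aggregates.accuracy_score is resolved against the parsed metrics.json by following one key per dot. The snippet below is a simplified illustration of that convention, not the code grand-challenge.org actually runs:

import json

metrics = json.loads('{"case": {}, "aggregates": {"accuracy_score": 0.5}}')

def resolve(document, jsonpath):
    # Follow each dot-separated key into the nested dictionary.
    value = document
    for key in jsonpath.split("."):
        value = value[key]
    return value

print(resolve(metrics, "aggregates.accuracy_score"))  # prints 0.5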

In our example, we wanted an average of the malignancy risk AUC and the nodule type accuracy, so we added the following snippet to Extra results columns to include the nodule type accuracy:

[
  {
    "path": "nodule_type_accuracy",
    "order": "desc",
    "title": "Nodule Type Accuracy"
  }
]
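One common way to combine several leaderboard columns is to rank the participants on each column separately and then average those ranks; whether grand-challenge.org averages ranks or raw values depends on the Scoring method choice you select. The sketch below only illustrates the rank-averaging idea with hypothetical submissions and is not the platform's implementation:

# Hypothetical leaderboard rows: (submission, malignancy_risk_auc, nodule_type_accuracy)
results = [
    ("team_a", 0.85, 0.75),
    ("team_b", 0.80, 0.70),
    ("team_c", 0.90, 0.60),
]

def ranks(values):
    # Rank 1 is the best value; both metrics use "desc" ordering.
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

auc_ranks = ranks([auc for _, auc, _ in results])
acc_ranks = ranks([acc for _, _, acc in results])

# Average the per-metric ranks; a lower mean rank is a better position.
mean_ranks = {
    name: (auc_rank + acc_rank) / 2
    for (name, _, _), auc_rank, acc_rank in zip(results, auc_ranks, acc_ranks)
}
for name in sorted(mean_ranks, key=mean_ranks.get):
    print(name, mean_ranks[name])

With these numbers team_a ends up first because it is ranked consistently well on both metrics, even though team_c has the best AUC.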


Additional settings

Under Leaderboard, the look and feel of the leaderboard can be configured. See the comments under each field for additional information.

Under Result Details, you can configure whether the /output/metrics.json from your evaluation container should be accessible to the participants. If you have more information in the metrics.json than you want to share with participants, you should make sure this option is turned off!