Automated Evaluation
Every challenge has a unique way of objectively evaluating incoming submissions. More often than not, the evaluation scripts come with a set of dependencies and computational environments that are difficult to replicate in the host server. Therefore, we have decided that every challenge organizer has to provide a Docker container that packages the evaluation scripts. This container will run on our servers to compute the evaluation scripts necessary for an incoming submission.
Building your evaluation container¶
To make the process easier, we created a challenge pack with an example evaluation (as well as an example algorithm and upload scripts for archives) for you. When your challenge gets accepted, our support team will create a customized version of this Github repo, tailored to the needs of your challenge. You can then use the provided challenge pack as a starting point for creating your custom evaluation container.
Configuring evaluation Settings¶
To configure the challenge leaderboard ranking, presentation of the leaderboard to participants, and submission details, navigate to Admin → [name of phase] Evaluation Settings:
Under Phase, the title of the selected phase can be set. This defines the leaderboard name presented to the participants:
Uploading your evaluation container¶
After building, testing, and exporting the Docker container with the tutorial above, you should have .tar.gz
file containing the evaluation Docker container. You can upload this to your challenge by navigating to Admin → Methods → + Add a new method:
Then, select the intended phase of the challenge, and after uploading the Docker container (.tar.gz
file), select Save
after the evaluation container has been uploaded:
Subsequently, Grand Challenge will verify your new evaluation Docker container. Once this is done and succeeded, you will see Ready: True
for your method:
Testing your new evaluation container¶
When uploading a new evaluation container it's possible to re-evaluate existing submissions to create new evaluation results. Re-evaluation is a manual step since re-evaluating all submissions can be quite expensive. Navigate to Admin → [name of phase] Submission & Evaluations:
Clicking on one of the submission ID links will bring up the submission detail page. On this page, shown below, you will find a button to re-evaluate the submission with the new (active) evaluation method. Doing this re-uses the existing algorithm submissions results: saving compute time and having to upload a new algorithm image every time the evaluation method changes.
Submission¶
Under the Submission tab, the submission mechanisms for the selected phase can be configured. You can:
- Indicate opening and closing dates for submissions
- Allow submissions from only verified participants
- Provide instructions to the participants on how to make submissions
- Limit the number of submissions and the time period within which a participant can make submissions
- Request that participants provide supplementary files when they make submissions (like an ArXiv link)
More instructions on how to configure this mechanism are available under **Admin → Evaluation Settings → Submission **
Scoring¶
Under the Scoring tab, the exact specification of the leaderboard ordering can be configured. This is where you connect the outputs of your evaluation container with the automated leaderboard mechanism present in grand-challenge.org.
Assuming that your evaluation container writes the following scores as the output in /output/metrics.json
:
{
"malignancy_risk_auc": 0.85,
"nodule_type_accuracy": 0.75
}
Then you can configure the Scoring mechanisms as shown in the figure below.
- Make sure to give an appropriate Score title - this will be displayed at the head of the leaderboard.
- Specify the JSON path of the main metric in Score jsonpath (in this case it is
malignancy_risk_auc
).
You can also configure a more complex scoring mechanism.
Take the following metrics.json
, which is a nested dictionary:
{
"case": {},
"aggregates": [
"dice": {
"mean": 0.6,
"std": 0.089
},
"accuracy": {
"mean": 0.5,
"std": 0.00235
}
]
}
To use both dice and accuracy metrics in the scoring mechanism, and display both scores with their error on the leaderboard you would need to:
- Enter
aggregates.dice.mean
in Score jsonpath. - Enter
aggregates.dice.std
in Score error jsonpath. - Use Extra results columns to add the accuracy score and error to the leaderboard
- Set Scoring method choice to determine the way the scoring is calculated.
Additional settings¶
Under Leaderboard, the look and feel of the leaderboard can be configured. See the comments under each field for additional information.
Under Result Details, you can configure whether the /output/metrics.json
from your evaluation container should be accessible to the participants. If you have more information in the metrics.json
than you want to share with participants, you should make sure this option is turned off!