Automated Evaluation
Every challenge has a unique way of objectively evaluating incoming submissions. More often than not, the evaluation scripts come with a set of dependencies and computational environments that are difficult to replicate on the host server. Therefore, we require every challenge organizer to provide a (Docker) container image that packages the evaluation scripts. This container runs on our servers to compute the evaluation results for each incoming submission.
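For illustration, the evaluation script packaged in such a container usually boils down to reading the submission, comparing it against the ground truth bundled into the image, and writing the scores to /output/metrics.json (the path used in the Scoring section below). The sketch that follows is a minimal, hypothetical example, assuming the usual /input and /output mount points; the file names, the ground-truth location, and the accuracy metric are placeholders, not part of the platform's API.

```python
import json
from pathlib import Path

# Hypothetical locations and file names; a real evaluation script reads
# whatever the submission and ground truth of your challenge actually contain.
PREDICTIONS = Path("/input/predictions.json")
GROUND_TRUTH = Path("/opt/evaluation/ground_truth.json")
OUTPUT = Path("/output/metrics.json")


def main():
    predictions = json.loads(PREDICTIONS.read_text())
    ground_truth = json.loads(GROUND_TRUTH.read_text())

    # Placeholder metric: fraction of cases whose predicted label
    # matches the reference label.
    correct = sum(
        predictions.get(case_id) == label
        for case_id, label in ground_truth.items()
    )
    metrics = {"accuracy": correct / len(ground_truth)}

    # Grand Challenge picks up the scores from /output/metrics.json.
    OUTPUT.write_text(json.dumps(metrics))


if __name__ == "__main__":
    main()
```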
Building your evaluation container
To make the process easier, we created a challenge pack for you with an example evaluation (as well as an example algorithm and upload scripts for archives). When your challenge is accepted, our support team will create a customized version of this GitHub repository, tailored to the needs of your challenge. You can then use the provided challenge pack as a starting point for your custom evaluation container.
Configuring evaluation settings
To configure the leaderboard and submission details, navigate to Admin → [name of phase] → Settings:
Under Settings and Phase, the title of the selected phase can be set. This defines the leaderboard name presented to the participants:
Uploading your evaluation container
After building, testing, and exporting the Docker container using the scripts provided in the challenge pack (see above), you should have a .tar.gz file containing the evaluation Docker image. You can upload it to your challenge by navigating to Admin → [Phase name] → Evaluation Methods and clicking Add a new method:
Subsequently, Grand Challenge will verify your new evaluation Docker image. Once verification has succeeded, you will see Active Method for this Phase, indicating that this is now the active evaluation method for this phase.
Testing your new evaluation container
When uploading a new evaluation container, it is possible to re-evaluate existing submissions to create new evaluation results. Re-evaluation is a manual step, since re-evaluating all submissions can be quite expensive. Navigate to Admin → [name of phase] → Submission & Evaluations:
Clicking on one of the submission ID links will bring up the submission detail page. On this page, shown below, you will find a button to re-evaluate the submission with the new (active) evaluation method. This re-uses the existing algorithm results for the submission, saving compute time and avoiding the need to upload a new algorithm image every time the evaluation method changes.
Submission
Under the Submission tab, the submission mechanism for the selected phase can be configured. You can:
- Indicate opening and closing dates for submissions
- Allow submissions from only verified participants
- Provide instructions to the participants on how to make submissions
- Limit the number of submissions and the time period within which a participant can make submissions
- Request that participants provide supplementary files when they make submissions (like an arXiv link)
More instructions on how to configure this mechanism are available under Admin → [name of phase] → Settings → Submission.
Scoring
Under the Scoring tab, you can configure the leaderboard. This is where you connect the outputs of your evaluation container with the automated leaderboard mechanism present in grand-challenge.org.
Assuming that your evaluation container writes the following scores to /output/metrics.json:
{
    "malignancy_risk_auc": 0.85,
    "nodule_type_accuracy": 0.75
}
You can configure the Scoring mechanisms as shown in the figure below.
- Make sure to give an appropriate Score title - this will be displayed as the column name in the leaderboard.
- Specify the JSON path of the main metric in Score jsonpath (in this case, malignancy_risk_auc).
You can also configure a more complex scoring mechanism.
Take the following metrics.json, which is a nested dictionary:
{
    "case": {},
    "aggregates": {
        "dice": {
            "mean": 0.6,
            "std": 0.089
        },
        "accuracy": {
            "mean": 0.5,
            "std": 0.00235
        }
    }
}
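A metrics file with this shape is typically produced by aggregating per-case results inside the evaluation script. The sketch below shows one hypothetical way to do that; the per-case values and the decision to leave "case" empty are placeholders used only to illustrate the structure.

```python
import json
import statistics

# Hypothetical per-case results computed earlier by the evaluation script.
per_case = {
    "dice": [0.52, 0.61, 0.67],
    "accuracy": [0.50, 0.50, 0.50],
}

metrics = {
    "case": {},  # optionally filled with the individual per-case scores
    "aggregates": {
        name: {
            "mean": statistics.mean(values),
            "std": statistics.stdev(values),
        }
        for name, values in per_case.items()
    },
}

with open("/output/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```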
To use both the dice and accuracy metrics in the scoring mechanism, and to display both scores with their error on the leaderboard, you would need to:
- Enter aggregates.dice.mean in Score jsonpath (how such dotted paths are resolved is sketched below this list).
- Enter aggregates.dice.std in Score error jsonpath.
- Use Extra results columns to add the accuracy score and error to the leaderboard.
- Set Scoring method choice to determine how the score is calculated.
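The Score jsonpath and Score error jsonpath values are dotted paths into the nested metrics dictionary, resolved by walking it key by key. The snippet below is a rough illustration of that idea, not the actual implementation used by grand-challenge.org.

```python
def resolve_jsonpath(metrics: dict, path: str):
    """Follow a dotted path such as 'aggregates.dice.mean' through nested dicts."""
    value = metrics
    for key in path.split("."):
        value = value[key]
    return value


# Using the nested metrics.json shown above:
metrics = {
    "case": {},
    "aggregates": {
        "dice": {"mean": 0.6, "std": 0.089},
        "accuracy": {"mean": 0.5, "std": 0.00235},
    },
}

assert resolve_jsonpath(metrics, "aggregates.dice.mean") == 0.6
assert resolve_jsonpath(metrics, "aggregates.accuracy.std") == 0.00235
```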
Additional settings
Under Leaderboard, the look and feel of the leaderboard can be configured. See the comments under each field for additional information.
Under Result Details, you can configure whether the complete /output/metrics.json from your evaluation container should be displayed on the results detail page, or whether only the metrics used for ranking should be displayed in a nicer-looking list.