Automated Evaluation
Every challenge has a unique way of objectively evaluating incoming submissions. More often than not, the evaluation scripts come with a set of dependencies and computational environments that are difficult to replicate on the host server. Therefore, we have decided that every challenge organizer has to provide a Docker container that packages the evaluation scripts. This container runs on our servers to execute the evaluation scripts for every incoming submission.
Building your evaluation container¶
To make the process easier, we created evalutils. Evalutils helps challenge administrators easily create evaluation containers for grand-challenge.org. It helps you set up a project structure, load and validate incoming submissions, and package your evaluation scripts in a Docker container that meets the requirements of grand-challenge.org.
Note that you do not have to use evalutils: any Docker container that correctly generates /output/metrics.json will do.
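For illustration only, here is a minimal sketch of such an evaluation script, assuming your metrics can be computed in plain Python; the metric name and value are placeholders:

import json


def main():
    # Placeholder metric computation -- replace with your own logic that
    # compares the submission under /input against your ground truth.
    metrics = {"accuracy": 0.9}

    # Grand Challenge reads the scores from /output/metrics.json.
    with open("/output/metrics.json", "w") as f:
        json.dump(metrics, f)


if __name__ == "__main__":
    main()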
Requirements
You can use your favorite Python environment to install evalutils.
$ pip install evalutils
Once you've installed the above requirements, you can follow the instructions for getting started and building your evaluation container here, or watch the videos below to see James take you through an example.
Evaluation containers for algorithm submissions¶
Evaluation containers for leaderboards that rank algorithms are similar to the containers that rank prediction files, but there is one important difference. Since the platform automatically runs submitted algorithms on a private test set, it assigns random but unique filenames to the outputs of the algorithms. However, the platform also supplies a JSON file that tells you how to map these random output filenames to the original input filenames, and where to read the output files from.
You as a challenge organizer must, therefore, read /input/predictions.json to map the output filenames to the input filenames. This is necessary to evaluate the predictions correctly. Here's an example of how that can be done. In this example, we define a load_predictions_json function which loads the JSON, loops through the inputs and outputs, and finds the exact filenames for the outputs.
import json
from pathlib import Path


def load_predictions_json(fname: Path):
    cases = {}

    with open(fname, "r") as f:
        entries = json.load(f)

    if isinstance(entries, float):
        raise TypeError(f"entries of type float for file: {fname}")

    for e in entries:
        # Find case name through input file name
        inputs = e["inputs"]
        name = None
        for input in inputs:
            if input["interface"]["slug"] == "generic-medical-image":
                name = str(input["image"]["name"])
                break  # expecting only a single input
        if name is None:
            raise ValueError(f"No filename found for entry: {e}")

        # Find output value for this case
        outputs = e["outputs"]
        for output in outputs:
            if output["interface"]["slug"] == "generic-medical-image":
                pk = output["image"]["pk"]
                if ".mha" not in pk:
                    pk += ".mha"
                cases[pk] = name

    return cases
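For reference, the snippet below shows roughly the shape of entry this function expects; the pks and filename are made up for illustration, and the real predictions.json is generated by the platform:

example_entries = [
    {
        "pk": "11111111-aaaa-bbbb-cccc-222222222222",  # illustrative job pk
        "inputs": [
            {
                "interface": {"slug": "generic-medical-image"},
                "image": {"name": "case_001.mha"},
            }
        ],
        "outputs": [
            {
                "interface": {"slug": "generic-medical-image"},
                "image": {"pk": "33333333-dddd-eeee-ffff-444444444444"},
            }
        ],
    }
]

# For such an entry, load_predictions_json(Path("/input/predictions.json"))
# would return {"33333333-dddd-eeee-ffff-444444444444.mha": "case_001.mha"}.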
We then use the mapping_dict to map the outputs to the actual filenames when computing the metrics in the evaluation script. This is done by updating self._predictions_cases["ground_truth_path"] with the contents of mapping_dict:
# Map each prediction (named by its random pk) back to the ground-truth
# file of the original input case.
self.mapping_dict = load_predictions_json(Path("/input/predictions.json"))
self._predictions_cases["ground_truth_path"] = [
    self._ground_truth_path / self.mapping_dict[Path(path).name]
    for path in self._predictions_cases.path
]
In addition to the predictions.json file, the evaluation container also has access to the algorithm's outputs. You may want to read these output files directly, for example if your algorithm produces large JSON files, heatmaps, or segmentation outputs. The outputs of the algorithm containers are provided to the evaluation container at the following path:
/input/<job_pk>/output/<interface_relative_path>
where:
- job_pk corresponds to the primary key (pk) of each algorithm job, i.e., the top-level "pk" entry for each JSON object in the predictions.json file.
- interface_relative_path corresponds to the relative_path of each output of a job (for the first output of the first algorithm job this is [0]["outputs"][0]["interface"]["relative_path"]; if your algorithm produces more than one output, you need to loop over the outputs to get their respective relative paths). A short sketch of assembling these paths follows below.
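Here is a sketch of how these output paths could be assembled from predictions.json; it assumes the layout described above and simply prints each path:

import json
from pathlib import Path

with open("/input/predictions.json", "r") as f:
    jobs = json.load(f)

for job in jobs:
    job_pk = job["pk"]  # top-level pk of this algorithm job
    for output in job["outputs"]:
        relative_path = output["interface"]["relative_path"]
        # Each output of this job is mounted under /input/<job_pk>/output/
        output_file = Path("/input") / job_pk / "output" / relative_path
        print(output_file)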
Configuring Evaluation Settings¶
To configure the challenge leaderboard ranking, presentation of the leaderboard to participants, and submission details, navigate to Admin → [name of phase] Evaluation Settings:
Under Phase, the title of the selected phase can be set. This defines the leaderboard name presented to the participants:
Uploading your evaluation container¶
After building, testing, and exporting the Docker container with the tutorial above, you should have a .tar.gz file containing the evaluation Docker container. You can upload this to your challenge by navigating to Admin → Methods → + Add a new method:
Then, select the intended phase of the challenge, upload the Docker container (.tar.gz file), and select Save once the upload has completed:
Subsequently, Grand Challenge will verify your new evaluation Docker container. Once verification has succeeded, you will see Ready: True for your method:
Submission¶
Under the Submission tab, the submission mechanisms for the selected phase can be configured. You can:
- Indicate opening and closing dates for submissions
- Allow submissions from only verified participants
- Provide instructions to the participants on how to make submissions
- Limit the number of submissions and the time period within which a participant can make submissions
- Request that participants provide supplementary files when they make submissions (like an arXiv link)
More instructions on how to configure this mechanism are available under **Admin → Evaluation Settings → Submission**.
Scoring¶
Under the Scoring tab, the exact specification of the leaderboard ordering can be configured. This is where you connect the outputs of your evaluation container with the automated leaderboard mechanism present in grand-challenge.org.
Assuming that your evaluation container writes the following scores as the output in /output/metrics.json:
{
    "malignancy_risk_auc": 0.85,
    "nodule_type_accuracy": 0.75
}
Then you can configure the Scoring mechanisms as shown in the figure below.
- Make sure to give an appropriate Score title - this will be displayed at the head of the leaderboard.
- Specify the JSON path of the main metric in Score jsonpath (in this case it is malignancy_risk_auc).
You can also configure a more complex scoring mechanism. Take the following metrics.json, which is a nested dictionary:
{
    "case": {},
    "aggregates": {
        "dice": {
            "mean": 0.6,
            "std": 0.089
        },
        "accuracy": {
            "mean": 0.5,
            "std": 0.00235
        }
    }
}
To use both the dice and accuracy metrics in the scoring mechanism, and to display both scores with their errors on the leaderboard, you would need to do the following (a sketch of producing this nested file follows the list):
- Enter aggregates.dice.mean in Score jsonpath.
- Enter aggregates.dice.std in Score error jsonpath.
- Use Extra results columns to add the accuracy score and error to the leaderboard.
- Set Scoring method choice to determine the way the scoring is calculated.
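For completeness, here is a sketch of how an evaluation script might assemble and write such a nested metrics.json; the per-case scores and the use of a population standard deviation are illustrative choices, not requirements:

import json

# Illustrative per-case scores; in practice these come from your evaluation.
dice_scores = [0.55, 0.62, 0.63]
accuracy_scores = [0.48, 0.50, 0.52]


def mean(values):
    return sum(values) / len(values)


def std(values):
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5


metrics = {
    "case": {},  # per-case results can be stored here as well
    "aggregates": {
        "dice": {"mean": mean(dice_scores), "std": std(dice_scores)},
        "accuracy": {"mean": mean(accuracy_scores), "std": std(accuracy_scores)},
    },
}

with open("/output/metrics.json", "w") as f:
    json.dump(metrics, f)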
Additional settings¶
Under Leaderboard, the look and feel of the leaderboard can be configured. See the comments under each field for additional information.
Under Result Details, you can configure whether the /output/metrics.json from your evaluation container should be accessible to the participants. If you have more information in the metrics.json than you want to share with participants, you should make sure this option is turned off!