Bram van Ginneken, originally published June 2015, updated August 2020
It all started with an e-mail:
from Tobias Heimann
to Bram van Ginneken
date Thu, Dec 21, 2006 at 10:33 PM
subject Idea for workshop about clinical segmentation
Just before going into Christmas holidays, I'd like to tell you of an idea for an alternative workshop: I remember a comment of you at MICCAI where you said (from my vague memories :-)) that many presentations there were quite theoretical and not really suited for clinical application. Well, my boss (Pitt Meinzer) had the same impression and he told Gabor Szekely (from ETH Zurich) about it, who also agreed. Some weeks later Gabor came up with some ("crazy" as he termed it) ideas what a really nice workshop should look like and my boss delegated it to me to organize something in that direction...
So my current plan is to organize a MICCAI workshop about application-oriented 3D segmentation. There would be one organ of interest (heart, lungs, kidneys, liver) for which some datasets with reference segmentations are uploaded on a website. People can tune their algorithms (automatic or interactive) for these images and write papers about the methods they use to solve the problem. Maybe they could also include their results on the reference data. At the workshop there won't be any talks, just posters. Maybe there's one invited talk about the medical relevance of the problem, if not it would just start with poster teasers. Then some time to look at the posters. The highlight would be a live evaluation of the submitted algorithms on new data, automatic comparison with the new reference and the resulting performance charts :-) Then lunch break.
After lunch, there are a number of (let's say 3) groups where people get assigned to. They discuss the 3 best-working algorithms using the papers the authors submitted. Then everybody comes together again and the groups present the papers they have been working on. Note that the actual authors won't be in the group treating their own papers, so they can kick in during the following discussion to defend their methods. Basically, the rest of the day would be for discussion and exchange of new ideas...
So, do you think that format could work? How many and who might be interested to submit something for a workshop like that? And what would be a good organ of interest to start with? Please let me know what you think of this whole thing, I'm open for all suggestions :-)
The result, a couple of hundred e-mails later, was SLIVER07 and CAUSE07, and the first workshop at MICCAI devoted to a direct comparison between medical image segmentation algorithms, with live processing of a set of scans that were released when the workshop started. It was a huge success: the room was packed, laptops crashed (but not too often), and prizes were awarded.
Tobias Heimann compiled the results of the liver segmentation challenge into an excellent overview paper that appeared in 2009 and quickly garnered hundreds of citations.
Thirteen years later, SLIVER07 is still running on this site, and still receiving submissions even though a more recent challenge for liver segmentation in CT was proposed in 2017. Challenge workshops became a tradition at MICCAI and other medical image analysis conferences. This website lists almost 300 such initiatives.
Why organize challenges?
Probably the majority of papers in medical and biomedical image analysis describe an algorithm or solution that addresses a particular task. In many cases, maybe even in most cases, the same or a very similar task has already been addressed in prior papers. If these two conditions are met, 1) a task is defined and 2) several solutions have been proposed, there is in principle a need for a public challenge on that task, where both data and evaluation procedures are common to all participants.
However, for most image analysis problems there is no public database and reference standard available, or the public database is infrequently used for a variety of reasons. Results are therefore almost always reported on proprietary datasets which vary enormously in many respects, e.g. size, difficulty, and scan properties. Different authors describe different methods of obtaining reference standards, further compounding the difficulties in comparing results. Even if researchers do use public data in their experiments (as is quite common in e.g. the computer vision research community and increasingly common in medical and biomedical image analysis), they often select particular scans from the database and omit others, and almost always use a variety of different methods of evaluation, making their final results, again, not directly comparable.
In an effort to compare results in a fair manner, it is common practice in the literature to attempt to implement or run several algorithms, applying them all to a single dataset and evaluating their results in the same way. However, the difficulty with this approach is that individual methods may be incorrectly implemented or poorly optimized for the task at hand. Many image analysis algorithms are relatively complex and are best understood, implemented, and parameterized by their authors.
The anatomy of a challenge
The theory of the public challenge is straightforward:
- A task is defined (what is the desired output for a given input).
- A set of test images is provided (the input). Sometimes training data (pairs of input images and the desired output) are provided. Sometimes challenge participants are restricted to only use these training data to develop or fine-tune their algorithms.
- An evaluation procedure is clearly defined: given the output of an algorithm on the test images, one or more metrics are computed that quantify algorithm performance. Usually, a reference output is used in this process, but it could also be a visual evaluation of the results by human experts.
- Participants apply their algorithm to all data in the public test dataset provided.
- The responsibility to ensure that the given scans are representative of the type of data generally encountered in research and clinical practice lies with the challenge organizers.
- The reference standard is defined using methodology clearly described to the participants but is not made publicly available in order to ensure that algorithm results are submitted to the organizers for publication rather than retained privately.
- Evaluation is carried out by the challenge organizers.
- The image analysis algorithms are typically run and optimized by their authors or by researchers very familiar with their operation, meaning that they obtain the best possible results for the given dataset.
- A variation on this theme would be to ask the participants to provide their algorithm, for example in the form of binary executables or a virtual machine, and the organizers run the algorithm on the test data. In this way, the test data is kept completely secret.
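The evaluation step in this scheme can be illustrated with a minimal sketch. It assumes a segmentation task scored with the Dice overlap, which is only one of many possible metrics and is not taken from any particular challenge. The function names (`dice`, `evaluate`, `leaderboard`), the case and team identifiers, and the toy masks are all hypothetical.

```python
from statistics import mean


def dice(pred, ref):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * overlap / total


def evaluate(submission, reference):
    """Mean Dice over all test cases; a submission must cover every case."""
    missing = set(reference) - set(submission)
    if missing:
        raise ValueError(f"submission lacks cases: {sorted(missing)}")
    return mean(dice(submission[case], reference[case]) for case in reference)


def leaderboard(submissions, reference):
    """Score every team and sort from best to worst."""
    scores = {team: evaluate(masks, reference) for team, masks in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hidden reference standard, held only by the organizers (toy data).
reference = {"case01": [0, 1, 1, 1], "case02": [1, 1, 0, 0]}

# Results uploaded by participants for the public test images (toy data).
submissions = {
    "team_a": {"case01": [0, 1, 1, 0], "case02": [1, 1, 0, 0]},
    "team_b": {"case01": [1, 1, 0, 0], "case02": [0, 1, 1, 0]},
}

for team, score in leaderboard(submissions, reference):
    print(f"{team}: {score:.3f}")
```

Rejecting incomplete submissions mirrors the requirement that participants process all test data, so that nobody can improve a score by leaving out difficult scans.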
From the description above, it should be clear that organizing a challenge is a lot of work, and that much of the responsibility for the success of a challenge lies with the organizers. Why would someone spend all this time on such an effort? And why would someone participate? After all, the prospective participant may have already published their algorithm and moved on to the next scientific project. I have often heard potential organizers and participants say that they 'had no funding to do this' and complain that it would not lead to new publications. This notion is wrong. The results of the first years of challenges in medical image analysis have been published in the top journals in our field, and these papers have attracted a lot of citations. Here is an overview (the list of papers is far from complete) that I compiled in August 2020, using Publish or Perish with Google Scholar citations. The listed rank is based on all publications in that journal in the year of publication.
- PROMISE12, Medical Image Analysis 2014, 279 citations, rank 1/137
- Particle Tracking Challenge 2012, Nature Methods 2014, 622 citations, rank 12/386
- The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS), IEEE Transactions on Medical Imaging 2014, 1795 citations, rank 1/252
- EXACT09, IEEE Transactions on Medical Imaging 2012, 188 citations, rank 7/195
- EMPIRE10, IEEE Transactions on Medical Imaging 2011, 391 citations, rank 3/185
- The Carotid Bifurcation Algorithm Evaluation Framework, Medical Image Analysis 2011, 82 citations, rank 23/82
- ANODE09, Medical Image Analysis 2010, 229 citations, rank 4/92
- ROC, 2010, IEEE Transactions on Medical Imaging 2009, 447 citations, rank 5/182
- The Rotterdam Coronary Artery Algorithm Evaluation Framework, Medical Image Analysis 2009, 310 citations, rank 5/105
- SLIVER07, IEEE Transactions on Medical Imaging 2009, 816 citations, rank 2/182
These overview papers act as topical reviews where the state of the art in a particular field is summarized, backed up with experimental proof from the challenge results.
Challenges for challenges
For researchers interested in organizing a public challenge in their own field there are a number of important considerations. The list below is partly taken and adapted from the general discussion in the PhD thesis of Keelin Murphy, organizer of the EMPIRE10 challenge.
- Firstly, select an appropriate task. A problem that only very few groups are addressing may not (yet) be a good idea for a competition, as there are simply not enough potential participants (yet). A problem that is more or less solved is also not a good idea.
- It is important to be able to gather enough data with enough variability to represent the diverse types of problems typically encountered by researchers in the field. A challenge is as good as its data.
- The methods of defining the reference standard and of evaluating algorithm results must be clearly defined and generally agreeable to the academic community. The challenge is unlikely to attract interest from serious contenders if these methods are poorly considered or open to question.
- The details of the challenge should be well publicized and advertised in the relevant circles in order to attract a reasonable number of participants, without which the final results will be of less interest. Collecting all prior work relevant to the task at hand and personally inviting the authors of this work has been shown to be a good procedure.
- In order to achieve these goals it is generally recommended that the challenge organizers include a number of people from different backgrounds with experience in the field. These might include researchers from a number of different academic institutes, as well as from industry, who have worked on a variety of projects related to the topic of interest. This will ensure access to a larger pool of data and contacts, as well as a balanced set of opinions on how to define the reference standards and evaluate algorithm performance. Consider placing an open call for organizers, or for data.
- A good challenge should not end with a workshop at a conference. Make sure you include a high-quality website with your challenge, set up in such a way that for many years to come, new submissions can be processed quickly and efficiently.
- Plan ahead: preparing or modifying the data, metrics, or rules while the challenge is already running is annoying for participants.
While the many advantages of the public challenge are clear, one issue with this approach is that commercial vendors are frequently reluctant to enter their algorithms into a public challenge for fear of exposing their weak points. The public challenge generally requires that the owners of the software are identified and even that details of the algorithm are described. In order to attract more participants from industry in the future, it may be necessary to allow them to remain anonymous and to conceal the details of their system if they so wish. While this compromises the information available to the interested public, it also enriches the results by ensuring that important competitors in the field are not excluded. A rule might be considered, for example, which states that the three highest-ranking algorithms must be identified, which would give the public the most important information about the state of the art while protecting the other participants from potential embarrassment due to poor results.
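As a rough illustration of such a rule, the sketch below builds a public ranking in which the top three entries are always named and lower-ranked entries are anonymized only on request. The function `public_ranking`, the participant names, and the scores are hypothetical, and ties are not handled.

```python
def public_ranking(results, named_top=3):
    """Build the published table from (participant, score, wants_anonymity) tuples.

    Higher scores are better. The first `named_top` entries are always
    identified; below that, a participant is hidden only if they asked to be.
    """
    ordered = sorted(results, key=lambda r: r[1], reverse=True)
    table = []
    for rank, (name, score, wants_anonymity) in enumerate(ordered, start=1):
        hide = wants_anonymity and rank > named_top
        table.append((rank, "anonymous" if hide else name, score))
    return table


# Toy results: two vendors asked for anonymity, three labs did not.
results = [
    ("vendor_x", 0.71, True),
    ("lab_a", 0.93, False),
    ("vendor_y", 0.90, True),
    ("lab_b", 0.88, False),
    ("lab_c", 0.80, False),
]

for rank, name, score in public_ranking(results):
    print(rank, name, score)
```

In this toy example `vendor_y` finishes second and is therefore named despite its request, while `vendor_x` finishes fifth and appears as anonymous.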