Hi,
Thanks for your question! You are right: we weight AUC more heavily because it is the most clinically relevant metric.
In a clinical setting, we would like an AI system that can identify a CXR image with a potential nodule, so that the patient can get a CT scan for further investigation. The ability to classify an image as nodule or no-nodule is therefore the most important feature, which is why the AUC score is weighted higher. Localization is of course also important for building an explainable system, but very precise localization is not the top priority.
Regarding the evaluation metrics, we do not release our evaluation code, since it contains the ground-truth data and labels for the test set. However, our AUC implementation simply uses sklearn, and our FROC implementation is based on the Camelyon challenge evaluation code.
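For orientation, here is a minimal sketch of the two metrics on made-up toy data. It is not the challenge's evaluation code. The AUC part assumes `sklearn.metrics.roc_auc_score` is the sklearn call in question. The FROC part is a simplified stand-in and not the Camelyon code: the helper `froc_points` and its arguments are hypothetical, and it assumes each detection has already been marked as a hit or a miss, with every hit matching a distinct nodule.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Image-level AUC on hypothetical toy data.
# 1 = image contains a nodule, 0 = no nodule.
y_true = np.array([0, 0, 1, 1])
# Hypothetical per-image nodule likelihoods predicted by a model.
y_score = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc_score(y_true, y_score)


def froc_points(scores, is_tp, n_lesions, n_images):
    """Simplified FROC curve (hypothetical helper, not the Camelyon code).

    scores:    confidence of each detection, pooled over all images
    is_tp:     True if the detection hits a nodule, False otherwise
               (assumes every hit corresponds to a distinct nodule)
    n_lesions: total number of ground-truth nodules
    n_images:  total number of images
    Returns (average false positives per image, sensitivity), one pair
    per detection, sweeping the threshold from high to low confidence.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_tp, dtype=bool)[order]
    tp_cum = np.cumsum(hits)    # true positives kept at each threshold
    fp_cum = np.cumsum(~hits)   # false positives kept at each threshold
    return fp_cum / n_images, tp_cum / n_lesions


# Hypothetical detections: four boxes over two images holding two nodules.
fps_per_image, sensitivity = froc_points(
    scores=[0.9, 0.8, 0.6, 0.3],
    is_tp=[True, False, True, False],
    n_lesions=2,
    n_images=2,
)
print(f"AUC: {auc:.2f}")
print("FPs per image:", fps_per_image)
print("Sensitivity:  ", sensitivity)
```

How detections are matched to nodules and how the curve is summarized into a single score are left out here, since those details are not described in this post.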
Hope this clarifies things a bit. Please let us know if you have any other questions!
Best, Ecem