Hi, interesting question!
To give you a bit of context: when we framed the challenge, we focused on (vision) foundation models that output a 1D embedding per image, which is the case for the vast majority of models. This also lets us control the total size of the file containing the features. If your model instead produces 2D embeddings, you can consider the following workaround:
If the output height and width are consistent across all image inputs, you can flatten the 2D tensor into 1D before saving, and simply reshape it back to the original 2D shape in your adaptor after loading the (flattened) features from disk.
It would look like this:
import torch.nn as nn

foundation_model_output_flattened = ...  # 1D tensor of size embedding_size * (H//4) * (W//4), loaded from disk
# Recover the original 2D spatial layout, then map embedding channels to per-pixel class scores with a 1x1 conv
foundation_model_output_2d = foundation_model_output_flattened.reshape(embedding_size, H // 4, W // 4)
predictions = nn.Conv2d(embedding_size, NUMBER_CLASSES, kernel_size=1, padding=0)(foundation_model_output_2d)
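For completeness, the saving side is just the reverse: flatten the spatial grid into 1D before writing the features to disk. Here is a minimal sketch, assuming the embedding is a PyTorch tensor and that you store features with torch.save; the variable names and the output path are placeholders, and the actual serialization format depends on your pipeline.

import torch

# Hypothetical 2D embedding produced by the foundation model: (embedding_size, H//4, W//4)
embedding_2d = torch.randn(embedding_size, H // 4, W // 4)

# Flatten to 1D so it matches the expected per-image feature format
embedding_flat = embedding_2d.reshape(-1)

# Save to disk; swap in whatever serialization your pipeline actually uses
torch.save(embedding_flat, "features/image_0001.pt")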
We strongly encourage participants to implement their own adaptor logic! Instructions for doing so are available in the evaluation toolkit repository: the README.md explains how to submit a PR with your custom adaptor.
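To make that concrete, below is a small sketch of what such adaptor logic could look like as a PyTorch module wrapping the reshape + 1x1 convolution from the snippet above. This is purely illustrative: the class name, constructor arguments, and the way an adaptor plugs into the toolkit are assumptions on our side, so please follow the README.md for the actual interface expected in a PR.

import torch.nn as nn

class Conv2DAdaptor(nn.Module):
    # Illustrative adaptor: reshapes flattened features and applies a 1x1 convolution head.
    def __init__(self, embedding_size, num_classes, grid_h, grid_w):
        super().__init__()
        self.embedding_size = embedding_size
        self.grid_h = grid_h  # e.g. H // 4
        self.grid_w = grid_w  # e.g. W // 4
        self.head = nn.Conv2d(embedding_size, num_classes, kernel_size=1, padding=0)

    def forward(self, flat_features):
        # flat_features: (batch, embedding_size * grid_h * grid_w), as loaded from disk
        x = flat_features.reshape(-1, self.embedding_size, self.grid_h, self.grid_w)
        return self.head(x)  # (batch, num_classes, grid_h, grid_w)

An instance of such a module would then be trained on the downstream task like any other torch module.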
Alternatively, we could consider extending the accepted feature format to support 2D grids natively. However, this would require internal discussion and may not be feasible in the short term. For now, we recommend using the flattening approach, which should be sufficient in most cases.
Let us know how this works out for you!