Restormer3D
About
Summary
A self-supervised 3D denoiser for two-photon calcium imaging stacks, built on Restormer3D, a 3D extension of the Restormer transformer architecture (Zamir et al., CVPR 2022) adapted to volumetric data. The model is trained with a Noise2Void blind-spot objective (Krull et al., CVPR 2019), preceded by a temporal-median warmup stage that provides a structural prior. It is pretrained on a multi-stack training set and briefly fine-tuned on each input stack at inference time. Submitted to the AI4Life Calcium Imaging Denoising Challenge (CIDC25).
Mechanism
Target use: denoising 3D fluorescence microscopy stacks of shape [T, H, W] (frames × height × width) acquired by two-photon calcium imaging of neural activity.
Architecture: Restormer3D, a U-Net-shaped encoder/decoder with four levels. Each transformer block combines Multi-Dconv Head Transposed Attention (MDTA), which is channel-wise attention with depth-wise 3D convolutions applied to the queries, keys, and values, with a Gated-Dconv Feed-Forward Network (GDFN). Configuration: dim=32, num_blocks=(2,2,2,3), num_refinement_blocks=2, heads=(1,2,4,8), ffn_expansion_factor=2.0, bias-free convolutions throughout. Approximately 3.6M parameters, trained in full fp32.
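To make the "transposed" (channel-wise) attention in MDTA concrete, here is a minimal NumPy sketch of the attention step alone. It assumes the queries, keys, and values have already been produced by the pointwise and depth-wise 3D convolutions, which are omitted. The function name `channel_attention`, the scalar `temperature` (a learned per-head parameter in Restormer), and the tensor sizes are illustrative assumptions, not the submitted implementation.

```python
import numpy as np

def channel_attention(q, k, v, heads, temperature=1.0):
    """Channel-wise attention over [C, T, H, W] feature volumes (sketch only).

    The attention map has shape [heads, C/heads, C/heads], so its size does
    not depend on the number of voxels T*H*W.
    """
    C = q.shape[0]
    spatial = q.shape[1:]
    assert C % heads == 0, "channels must divide evenly across heads"

    def split(x):
        # [C, T, H, W] -> [heads, C/heads, T*H*W]
        return x.reshape(heads, C // heads, -1)

    q, k, v = split(q), split(k), split(v)
    # L2-normalize queries and keys along the voxel axis, as in Restormer
    q = q / np.maximum(np.linalg.norm(q, axis=-1, keepdims=True), 1e-12)
    k = k / np.maximum(np.linalg.norm(k, axis=-1, keepdims=True), 1e-12)

    logits = (q @ k.transpose(0, 2, 1)) * temperature  # [heads, c, c]
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    out = attn @ v  # [heads, c, T*H*W]
    return out.reshape(C, *spatial), attn

# Illustrative sizes only: 64 channels, 2 heads, an 8x16x16 feature volume
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 8, 16, 16)) for _ in range(3))
out, attn = channel_attention(q, k, v, heads=2)
```

Because attention is computed between channels rather than between voxels, the cost of the attention map grows with the square of the per-head channel count and only linearly with the volume size, which is what makes this block practical for 3D inputs.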
Training: two-stage self-supervised. Stage 0 (warmup) regresses each random 3D patch against the per-stack temporal median (a 2D image computed across all frames) using L1 loss. Stage 1 applies 3D Noise2Void blind-spot masking: randomly selected voxels are replaced with the values of neighbors within a small radius, and the model learns to predict the original values at the masked positions from spatial and temporal context. No clean ground truth is required at any stage.
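The blind-spot masking in Stage 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the masking fraction (1%), the radius (2), the out-of-bounds handling, and the L1 form of the masked loss are not given in the description above and are chosen here only for the example. The names `n2v_mask` and `masked_l1` are hypothetical.

```python
import numpy as np

def n2v_mask(patch, mask_frac=0.01, radius=2, rng=None):
    """Replace a random subset of voxels in a [T, H, W] patch with neighbor values.

    Returns the masked patch and a boolean mask of the replaced positions.
    mask_frac and radius are assumed values, not taken from the source.
    """
    rng = np.random.default_rng() if rng is None else rng
    assert min(patch.shape) >= 2 * radius, "patch too small for this radius"
    n_mask = max(1, int(mask_frac * patch.size))

    # Choose distinct voxels to mask
    flat = rng.choice(patch.size, size=n_mask, replace=False)
    idx = np.unravel_index(flat, patch.shape)

    # Random offsets in [-radius, radius]; force at least one non-zero component
    offsets = rng.integers(-radius, radius + 1, size=(n_mask, 3))
    offsets[~offsets.any(axis=1), 2] = 1

    neighbor = []
    for axis, size in enumerate(patch.shape):
        cand = idx[axis] + offsets[:, axis]
        out_of_bounds = (cand < 0) | (cand >= size)
        # Step in the opposite direction instead of leaving the patch
        neighbor.append(np.where(out_of_bounds, idx[axis] - offsets[:, axis], cand))

    masked = patch.copy()
    masked[idx] = patch[tuple(neighbor)]
    mask = np.zeros(patch.shape, dtype=bool)
    mask[idx] = True
    return masked, mask

def masked_l1(pred, target, mask):
    """Loss evaluated only at masked voxels (L1 is an assumption here)."""
    return float(np.abs(pred - target)[mask].mean())

rng = np.random.default_rng(0)
patch = rng.normal(size=(8, 32, 32))
masked, mask = n2v_mask(patch, rng=rng)
```

During training the network would receive `masked` as input and be scored against `patch` only where `mask` is true, so it cannot learn the identity mapping for the voxel it is asked to predict.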
Preprocessing: robust per-stack normalization at the 0.5th and 99.5th intensity percentiles. The temporal median is computed once per stack and used only for warmup.
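A minimal NumPy sketch of this preprocessing, assuming the percentiles are computed over the whole stack and mapped linearly to 0 and 1. Whether values outside that range are clipped is not stated above, so this sketch leaves them unclipped. The function names are illustrative.

```python
import numpy as np

def normalize_stack(stack, lo_pct=0.5, hi_pct=99.5, eps=1e-8):
    """Map the lo/hi intensity percentiles of a [T, H, W] stack to 0 and 1."""
    stack = stack.astype(np.float32)
    lo, hi = (float(p) for p in np.percentile(stack, [lo_pct, hi_pct]))
    normed = (stack - lo) / max(hi - lo, eps)
    # Keep (lo, hi) so the denoised output can be mapped back to raw intensities
    return normed, (lo, hi)

def temporal_median(stack):
    """Per-pixel median across frames: [T, H, W] -> [H, W]."""
    return np.median(stack, axis=0)

# Synthetic stand-in for a noisy stack (Poisson counts, illustrative only)
rng = np.random.default_rng(0)
stack = rng.poisson(lam=20.0, size=(16, 32, 32)).astype(np.uint16)
normed, (lo, hi) = normalize_stack(stack)
median_img = temporal_median(normed)
```

Percentile-based scaling is less sensitive to a few saturated or dead pixels than min/max scaling, which is presumably why it is described as robust.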
Inference: sliding-window prediction over the input stack. Temporal tiles overlap by 50% and are blended with a Hann window, with mirror-padding along the time axis, to avoid the periodic frame-boundary artifacts that arise from non-overlapping temporal tiles. Spatial tiles overlap by 50% and are blended with Gaussian weights.
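The temporal part of this scheme can be sketched as below. It is a simplified illustration, not the submitted code: the window length of 16 frames, the use of NumPy's `reflect` padding mode, and the `model` callable (any function mapping a `[win, H, W]` block to a block of the same shape) are assumptions, and spatial tiling is omitted.

```python
import numpy as np

def blend_temporal_windows(stack, model, win=16):
    """Run `model` on 50%-overlapping temporal windows and blend with a Hann window."""
    T = stack.shape[0]
    assert win % 2 == 0 and T >= win
    hop = win // 2

    # Mirror-pad so every original frame is covered by exactly two windows
    n_hops = -(-T // hop)  # ceil(T / hop)
    pad_end = hop + n_hops * hop - T
    padded = np.pad(stack.astype(np.float32),
                    ((hop, pad_end), (0, 0), (0, 0)), mode="reflect")

    # Periodic Hann window: weights of two windows offset by `hop` sum to 1
    n = np.arange(win)
    w = (0.5 - 0.5 * np.cos(2.0 * np.pi * n / win)).astype(np.float32)

    acc = np.zeros_like(padded)
    wsum = np.zeros(padded.shape[0], dtype=np.float32)
    for start in range(0, padded.shape[0] - win + 1, hop):
        pred = model(padded[start:start + win])
        acc[start:start + win] += w[:, None, None] * pred
        wsum[start:start + win] += w

    # Normalize by accumulated weight, then crop the padding away
    blended = acc / np.maximum(wsum, 1e-8)[:, None, None]
    return blended[hop:hop + T]

# Sanity check with an identity "model": blending should reproduce the input
rng = np.random.default_rng(0)
noisy = rng.normal(size=(37, 8, 8)).astype(np.float32)
restored = blend_temporal_windows(noisy, model=lambda block: block)
```

Because each window's contribution tapers smoothly to zero at its edges, no frame sits on a hard tile boundary, which is the mechanism that suppresses the periodic artifacts mentioned above.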
Pretrain → fine-tune flow: the model is pretrained offline on a multi-stack training set (noisy stacks only, no clean targets). At submission time the pretrained checkpoint is loaded and the model is briefly fine-tuned (800 N2V iterations, lr=1e-4) on each input stack before inference. This adapts the model to the specific noise statistics of each input.
Inputs: one or more 3D TIFF stacks of noisy calcium imaging frames. Outputs: denoised 3D TIFF stacks with the same shape and dtype as the input.
Interfaces
This algorithm implements all of the following input-output combinations:
| Inputs | Outputs | |
|---|---|---|
| 1 |
Validation and Performance
Challenge Performance
| Date | Challenge | Phase | Rank |
|---|---|---|---|
| May 27, 2026 | AI4LIFE-CIDC25 | Preliminary Phase: Content Generalisation | 1 |
| May 27, 2026 | AI4LIFE-CIDC25 | Final Submission Phase: Content Generalization | 1 |
| May 27, 2026 | AI4LIFE-CIDC25 | Preliminary Phase: Noise Level Generalization | 1 |
| May 27, 2026 | AI4LIFE-CIDC25 | Final Submission Phase: Noise Level Generalization | 1 |
Uses and Directions
This algorithm was developed for research purposes only.