3D N2V U Net DVT Inspired
About
Summary
The algorithm is a 3D U-Net with a transformer bottleneck that adapts the Denoising Vision Transformers (Yang et al., 2024) framework to image-level denoising of calcium-imaging video stacks. The paper's core insight is that any ViT's output can be decomposed into three terms: a clean signal f(x), a position-dependent artifact g(E_pos), and a residual interaction h(x, E_pos). This decomposition transfers naturally to fluorescence microscopy, where the three terms map onto the true calcium signal, fixed-pattern sensor noise, and signal-dependent shot/read noise, respectively. The network processes a noisy 3D patch through a conventional convolutional encoder (two downsampling stages), passes the bottleneck features through a transformer block that explicitly implements the DVT decomposition (a learnable artifact field G, a 3-layer residual MLP, and a single-Transformer-block denoiser with new positional embeddings, the configuration the paper found best in Table 6, row d), then reconstructs the output through a symmetric convolutional decoder with skip connections. The output is residual: the network predicts the noise component, which is subtracted from the input to yield the denoised stack.
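The residual formulation can be illustrated with a toy numpy sketch. All shapes, magnitudes, and variable names below are illustrative stand-ins, not values from the paper: the point is only that the noisy stack is modeled as f(x) + g(E_pos) + h(x, E_pos), and that the network's job is to predict the last two terms so they can be subtracted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3D patch: [frames, height, width]
F, H, W = 8, 16, 16
clean = rng.random((F, H, W))                   # f(x): true calcium signal
fixed_pattern = 0.1 * rng.random((1, H, W))     # g(E_pos): position-dependent artifact,
fixed_pattern = np.repeat(fixed_pattern, F, 0)  # identical in every frame
residual = 0.05 * clean * rng.standard_normal((F, H, W))  # h(x, E_pos): signal-dependent noise

noisy = clean + fixed_pattern + residual

# The residual network predicts the noise component (g + h); here a perfect
# prediction stands in for the network to show the subtraction step.
predicted_noise = fixed_pattern + residual
denoised = noisy - predicted_noise

assert np.allclose(denoised, clean)
```

With a real network, `predicted_noise` is of course imperfect; the decomposition matters because the fixed-pattern term is shared across frames while the residual term varies per voxel.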
Mechanism
Target population
The algorithm targets researchers working with two-photon calcium-imaging recordings of neuronal populations (typically rodent cortex) used in systems neuroscience. It is a research tool for improving low-SNR fluorescence microscopy video — not for clinical use — so that downstream analyses such as ROI segmentation and spike inference become more reliable.
Algorithm description
A self-supervised 3D U-Net with a transformer bottleneck inspired by Denoising Vision Transformers (Yang et al., 2024). The bottleneck implements the paper's clean-signal / position-artifact / residual-noise decomposition via a learnable artifact field, a 3-layer residual MLP, and a single-Transformer-block denoiser with new positional embeddings. Training is zero-shot per stack: a 400-iteration temporal-median warmup is followed by 4000 iterations of 3D Noise2Void. Inference uses overlapping sliding windows with Gaussian blending.
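The Noise2Void step above relies on blind-spot masking: a random subset of voxels is replaced by nearby voxels, and the loss is computed only at those positions, so the network cannot learn the identity mapping. The sketch below shows a generic 3D N2V masking step; the mask ratio, neighborhood size, and function names are common N2V choices assumed for illustration, not the paper's exact hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def n2v_mask(patch, ratio=0.02):
    """Replace a random subset of voxels with random neighbors (blind spots).

    Returns the masked patch and the boolean mask; the training loss is
    computed only at masked positions.
    """
    masked = patch.copy()
    mask = rng.random(patch.shape) < ratio
    for f, y, x in np.argwhere(mask):
        # Pick a random neighbor within a 5x5x5 window, clipped to bounds.
        df, dy, dx = rng.integers(-2, 3, size=3)
        nf = np.clip(f + df, 0, patch.shape[0] - 1)
        ny = np.clip(y + dy, 0, patch.shape[1] - 1)
        nx = np.clip(x + dx, 0, patch.shape[2] - 1)
        masked[f, y, x] = patch[nf, ny, nx]
    return masked, mask

patch = rng.random((16, 32, 32)).astype(np.float32)
masked, mask = n2v_mask(patch)

# Loss restricted to blind-spot voxels (the identity stands in for the network):
prediction = masked
loss = np.mean((prediction[mask] - patch[mask]) ** 2)
```

Unmasked voxels are left untouched, which is what makes the scheme self-supervised: the network is asked to predict each blind-spot voxel from its spatiotemporal context alone.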
Inputs and outputs¶
Input: a noisy calcium-imaging stack as a 3D TIFF of shape [F, H, W] (typically [1500, 490, 490]).
Output: a denoised stack of identical shape and dtype, with calcium transients preserved and shot noise / fixed-pattern artifacts suppressed.
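The overlapping sliding-window inference with Gaussian blending mentioned above can be sketched as follows. Window size, stride, and the Gaussian width are illustrative assumptions; the idea is that each voxel's final value is a weighted average over all windows covering it, with weights peaked at window centers to suppress edge artifacts.

```python
import numpy as np

def gaussian_window(shape, sigma_frac=0.25):
    """Separable 3D Gaussian weight, peaked at the window center."""
    axes = []
    for n in shape:
        x = np.arange(n) - (n - 1) / 2
        axes.append(np.exp(-(x ** 2) / (2 * (sigma_frac * n) ** 2)))
    return axes[0][:, None, None] * axes[1][None, :, None] * axes[2][None, None, :]

def sliding_window_denoise(stack, model, win=(16, 64, 64), stride=(8, 32, 32)):
    """Run `model` on overlapping windows and blend with Gaussian weights."""
    out = np.zeros(stack.shape, dtype=np.float64)
    norm = np.zeros(stack.shape, dtype=np.float64)
    w = gaussian_window(win)
    F, H, W = stack.shape
    for f in range(0, max(F - win[0], 0) + 1, stride[0]):
        for y in range(0, max(H - win[1], 0) + 1, stride[1]):
            for x in range(0, max(W - win[2], 0) + 1, stride[2]):
                patch = stack[f:f + win[0], y:y + win[1], x:x + win[2]]
                pred = model(patch)
                out[f:f + win[0], y:y + win[1], x:x + win[2]] += pred * w
                norm[f:f + win[0], y:y + win[1], x:x + win[2]] += w
    return (out / np.maximum(norm, 1e-12)).astype(stack.dtype)

# With the identity as a stand-in model, blending reproduces the input
# wherever windows cover the stack.
stack = np.random.default_rng(2).random((32, 128, 128)).astype(np.float32)
denoised = sliding_window_denoise(stack, model=lambda p: p)
```

The normalization by the accumulated weights makes the result independent of how many windows overlap each voxel, so stride can be traded off against runtime without changing the blend.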
Interfaces
This algorithm implements all of the following input-output combinations:
| | Inputs | Outputs |
|---|---|---|
| 1 | Noisy calcium-imaging stack (3D TIFF, [F, H, W]) | Denoised stack of identical shape and dtype |
Validation and Performance
Challenge Performance
| Date | Challenge | Phase | Rank |
|---|---|---|---|
| May 5, 2026 | AI4LIFE-CIDC25 | Final Submission Phase: Content Generalization | 3 |
Uses and Directions
This algorithm was developed for research purposes only.