Abstract
Semantic ultra-high-resolution (UHR) image segmentation is essential in remote sensing applications such as aerial mapping and environmental monitoring. Applying Transformer-based models in this setting remains challenging because attention memory grows quadratically with the number of tokens, limiting either spatial resolution or contextual scope. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch Swin-based architecture that injects low-resolution contextual information into fine-grained high-resolution features through lightweight stage-wise cross-attention. To strengthen cross-scale learning, we also propose a SimMIM-style pretraining strategy based on masked reconstruction of the high-resolution image. Extensive experiments on the large-scale FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Under our RGB-only UHR protocol, CASWiT reaches 66.37% mIoU with a SegFormer decoder, improving over strong RGB baselines while also improving boundary quality. On the URUR benchmark, CASWiT reaches 49.2% mIoU under the official evaluation protocol, and it also transfers effectively to medical UHR segmentation benchmarks. Code and pretrained models are available at https://huggingface.co/collections/heig-vd-geo/caswit.
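The core idea above — high-resolution tokens querying a small set of low-resolution context tokens — can be sketched as single-head cross-attention. This is an illustrative NumPy toy, not the authors' implementation: it omits the learned query/key/value projections, multi-head splitting, and normalization layers a real Swin-based block would use. The payoff it illustrates is the memory footprint: attention here scales with N_hr × N_ctx rather than N_hr².

```python
import numpy as np

def cross_attention(hr_tokens, ctx_tokens):
    """Single-head cross-attention sketch (no learned projections):
    high-resolution tokens [N_hr, d] attend to low-resolution
    context tokens [N_ctx, d]; output keeps the high-res shape."""
    d = hr_tokens.shape[-1]
    scores = hr_tokens @ ctx_tokens.T / np.sqrt(d)          # [N_hr, N_ctx]
    scores -= scores.max(axis=-1, keepdims=True)            # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return hr_tokens + weights @ ctx_tokens                 # residual context injection

rng = np.random.default_rng(0)
hr = rng.standard_normal((64, 32))    # 64 high-res tokens, dim 32
ctx = rng.standard_normal((16, 32))   # only 16 coarse context tokens
out = cross_attention(hr, ctx)
print(out.shape)                      # (64, 32)
```

Because the attention matrix is 64×16 instead of 64×64, injecting context stays cheap even as the high-resolution branch grows; this is the sense in which stage-wise cross-attention can be "lightweight".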
Citation
@article{Carreaud2026ContextAware,
title={Context-Aware Semantic Segmentation via Stage-Wise Attention},
author={Antoine Carreaud and Nina Lahellec and Elias Naha and Jan Skaloud and Arthur Chansel and Adrien Gressin},
year={2026},
url={https://cspaper.org/openprint/20260410.0001v1},
journal={OpenPrint:20260410.0001v1}
}
Version History
| Version | Released Date | Submitter |
|---|---|---|
| v1 (current) | Apr 10, 2026 | Antoine Carreaud |
