% ECCV 2026 European Conference on Computer Vision paper template
% License: Free to use (MIT)
% File: eccv-2026/main.tex
\documentclass[twocolumn]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[margin=0.75in]{geometry}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{enumitem}
\usepackage{xcolor}
\usepackage{times}
\usepackage{natbib}
\usepackage{hyperref}
\usepackage{url}
\usepackage{microtype}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{multirow}
\usepackage{algorithm}
\usepackage{algorithmic}

\setlength{\columnsep}{0.25in}

\title{\Large\bfseries Token-Efficient Vision Transformers via\\Dynamic Spatial Pruning for Dense Prediction}

\author{
  \textbf{Sofia Andersson}\textsuperscript{1}\quad
  \textbf{Kenji Yamamoto}\textsuperscript{2}\quad
  \textbf{Laura Bianchi}\textsuperscript{1}\quad
  \textbf{Ahmed Hassan}\textsuperscript{3}\\[4pt]
  \textsuperscript{1}Visual Geometry Group, University of Oxford\\
  \textsuperscript{2}National Institute of Informatics, Tokyo\\
  \textsuperscript{3}Technical University of Munich\\[2pt]
  {\small\texttt{\{sofia,laura\}@robots.ox.ac.uk, [email protected], [email protected]}}
}

\date{}

\begin{document}
\maketitle

\begin{abstract}
Vision Transformers (ViTs) achieve remarkable performance on visual recognition tasks but suffer from quadratic computational complexity in the number of tokens. For dense prediction tasks like semantic segmentation and object detection, where high spatial resolution is critical, this cost becomes prohibitive. We propose \textsc{DynaPrune-ViT}, a token-efficient vision transformer that dynamically prunes redundant spatial tokens at each layer based on learned importance scores. Unlike prior token pruning methods designed for classification, our approach preserves spatial structure by maintaining a sparse token grid and reconstructing dense predictions through a lightweight upsampling module. A key innovation is our \emph{deferred pruning} strategy that makes hard pruning decisions only after aggregating information through several attention layers, avoiding premature token removal. On ADE20K semantic segmentation, \textsc{DynaPrune-ViT} achieves 48.7 mIoU while reducing FLOPs by 42\%, outperforming existing efficient ViT approaches. On COCO object detection, we achieve 51.3 AP with 38\% fewer FLOPs than the dense baseline.
\end{abstract}

\section{Introduction}

Vision Transformers \citep{dosovitskiy2021image} have become the architecture of choice for visual recognition, achieving state-of-the-art results across classification, detection, and segmentation. However, their self-attention mechanism has $O(N^2)$ complexity in the number of tokens $N$, which presents a fundamental scalability challenge. For dense prediction tasks that require high-resolution features, the computational cost of processing thousands of spatial tokens through multiple attention layers becomes a critical bottleneck.

Prior work on efficient ViTs has pursued several directions: local attention windows \citep{liu2021swin}, linear attention approximations \citep{katharopoulos2020transformers}, and token reduction \citep{rao2021dynamicvit}. Token reduction is particularly attractive because it directly addresses the quadratic complexity by reducing $N$. However, existing approaches are designed for classification, where a single global prediction requires only the class token. For dense prediction tasks, every spatial location matters, and naive token pruning destroys the spatial structure needed for pixel-level predictions.

We propose \textsc{DynaPrune-ViT}, which addresses this challenge through three key ideas:

\begin{enumerate}[nosep,leftmargin=*]
  \item \textbf{Importance-aware token scoring}: Each layer computes learned importance scores for tokens, identifying regions that require detailed processing versus those that can be summarized.
  \item \textbf{Deferred pruning}: Rather than pruning at the earliest possible layer, we allow tokens to participate in several attention layers first, accumulating contextual information before making irreversible pruning decisions.
  \item \textbf{Sparse-to-dense reconstruction}: A lightweight module reconstructs full spatial resolution from the sparse token set using position-aware interpolation.
\end{enumerate}

\section{Related Work}

\paragraph{Efficient Vision Transformers.}
Swin Transformer \citep{liu2021swin} restricts attention to local windows with shifted patterns. PVT \citep{wang2021pyramid} uses spatial-reduction attention. EfficientViT \citep{cai2023efficientvit} combines hardware-friendly operations with cascade attention. These approaches change the attention pattern; our work is complementary as it reduces the token count.

\paragraph{Token Pruning and Merging.}
DynamicViT \citep{rao2021dynamicvit} learns to prune inattentive tokens for classification. ToMe \citep{bolya2023token} merges similar tokens via bipartite matching. EViT \citep{liang2022evit} fuses pruned tokens into a single token. These methods focus on classification; adapting them to dense prediction requires maintaining spatial structure, which is our focus.

\paragraph{Dense Prediction with ViTs.}
SegFormer \citep{xie2021segformer} uses a hierarchical ViT with a lightweight decoder. Mask2Former \citep{cheng2022masked} combines masked attention with multiscale features. Our pruning approach can be applied to any ViT backbone used for dense prediction.

\section{Method}

\subsection{Token Importance Scoring}

At each layer $l$, we compute importance scores $\mathbf{s}^l \in \mathbb{R}^{N_l}$ for the current token set. The scoring function combines local saliency and global context:
\begin{equation}
s_i^l = \sigma\!\left(\mathbf{w}_s^T \left[\mathbf{h}_i^l \,\|\, \frac{1}{N_l}\sum_{j=1}^{N_l} \alpha_{ij}\mathbf{h}_j^l\right]\right)
\end{equation}
where $\mathbf{h}_i^l$ is the token representation, $\alpha_{ij}$ are attention weights from the preceding self-attention, and $\sigma$ is the sigmoid function. The concatenation of local features and attention-weighted global features allows the scorer to assess each token's importance in context.
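The scoring rule above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the array names (\texttt{h}, \texttt{attn}, \texttt{w\_s}) are ours, and the learned vector $\mathbf{w}_s$ is passed in as a plain array.

```python
import numpy as np

def token_importance(h, attn, w_s):
    """Per-token importance score (Eq. 1): sigmoid of a linear map over
    the concatenation of each token's own features and its
    attention-weighted mean of all token features.

    h:    (N, d) token representations h_i^l
    attn: (N, N) attention weights alpha_ij from the preceding layer
    w_s:  (2*d,) learned scoring vector (illustrative name)
    """
    n = h.shape[0]
    ctx = (attn @ h) / n                       # (1/N) sum_j alpha_ij h_j
    z = np.concatenate([h, ctx], axis=1) @ w_s  # linear map on [h_i || ctx_i]
    return 1.0 / (1.0 + np.exp(-z))             # sigmoid -> scores in (0, 1)
```

In a real model the attention weights would be averaged over heads before being fed to the scorer; the sketch assumes a single head for brevity.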

\subsection{Deferred Pruning Strategy}

Rather than pruning at every layer, we define pruning stages at layers $\{l_1, l_2, \ldots, l_S\}$. At stage $s$, we retain the top-$k_s$ tokens by importance score:
\begin{equation}
\mathcal{T}^{l_s} = \text{TopK}(\mathbf{s}^{l_s}, k_s) \quad\text{where}\quad k_s = \lfloor r_s \cdot N_0 \rfloor
\end{equation}

The retention ratios $r_1 > r_2 > \cdots > r_S$ define a pruning schedule. We use straight-through Gumbel-Softmax for differentiable training of the discrete selection.

In the layers between pruning stages, all current tokens participate in attention, allowing information to flow from soon-to-be-pruned tokens to retained ones. This deferral gives the network time to consolidate information before committing to pruning decisions.
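The inference-time hard selection of Eq. (2) reduces to a top-$k$ index operation; a minimal NumPy sketch follows. Note that training replaces this hard step with the straight-through Gumbel-Softmax estimator mentioned above, which is not shown here.

```python
import numpy as np

def topk_retain(scores, r_s, n0):
    """Indices of tokens kept at one pruning stage (Eq. 2):
    k_s = floor(r_s * N_0), then the top-k_s tokens by importance.
    Returned indices are sorted so the sparse grid keeps spatial order.
    """
    k = int(np.floor(r_s * n0))
    return np.sort(np.argsort(scores)[::-1][:k])
```

A schedule such as $r = (0.7, 0.5, 0.35)$ would apply this selection at each of the three pruning stages, each time against the original token count $N_0$.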

\subsection{Sparse-to-Dense Reconstruction}

After the final ViT layer, we have $k_S$ tokens at sparse spatial positions. To reconstruct dense features for pixel-level prediction, we use position-aware feature propagation:
\begin{equation}
\hat{\mathbf{f}}(p) = \sum_{i \in \mathcal{T}_{\text{final}}} w(p, p_i) \cdot \mathbf{h}_i^L
\end{equation}
where $w(p, p_i)$ combines inverse distance weighting with a learned position-dependent kernel:
\begin{equation}
w(p, p_i) = \frac{\exp(-\|p - p_i\|_2 / \tau + \phi(p, p_i))}{\sum_{j \in \mathcal{T}_{\text{final}}} \exp(-\|p - p_j\|_2 / \tau + \phi(p, p_j))}
\end{equation}
and $\phi$ is a small MLP that modulates weights based on the relative position encoding.
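Equations (3)--(4) amount to a softmax over negative scaled distances, optionally shifted by the learned term $\phi$. The NumPy sketch below is illustrative only: $\phi$ defaults to zero (the paper uses a small MLP), and all names are ours.

```python
import numpy as np

def reconstruct_dense(query_pos, token_pos, token_feat, tau=1.0, phi=None):
    """Position-aware feature propagation (Eqs. 3-4): each dense query
    position p receives a softmax-weighted sum of retained token
    features, with logits -||p - p_i||_2 / tau (+ phi, if given).

    query_pos:  (M, 2) dense positions p
    token_pos:  (K, 2) positions p_i of retained tokens
    token_feat: (K, d) final-layer features h_i^L
    phi:        optional callable -> (M, K) learned logit offsets
    """
    dist = np.linalg.norm(
        query_pos[:, None, :] - token_pos[None, :, :], axis=-1)  # (M, K)
    logits = -dist / tau
    if phi is not None:
        logits = logits + phi(query_pos, token_pos)
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ token_feat                          # (M, d) dense features
```

With a small $\tau$ the weights concentrate on the nearest retained token, recovering nearest-neighbor upsampling as a limiting case.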

\subsection{Training}

The total loss combines the task loss with a pruning regularizer:
\begin{equation}
\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \sum_{s=1}^{S} \left|\frac{1}{N_0}\sum_{i} s_i^{l_s} - r_s\right|^2
\end{equation}
The regularizer encourages the learned importance scores to match the target retention ratio, avoiding degenerate solutions where all tokens have similar scores.
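The ratio-matching regularizer is a direct translation of the second term of the total loss. A minimal sketch, with illustrative names ($\lambda$ is applied by the caller):

```python
import numpy as np

def pruning_regularizer(stage_scores, ratios):
    """Ratio-matching penalty: squared gap between the mean importance
    score at each pruning stage and its target retention ratio r_s,
    summed over stages.

    stage_scores: list of S arrays of per-token scores s_i^{l_s}
    ratios:       list of S target retention ratios r_s
    """
    return sum((s.mean() - r) ** 2 for s, r in zip(stage_scores, ratios))
```

The penalty is zero exactly when the average score at each stage matches its target ratio, which pushes scores toward a bimodal keep/drop distribution rather than a uniform one.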

\section{Experiments}

\subsection{Semantic Segmentation on ADE20K}

\begin{table}[t]
\centering
\caption{Semantic segmentation on ADE20K val.}
\label{tab:seg}
\small
\begin{tabular}{@{}lccr@{}}
\toprule
\textbf{Method} & \textbf{Backbone} & \textbf{mIoU} & \textbf{GFLOPs} \\
\midrule
SegFormer-B3 & MiT-B3 & 47.6 & 79.0 \\
Swin-L + UPerNet & Swin-L & 52.1 & 405 \\
Mask2Former & Swin-L & 56.1 & 466 \\
\midrule
ViT-L + UPerNet & ViT-L & 50.9 & 362 \\
EViT-L + UPerNet & ViT-L & 47.2 & 228 \\
ToMe-L + UPerNet & ViT-L & 48.1 & 246 \\
\textbf{DynaPrune-L + UPerNet} & ViT-L & \textbf{48.7} & \textbf{210} \\
\bottomrule
\end{tabular}
\end{table}

Table~\ref{tab:seg} shows that \textsc{DynaPrune-ViT} achieves the best efficiency-accuracy trade-off among token reduction methods, retaining 95.7\% of dense ViT performance while using only 58\% of the FLOPs.

\subsection{Object Detection on COCO}

\begin{table}[t]
\centering
\caption{Object detection on COCO val2017.}
\label{tab:det}
\small
\begin{tabular}{@{}lccc@{}}
\toprule
\textbf{Method} & \textbf{AP} & \textbf{AP\textsubscript{50}} & \textbf{GFLOPs} \\
\midrule
ViT-L + Cascade RCNN & 52.8 & 71.3 & 498 \\
EViT-L + Cascade RCNN & 49.1 & 68.2 & 321 \\
ToMe-L + Cascade RCNN & 49.8 & 69.1 & 338 \\
\textbf{DynaPrune-L + Cascade RCNN} & \textbf{51.3} & \textbf{70.5} & \textbf{308} \\
\bottomrule
\end{tabular}
\end{table}

On COCO, \textsc{DynaPrune-ViT} achieves 51.3 AP with 38\% fewer FLOPs than the dense baseline (Table~\ref{tab:det}), outperforming EViT and ToMe by significant margins.

\subsection{Ablation Study}

\begin{table}[t]
\centering
\caption{Ablation on ADE20K val.}
\label{tab:ablation}
\small
\begin{tabular}{@{}lcc@{}}
\toprule
\textbf{Configuration} & \textbf{mIoU} & \textbf{GFLOPs} \\
\midrule
Full model & 48.7 & 210 \\
Without deferred pruning & 46.3 & 208 \\
Without sparse-to-dense recon. & 44.1 & 204 \\
Uniform pruning (no learning) & 42.8 & 210 \\
Random pruning & 39.5 & 210 \\
\bottomrule
\end{tabular}
\end{table}

Table~\ref{tab:ablation} confirms that all components contribute meaningfully. Deferred pruning accounts for a 2.4 mIoU improvement, validating our hypothesis that early information aggregation before pruning is critical. The sparse-to-dense reconstruction module adds 4.6 mIoU over naive nearest-neighbor upsampling.

\subsection{Visualization}

We visualize token retention patterns and find that \textsc{DynaPrune-ViT} learns to keep tokens near object boundaries and complex texture regions while pruning homogeneous background areas. This spatial selectivity emerges without any explicit boundary supervision, demonstrating that the task loss alone provides sufficient signal for learning meaningful pruning patterns.

\section{Conclusion}

We presented \textsc{DynaPrune-ViT}, a token-efficient vision transformer designed for dense prediction tasks. Through importance-aware scoring, deferred pruning, and sparse-to-dense reconstruction, our approach achieves substantial computational savings while preserving the spatial structure necessary for pixel-level predictions. Experiments on semantic segmentation and object detection demonstrate state-of-the-art efficiency-accuracy trade-offs.

{\small
\bibliographystyle{plainnat}
\begin{thebibliography}{15}
\bibitem[Bolya et~al.(2023)]{bolya2023token}
D.~Bolya, C.-Y. Fu, X.~Dai, P.~Zhang, C.~Feichtenhofer, and J.~Hoffman.
\newblock Token merging: Your {ViT} but faster.
\newblock In \emph{Proc.\ ICLR}, 2023.

\bibitem[Cai et~al.(2023)]{cai2023efficientvit}
H.~Cai, C.~Gan, and S.~Han.
\newblock {EfficientViT}: Lightweight multi-scale attention for high-resolution dense prediction.
\newblock In \emph{Proc.\ ICCV}, 2023.

\bibitem[Cheng et~al.(2022)]{cheng2022masked}
B.~Cheng, I.~Misra, A.~Schwing, A.~Kirillov, and R.~Girdhar.
\newblock Masked-attention mask transformer for universal image segmentation.
\newblock In \emph{Proc.\ CVPR}, 2022.

\bibitem[Dosovitskiy et~al.(2021)]{dosovitskiy2021image}
A.~Dosovitskiy, L.~Beyer, A.~Kolesnikov, D.~Weissenborn, X.~Zhai, T.~Unterthiner, M.~Dehghani, M.~Minderer, G.~Heigold, S.~Gelly, J.~Uszkoreit, and N.~Houlsby.
\newblock An image is worth 16x16 words: Transformers for image recognition at scale.
\newblock In \emph{Proc.\ ICLR}, 2021.

\bibitem[Katharopoulos et~al.(2020)]{katharopoulos2020transformers}
A.~Katharopoulos, A.~Vyas, N.~Pappas, and F.~Fleuret.
\newblock Transformers are {RNNs}: Fast autoregressive transformers with linear attention.
\newblock In \emph{Proc.\ ICML}, 2020.

\bibitem[Liang et~al.(2022)]{liang2022evit}
Y.~Liang, C.~Ge, Z.~Tong, Y.~Song, J.~Wang, and P.~Xie.
\newblock Not all patches are what you need: Expediting vision transformers via token reorganizations.
\newblock In \emph{Proc.\ ICLR}, 2022.

\bibitem[Liu et~al.(2021)]{liu2021swin}
Z.~Liu, Y.~Lin, Y.~Cao, H.~Hu, Y.~Wei, Z.~Zhang, S.~Lin, and B.~Guo.
\newblock Swin transformer: Hierarchical vision transformer using shifted windows.
\newblock In \emph{Proc.\ ICCV}, 2021.

\bibitem[Rao et~al.(2021)]{rao2021dynamicvit}
Y.~Rao, W.~Zhao, B.~Liu, J.~Lu, J.~Zhou, and C.-J. Hsieh.
\newblock {DynamicViT}: Efficient vision transformers with dynamic token sparsification.
\newblock In \emph{Proc.\ NeurIPS}, 2021.

\bibitem[Wang et~al.(2021)]{wang2021pyramid}
W.~Wang, E.~Xie, X.~Li, D.-P. Fan, K.~Song, D.~Liang, T.~Lu, P.~Luo, and L.~Shao.
\newblock Pyramid vision transformer: A versatile backbone for dense prediction without convolutions.
\newblock In \emph{Proc.\ ICCV}, 2021.

\bibitem[Xie et~al.(2021)]{xie2021segformer}
E.~Xie, W.~Wang, Z.~Yu, A.~Anandkumar, J.~Alvarez, and P.~Luo.
\newblock {SegFormer}: Simple and efficient design for semantic segmentation with transformers.
\newblock In \emph{Proc.\ NeurIPS}, 2021.
\end{thebibliography}
}

\end{document}