% CVPR 2026 conference paper template (two-column format)
\documentclass[10pt,twocolumn]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[margin=0.75in]{geometry}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{enumitem}
\usepackage{xcolor}
\usepackage{times}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{natbib}
\usepackage{hyperref}
\usepackage{url}
\usepackage{microtype}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{multirow}

\setlength{\columnsep}{0.25in}

\title{\Large\bfseries Spatial Frequency Decomposition Networks for\\Robust Fine-Grained Visual Recognition}

\author{
  \textbf{Yuki Tanaka}\textsuperscript{1}\quad
  \textbf{Sarah Mitchell}\textsuperscript{2}\quad
  \textbf{Priya Krishnan}\textsuperscript{1}\quad
  \textbf{Lukas Stein}\textsuperscript{3}\\[4pt]
  \textsuperscript{1}Computer Vision Lab, ETH Z\"urich\quad
  \textsuperscript{2}Google DeepMind\quad
  \textsuperscript{3}Max Planck Institute for Informatics\\[2pt]
  {\small\texttt{\{tanaka,krishnan\}@vision.ee.ethz.ch, [email protected], [email protected]}}
}

\date{}

\begin{document}
\maketitle

\begin{abstract}
Fine-grained visual recognition remains challenging due to the subtle inter-class differences that distinguish similar categories. Existing approaches typically operate on spatial representations alone, overlooking valuable discriminative information present in the frequency domain. We propose \textsc{FreqDecomp-Net}, a novel architecture that explicitly decomposes input images into spatial frequency bands and learns specialized representations for each band before fusing them for classification. Our frequency decomposition module uses learnable wavelet-inspired filters that separate low-frequency structural information from high-frequency textural details. A cross-frequency attention mechanism then models interactions between frequency bands to capture discriminative patterns that span multiple scales. On CUB-200-2011, Stanford Cars, and FGVC-Aircraft, \textsc{FreqDecomp-Net} achieves 93.2\%, 96.1\%, and 95.4\% accuracy respectively, establishing new state-of-the-art results. Extensive ablations demonstrate that frequency decomposition provides complementary information to spatial features, particularly for categories distinguished by fine texture patterns.
\end{abstract}

\section{Introduction}

Fine-grained visual recognition (FGVR) aims to distinguish between subordinate categories within a broader class, such as species of birds \citep{wah2011cub}, models of cars \citep{krause2013cars}, or variants of aircraft \citep{maji2013aircraft}. Unlike coarse-grained recognition where objects differ in overall shape and appearance, FGVR demands attention to subtle differences in texture, pattern, and local structure.

Recent advances in FGVR have focused on learning discriminative part representations \citep{zhang2019learning}, attention mechanisms \citep{zheng2019looking}, and higher-order feature interactions \citep{lin2015bilinear}. These approaches operate primarily in the spatial domain, extracting features from RGB pixel values. However, substantial discriminative information resides in the \emph{frequency domain}. For instance, the fine striping patterns on a Zebra Finch versus a House Finch manifest as distinct high-frequency signatures, while overall body shape appears in low-frequency components.

We propose \textsc{FreqDecomp-Net}, which bridges spatial and frequency domain analysis for fine-grained recognition. Our key insight is that different frequency bands carry different types of discriminative information, and learning specialized representations for each band improves recognition of subtle inter-class differences.

Our contributions are:
\begin{itemize}[nosep,leftmargin=*]
  \item A learnable frequency decomposition module that separates images into multiple frequency bands with end-to-end optimizable filters.
  \item A cross-frequency attention mechanism that captures discriminative interactions across frequency scales.
  \item State-of-the-art results on three major fine-grained recognition benchmarks.
  \item Comprehensive analysis showing when frequency information is most beneficial.
\end{itemize}

\section{Related Work}

\paragraph{Fine-Grained Recognition.}
Early deep learning approaches to FGVR used part annotations to localize discriminative regions \citep{zhang2014part}. Bilinear pooling methods \citep{lin2015bilinear} capture second-order feature statistics without explicit part detection. Attention-based methods \citep{zheng2019looking,fu2017look} learn to focus on informative regions automatically. Our work is complementary, as frequency decomposition can be combined with any of these approaches.

\paragraph{Frequency Analysis in Vision.}
The use of frequency domain analysis in deep learning has gained renewed interest. \citet{xu2020learning} showed that CNNs exhibit frequency-dependent biases. \citet{qin2021fcanet} proposed frequency channel attention for general image classification. Unlike these works, we perform explicit multi-band decomposition with learnable filters tailored for fine-grained discrimination.

\section{Methodology}

\subsection{Architecture Overview}

Given an input image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$, \textsc{FreqDecomp-Net} processes it through three stages: (1) frequency decomposition, (2) band-specific feature extraction, and (3) cross-frequency fusion.
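For concreteness, the three stages can be summarized as pseudocode (a schematic sketch; the symbols are defined in the following subsections):

\begin{algorithm}[t]
\caption{\textsc{FreqDecomp-Net} forward pass (sketch)}
\begin{algorithmic}[1]
\REQUIRE image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$, filters $\{\psi_k\}_{k=1}^{K}$
\FOR{$k = 1$ \TO $K$}
  \STATE $\mathbf{I}_k \leftarrow \psi_k * \mathbf{I}$ \COMMENT{frequency decomposition}
  \STATE $\mathbf{f}_k \leftarrow g_k(\phi(\mathbf{I}_k))$ \COMMENT{band-specific features}
\ENDFOR
\STATE $\mathbf{F} \leftarrow [\mathbf{f}_1; \ldots; \mathbf{f}_K]$ \COMMENT{stack band features}
\STATE $\mathbf{F}' \leftarrow \mathrm{CrossFreqAttn}(\mathbf{F})$ \COMMENT{cross-frequency fusion}
\RETURN $\mathrm{classifier}(\mathrm{pool}(\mathbf{F}'))$
\end{algorithmic}
\end{algorithm}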

\subsection{Learnable Frequency Decomposition}

We decompose the input into $K$ frequency bands using learnable filters. Let $\psi_k$ denote the $k$-th filter:
\begin{equation}
\mathbf{I}_k = \psi_k * \mathbf{I}, \quad k = 1, \ldots, K
\end{equation}

The filters are initialized as Gabor wavelets at different frequencies and orientations, then fine-tuned during training:
\begin{equation}
\psi_k^{(0)}(x, y) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma_k^2}\right) \cos\!\left(\frac{2\pi x'}{\lambda_k}\right)
\end{equation}
where $x' = x\cos\theta_k + y\sin\theta_k$ and $y' = -x\sin\theta_k + y\cos\theta_k$. The parameters $\{\sigma_k, \lambda_k, \theta_k, \gamma\}$ are then optimized jointly with the rest of the network.

A reconstruction loss ensures completeness:
\begin{equation}
\mathcal{L}_{\text{recon}} = \left\| \mathbf{I} - \sum_{k=1}^{K} \mathbf{I}_k \right\|_F^2
\end{equation}

\subsection{Band-Specific Feature Extraction}

Each frequency band $\mathbf{I}_k$ is processed by a shared backbone $\phi$ (ResNet-50) with band-specific adaptation layers $g_k$:
\begin{equation}
\mathbf{f}_k = g_k(\phi(\mathbf{I}_k)) \in \mathbb{R}^d
\end{equation}

\subsection{Cross-Frequency Attention}

We stack band features into a sequence $\mathbf{F} = [\mathbf{f}_1; \ldots; \mathbf{f}_K] \in \mathbb{R}^{K \times d}$ and apply multi-head cross-attention:
\begin{equation}
\mathbf{F}' = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_h}}\right)\mathbf{V} + \mathbf{F}
\end{equation}
where $\mathbf{Q} = \mathbf{F}\mathbf{W}_Q$, $\mathbf{K} = \mathbf{F}\mathbf{W}_K$, $\mathbf{V} = \mathbf{F}\mathbf{W}_V$, and $d_h$ is the per-head dimension. (With slight abuse of notation, $\mathbf{K}$ here denotes the key matrix rather than the number of frequency bands.)

The final representation is obtained by global average pooling over the $K$ band features, followed by a classification head.
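Concretely, one natural instantiation of this pooling-and-classification step is the following (the classifier weights $\mathbf{W}_c$ are introduced here for illustration):
\begin{equation}
\bar{\mathbf{f}} = \frac{1}{K}\sum_{k=1}^{K} \mathbf{f}'_k, \qquad \hat{\mathbf{y}} = \mathrm{softmax}\!\left(\mathbf{W}_c\,\bar{\mathbf{f}}\right)
\end{equation}
where $\mathbf{f}'_k$ denotes the $k$-th row of $\mathbf{F}'$.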

\subsection{Training}

The total loss combines cross-entropy classification with the reconstruction regularizer:
\begin{equation}
\mathcal{L} = \mathcal{L}_{\text{CE}} + \alpha \mathcal{L}_{\text{recon}}
\end{equation}
with $\alpha = 0.01$ throughout our experiments.

\section{Experiments}

\subsection{Datasets and Setup}

We evaluate on three standard benchmarks:
\begin{itemize}[nosep,leftmargin=*]
  \item \textbf{CUB-200-2011} \citep{wah2011cub}: 200 bird species, 5,994 train / 5,794 test images.
  \item \textbf{Stanford Cars} \citep{krause2013cars}: 196 car models, 8,144 train / 8,041 test images.
  \item \textbf{FGVC-Aircraft} \citep{maji2013aircraft}: 100 aircraft variants, 6,667 train / 3,333 test images.
\end{itemize}

We use ResNet-50 pretrained on ImageNet as the backbone. Images are resized to $448 \times 448$. We train for 100 epochs using SGD with momentum 0.9, initial learning rate 0.01 with cosine annealing, batch size 16, and standard data augmentation.

\subsection{Main Results}

\begin{table}[t]
\centering
\caption{Comparison with state-of-the-art methods. Accuracy (\%).}
\label{tab:main}
\small
\begin{tabular}{@{}lccc@{}}
\toprule
\textbf{Method} & \textbf{CUB} & \textbf{Cars} & \textbf{Aircraft} \\
\midrule
ResNet-50 \citep{he2016deep} & 84.5 & 92.8 & 88.4 \\
Bilinear CNN \citep{lin2015bilinear} & 84.1 & 91.3 & 84.1 \\
MA-CNN \citep{zheng2019looking} & 86.5 & 92.8 & 89.9 \\
NTS-Net \citep{yang2018learning} & 87.5 & 93.9 & 91.4 \\
API-Net \citep{zhuang2020learning} & 90.0 & 95.3 & 93.9 \\
TransFG \citep{he2022transfg} & 91.7 & 94.8 & 93.1 \\
FFVT \citep{wang2021feature} & 91.6 & 95.5 & 94.2 \\
\midrule
\textbf{FreqDecomp-Net} & \textbf{93.2} & \textbf{96.1} & \textbf{95.4} \\
\bottomrule
\end{tabular}
\end{table}

Table~\ref{tab:main} shows that \textsc{FreqDecomp-Net} achieves state-of-the-art performance on all three benchmarks. The improvements are most pronounced on CUB-200 (+1.5\% over previous best), where fine texture differences (plumage patterns, beak shapes) are critical discriminative cues that benefit from explicit frequency analysis.

\subsection{Ablation Studies}

\begin{table}[t]
\centering
\caption{Ablation study on CUB-200-2011.}
\label{tab:ablation}
\small
\begin{tabular}{@{}lc@{}}
\toprule
\textbf{Configuration} & \textbf{Acc.} \\
\midrule
Spatial only (baseline) & 89.4 \\
+ Fixed frequency decomposition & 91.1 \\
+ Learnable decomposition & 91.8 \\
+ Cross-frequency attention & 92.6 \\
+ Reconstruction loss & \textbf{93.2} \\
\bottomrule
\end{tabular}
\end{table}

Table~\ref{tab:ablation} shows that each component contributes positively. Learnable decomposition improves over fixed Gabor filters by 0.7\%, confirming that task-specific frequency selection matters. Cross-frequency attention adds 0.8\%, demonstrating the value of modeling inter-band interactions.

\section{Visualization and Analysis}

We analyze which frequency bands are most discriminative for different fine-grained categories. For bird species, high-frequency bands (capturing feather patterns and edge textures) receive the highest attention weights (avg.\ 0.38), while for cars, mid-frequency bands (capturing grille designs and body curves) dominate (avg.\ 0.42). This confirms that different recognition tasks require different frequency sensitivities, motivating our learnable decomposition approach.

\section{Conclusion}

We introduced \textsc{FreqDecomp-Net}, a frequency-aware architecture for fine-grained visual recognition. By explicitly decomposing images into learnable frequency bands and modeling cross-frequency interactions, our method captures discriminative patterns that spatial-only approaches miss. State-of-the-art results on three benchmarks validate the effectiveness of frequency domain analysis for fine-grained recognition. Future directions include extending frequency decomposition to video-based fine-grained recognition and exploring its synergy with self-supervised pretraining.

{\small
\bibliographystyle{plainnat}
\begin{thebibliography}{20}
\bibitem[Fu et~al.(2017)]{fu2017look}
J.~Fu, H.~Zheng, and T.~Mei.
\newblock Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition.
\newblock In \emph{Proc.\ CVPR}, 2017.

\bibitem[He et~al.(2016)]{he2016deep}
K.~He, X.~Zhang, S.~Ren, and J.~Sun.
\newblock Deep residual learning for image recognition.
\newblock In \emph{Proc.\ CVPR}, 2016.

\bibitem[He et~al.(2022)]{he2022transfg}
J.~He, J.-N. Chen, S.~Liu, A.~Kortylewski, C.~Yang, Y.~Bai, and C.~Wang.
\newblock {TransFG}: A transformer architecture for fine-grained recognition.
\newblock In \emph{Proc.\ AAAI}, 2022.

\bibitem[Krause et~al.(2013)]{krause2013cars}
J.~Krause, M.~Stark, J.~Deng, and L.~Fei-Fei.
\newblock 3{D} object representations for fine-grained categorization.
\newblock In \emph{Proc.\ ICCV Workshops}, 2013.

\bibitem[Lin et~al.(2015)]{lin2015bilinear}
T.-Y. Lin, A.~RoyChowdhury, and S.~Maji.
\newblock Bilinear {CNN} models for fine-grained visual recognition.
\newblock In \emph{Proc.\ ICCV}, 2015.

\bibitem[Maji et~al.(2013)]{maji2013aircraft}
S.~Maji, E.~Rahtu, J.~Kannala, M.~Blaschko, and A.~Vedaldi.
\newblock Fine-grained visual classification of aircraft.
\newblock \emph{arXiv preprint arXiv:1306.5151}, 2013.

\bibitem[Qin et~al.(2021)]{qin2021fcanet}
Z.~Qin, P.~Zhang, F.~Wu, and X.~Li.
\newblock {FcaNet}: Frequency channel attention networks.
\newblock In \emph{Proc.\ ICCV}, 2021.

\bibitem[Wah et~al.(2011)]{wah2011cub}
C.~Wah, S.~Branson, P.~Welinder, P.~Perona, and S.~Belongie.
\newblock The {Caltech-UCSD} Birds-200-2011 Dataset.
\newblock Technical report, Caltech, 2011.

\bibitem[Wang et~al.(2021)]{wang2021feature}
J.~Wang, X.~Yu, and Y.~Gao.
\newblock Feature fusion vision transformer for fine-grained visual categorization.
\newblock In \emph{Proc.\ BMVC}, 2021.

\bibitem[Xu et~al.(2020)]{xu2020learning}
Z.~Xu, R.~Wilber, and M.~Fern.
\newblock Learning in the frequency domain.
\newblock In \emph{Proc.\ CVPR}, 2020.

\bibitem[Yang et~al.(2018)]{yang2018learning}
Z.~Yang, T.~Luo, D.~Wang, Z.~Hu, J.~Gao, and L.~Wang.
\newblock Learning to navigate for fine-grained classification.
\newblock In \emph{Proc.\ ECCV}, 2018.

\bibitem[Zhang et~al.(2014)]{zhang2014part}
N.~Zhang, J.~Donahue, R.~Girshick, and T.~Darrell.
\newblock Part-based {R-CNN}s for fine-grained category detection.
\newblock In \emph{Proc.\ ECCV}, 2014.

\bibitem[Zhang et~al.(2019)]{zhang2019learning}
F.~Zhang, M.~Li, G.~Zhai, and Y.~Liu.
\newblock Learning a mixture of granularity-specific experts for fine-grained categorization.
\newblock In \emph{Proc.\ ICCV}, 2019.

\bibitem[Zheng et~al.(2019)]{zheng2019looking}
H.~Zheng, J.~Fu, Z.-J. Zha, and J.~Luo.
\newblock Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition.
\newblock In \emph{Proc.\ CVPR}, 2019.

\bibitem[Zhuang et~al.(2020)]{zhuang2020learning}
P.~Zhuang, Y.~Wang, and Y.~Qiao.
\newblock Learning attentive pairwise interaction for fine-grained classification.
\newblock In \emph{Proc.\ AAAI}, 2020.
\end{thebibliography}
}

\end{document}