% NeurIPS conference paper format (neurips_2023.sty): single-column,
% anonymizable for submission, with distinctive section styling.
% Used by NeurIPS; adapted for ICLR and other ML conferences.
% Category: Conference. License: free to use (MIT). File: neurips/main.tex
\documentclass{article}

% Submission mode (anonymized) is the default; pass [preprint] or [final] (camera-ready) as an option.
% \usepackage{neurips_2023}
% For this standalone version we emulate the NeurIPS style minimally:
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[margin=1.5in]{geometry}
\usepackage{times}
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{amsmath,amssymb,amsthm} % amsthm provides the proof environment used in the appendix
\usepackage{hyperref}
\usepackage{natbib}
\usepackage{xcolor}
\usepackage{enumitem}
\usepackage{algorithm}
\usepackage{algorithmic}
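\hypersetup{colorlinks,citecolor=blue,linkcolor=red,urlcolor=blue}
% \And is defined by neurips_2023.sty; fall back to \and in this standalone version.
\providecommand{\And}{\and}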

\title{Paper Title: Short, Punchy, Informative}

\author{%
  First Author\thanks{Equal contribution.} \\
  Department of X\\
  University of Y\\
  \texttt{email@example.com} \\
  \And
  Second Author\footnotemark[1] \\
  Department of X\\
  University of Y\\
  \texttt{email@example.com} \\
}

\begin{document}
\maketitle

\begin{abstract}
We propose \emph{Spectral Gated Networks} (SGN), a family of
architectures that replace standard dense projections with
frequency-domain gating mechanisms. By operating in the spectral
domain, SGN captures long-range dependencies at a fraction of the
computational cost of self-attention. On language modeling benchmarks
(WikiText-103, C4) and image classification tasks (ImageNet-1K,
CIFAR-100), SGN matches or exceeds Transformer baselines while
reducing FLOPs by 35\%. We provide convergence analysis for the
spectral gating operator and release code and pre-trained checkpoints.
\end{abstract}

\section{Introduction}
Transformer architectures have become the de facto standard across
natural language processing, computer vision, and scientific computing.
Their success hinges on the self-attention mechanism, which models
pairwise interactions among all input tokens. However, self-attention
incurs $O(n^2)$ complexity in sequence length $n$, limiting
applicability to long-context settings such as genomics, document
understanding, and high-resolution image analysis.

Several lines of work attempt to address this bottleneck.
Sparse attention patterns~\citep{vaswani2017attention} reduce the
number of attended pairs, while kernel-based
approximations~\citep{kingma2014auto} linearize the attention
operator. Although effective in certain regimes, these methods
sacrifice modeling capacity: sparse patterns miss critical long-range
interactions, and low-rank kernels under-represent the full attention
matrix.

We take a different approach rooted in signal processing. Our key
observation is that many of the interactions captured by self-attention
correspond to low-frequency components in the token embedding space.
By projecting inputs into the frequency domain via the Discrete Cosine
Transform (DCT), we can selectively gate spectral coefficients to
retain informative interactions while discarding high-frequency noise.
The resulting \emph{Spectral Gated Network} (SGN) achieves $O(n \log n)$
complexity and matches Transformer quality across modalities.
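
To make the asymptotic gap concrete, the following back-of-the-envelope
comparison counts only the leading terms. It ignores constant factors
and the feed-forward sub-layer, so the ratios are illustrative and do
not translate directly into end-to-end FLOP savings. Taking logarithms
base 2 is our assumption:
\[
\begin{array}{rrrr}
  n & n^2 & n\log_2 n & n^2/(n\log_2 n) \\
  512 & 262{,}144 & 4{,}608 & \approx 57 \\
  4096 & 16{,}777{,}216 & 49{,}152 & \approx 341
\end{array}
\]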

Our contributions are:
\begin{itemize}[itemsep=2pt]
  \item A spectral gating layer that replaces self-attention with
        frequency-domain filtering in $O(n\log n)$ time.
  \item Theoretical analysis showing SGN preserves the expressive power
        of full attention under a smoothness assumption on the input
        distribution.
  \item Results that match or exceed Transformer baselines on four
        benchmarks with 35\% fewer FLOPs.
\end{itemize}

\section{Background and Related Work}\label{sec:related}

\paragraph{Efficient Transformers.}
Linear attention methods approximate the softmax kernel with feature
maps, achieving $O(n)$ complexity but often degrading perplexity on
language tasks~\citep{goodfellow2014generative}. Sparse attention
methods such as Longformer and BigBird use fixed or learned sparsity
patterns, trading off between coverage and efficiency.

\paragraph{Spectral Methods in Deep Learning.}
Fourier-domain operations have been explored for image
processing~\citep{kingma2014auto} and sequence modeling. FNet replaces
attention with unparameterized Fourier transforms but lacks a gating
mechanism, limiting expressiveness. Our work introduces learnable
spectral gates that adapt during training.

\section{Method}\label{sec:method}
Let $X \in \mathbb{R}^{n \times d}$ be the input matrix. We define the
spectral gating layer as:
\begin{equation}
  \mathrm{SGL}(X) = \mathrm{IDCT}\!\left(G \odot \mathrm{DCT}(X)\right),
\end{equation}
where $G \in \mathbb{R}^{n \times d}$ is a learnable gating matrix and
$\odot$ denotes element-wise multiplication. The DCT and its inverse are
applied along the sequence dimension and computed in $O(n \log n)$ time
per feature channel via the Fast Cosine Transform.
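
For concreteness, one common choice is the orthonormal DCT-II applied
to each column of $X$. The text above does not fix the variant or the
normalization, so this convention is an assumption. With zero-based
indices $i, k \in \{0, \dots, n-1\}$ and feature index $j$:
\begin{align*}
  \mathrm{DCT}(X)_{k,j} &= \alpha_k \sum_{i=0}^{n-1} X_{i,j}
    \cos\!\left[\frac{\pi}{n}\left(i + \tfrac{1}{2}\right) k\right], \\
  \mathrm{IDCT}(S)_{i,j} &= \sum_{k=0}^{n-1} \alpha_k S_{k,j}
    \cos\!\left[\frac{\pi}{n}\left(i + \tfrac{1}{2}\right) k\right],
\end{align*}
where $\alpha_0 = \sqrt{1/n}$ and $\alpha_k = \sqrt{2/n}$ for $k \ge 1$.
Under this normalization the transform matrix is orthogonal, so
$\mathrm{IDCT}(\mathrm{DCT}(X)) = X$, and the entry $G_{k,j}$ scales the
$k$-th frequency coefficient of feature channel $j$.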

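The full SGN block combines spectral gating with a residual connection,
layer normalization, and a position-wise feed-forward network (FFN),
matching Algorithm~\ref{alg:sgn}:
\begin{equation}
  \mathrm{SGN}(X) = \mathrm{FFN}\!\left(\mathrm{LayerNorm}\!\left(\mathrm{SGL}(X) + X\right)\right).
\end{equation}
The parameters $\theta$ of the resulting model $f_\theta$, which include
the gates $G$, are trained by minimizing the expected loss over the data
distribution $\mathcal{D}$:
\begin{equation}
  \theta^\star = \arg\min_\theta \mathbb{E}_{(x,y)\sim \mathcal{D}}\left[\ell(f_\theta(x), y)\right].
\end{equation}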

\begin{figure}[t]
  \centering
  \fbox{\parbox{0.85\columnwidth}{\centering\vspace{2em}%
    Block diagram of the Spectral Gated Network layer: input $X$ is
    transformed via DCT, element-wise gated, inverse-transformed, then
    passed through a feed-forward network.\vspace{2em}}}
  \caption{Architecture of a single SGN block. The spectral gating path
    (top) replaces the self-attention sub-layer of a standard
    Transformer.}
  \label{fig:sgn-block}
\end{figure}

\begin{algorithm}[t]
\caption{SGN Forward Pass}\label{alg:sgn}
\begin{algorithmic}[1]
\REQUIRE Input $X \in \mathbb{R}^{n \times d}$, gate $G$
\STATE $S \leftarrow \mathrm{DCT}(X)$ \hfill \textit{// spectral projection}
\STATE $\hat{S} \leftarrow G \odot S$ \hfill \textit{// gating}
\STATE $H \leftarrow \mathrm{IDCT}(\hat{S})$ \hfill \textit{// reconstruction}
\STATE $Y \leftarrow \mathrm{FFN}(\mathrm{LayerNorm}(H + X))$
\RETURN $Y$
\end{algorithmic}
\end{algorithm}
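
Two special cases illustrate what the gate in Algorithm~\ref{alg:sgn}
does. They are a sketch that assumes the transform pair is exactly
invertible, that is, $\mathrm{IDCT}(\mathrm{DCT}(X)) = X$; the cutoff
$K$ is a hypothetical parameter used only for illustration, and $k$ is
a zero-based frequency index:
\[
  G_{k,j} = 1 \ \text{for all } k, j
  \quad\Longrightarrow\quad
  H = X, \qquad Y = \mathrm{FFN}(\mathrm{LayerNorm}(2X)),
\]
\[
  G_{k,j} = \begin{cases} 1, & k < K, \\ 0, & k \ge K, \end{cases}
  \quad\Longrightarrow\quad
  \hat{S}_{k,j} = \begin{cases} S_{k,j}, & k < K, \\ 0, & k \ge K. \end{cases}
\]
In the first case the gating path reduces to the identity. In the
second, $H$ is reconstructed from the $K$ lowest-frequency coefficients
only, which is an ideal low-pass filter. A learned $G$ can move between
such filters independently for each feature channel.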

\section{Experiments}\label{sec:exp}
\paragraph{Datasets.} We evaluate on WikiText-103 (language modeling),
C4 (large-scale LM), CIFAR-100, and ImageNet-1K (image classification).

\paragraph{Baselines.} We compare against the standard Transformer,
FNet, Linear Transformer, and Performer.

\paragraph{Implementation.} Unless noted otherwise, models use 12 layers,
hidden dimension 768, and 12 attention heads (for attention-based
baselines). We train with AdamW ($\beta_1\!=\!0.9$, $\beta_2\!=\!0.98$,
weight decay $0.01$) and a cosine learning rate schedule peaking at
$3\times10^{-4}$ for language modeling and $10^{-3}$ for image
classification. Experiments use 4$\times$A100 GPUs with batch size 256
(128 for CIFAR-100). Each run is repeated with three seeds. Per-dataset
settings are listed in Table~\ref{tab:hparams}.
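
The schedule is not specified beyond its shape and peak value. A common
form, given here as an assumption rather than the exact recipe, uses
linear warmup for $T_w$ steps followed by cosine decay over the
remaining $T - T_w$ steps. The symbols $T_w$, $T$, and the floor
$\eta_{\min}$ are hypothetical and introduced only for illustration:
\[
  \eta_t =
  \begin{cases}
    \eta_{\max}\, t / T_w, & t < T_w, \\[2pt]
    \eta_{\min} + \tfrac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)
      \left(1 + \cos\!\left(\pi\,\frac{t - T_w}{T - T_w}\right)\right),
      & T_w \le t \le T,
  \end{cases}
\]
where $\eta_{\max}$ is the peak learning rate. The rate equals
$\eta_{\max}$ at $t = T_w$ and decays to $\eta_{\min}$ at $t = T$.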

\begin{table}[t]
\centering
\caption{Image classification accuracy (\%, higher is better) on CIFAR-100 and ImageNet-1K.}\label{tab:vision}
\begin{tabular}{lcc}
\toprule
Method & CIFAR-100 & ImageNet-1K \\
\midrule
Baseline   & 92.3 & 76.1 \\
SGN (ours) & \textbf{94.7} & \textbf{78.4} \\
\bottomrule
\end{tabular}
\end{table}

\begin{table}[t]
\centering
\caption{Language modeling perplexity (lower is better).}\label{tab:lm}
\begin{tabular}{lcc}
\toprule
Method & WikiText-103 & C4 \\
\midrule
Transformer   & 18.3 & 15.7 \\
FNet          & 22.1 & 19.4 \\
Linear Trans. & 20.8 & 17.9 \\
SGN (ours)    & \textbf{18.1} & \textbf{15.5} \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Ablation Study}\label{sec:ablation}
We ablate key design choices on WikiText-103 to isolate the contribution
of each component.

\begin{table}[t]
\centering
\caption{Ablation study on WikiText-103.}\label{tab:ablation}
\begin{tabular}{lc}
\toprule
Variant & Perplexity \\
\midrule
SGN (full)                   & \textbf{18.1} \\
w/o learnable gates (fixed)  & 20.4 \\
w/o residual connection      & 19.7 \\
FFT instead of DCT           & 18.6 \\
\bottomrule
\end{tabular}
\end{table}

Replacing the learnable gates with fixed ones causes the largest
degradation (2.3 perplexity points; Table~\ref{tab:ablation}),
indicating that adaptive spectral filtering is the most important
component.
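
The gaps follow directly from Table~\ref{tab:ablation}, measured
against the full model at 18.1 perplexity (relative increases are
rounded to one decimal place):
\begin{align*}
  \text{fixed gates:} &\quad 20.4 - 18.1 = 2.3 \quad (\approx 12.7\%), \\
  \text{no residual connection:} &\quad 19.7 - 18.1 = 1.6 \quad (\approx 8.8\%), \\
  \text{FFT instead of DCT:} &\quad 18.6 - 18.1 = 0.5 \quad (\approx 2.8\%).
\end{align*}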

\section{Discussion}
SGN suggests that frequency-domain operations can replace
self-attention without sacrificing quality on the benchmarks we
evaluate. We expect the method to be most effective on tasks with
long-range dependencies, where the spectral bias of the DCT aligns well
with the structure of the signal.

\paragraph{Limitations.}
The current DCT-based gating assumes a fixed sequence length at
initialization. Variable-length inputs require padding or chunking,
which may introduce boundary artifacts. Additionally, the spectral
interpretation is less clear for discrete, non-stationary token
sequences such as code or structured data. Future work will explore
adaptive basis selection and multi-resolution gating.

\section{Conclusion}
We introduced Spectral Gated Networks, a family of efficient
architectures that replace self-attention with learnable
frequency-domain gating. SGN matches Transformer quality on language
and vision benchmarks while reducing computation by 35\%. Theoretical
analysis confirms that spectral gating preserves the expressive capacity
of full attention under smoothness assumptions.

\section*{Broader Impact}
Efficient sequence models reduce the energy cost of training and
inference, making large-scale modeling more accessible to
resource-constrained institutions. However, more efficient models may
also lower the barrier to generating harmful content at scale. We
encourage the community to pair efficiency gains with robust safety
and alignment mechanisms.

\section*{Reproducibility Statement}
All experiments use publicly available datasets and standard hardware.
Hyperparameters, the number of random seeds, and training schedules are
reported in Section~\ref{sec:exp} and Appendix~\ref{app:details}. Source
code, training scripts, and pre-trained checkpoints will be released at
\url{https://github.com/anonymous/sgn} upon acceptance.

\appendix
\section{Proof of Theorem 1}\label{app:proof}
\begin{proof}
Let $\sigma_{\max}$ denote the largest singular value of the attention
matrix $A$. Under the $L$-smoothness assumption on the input
distribution, the spectral gating layer
$\mathrm{SGL}(X) = \mathrm{IDCT}(G \odot \mathrm{DCT}(X))$
approximates $AX$ with error bounded by:
\[
  \|AX - \mathrm{IDCT}(G \odot \mathrm{DCT}(X))\|_F
  \leq \frac{L \sigma_{\max}}{K} \|X\|_F,
\]
where $K \le n$ is the number of retained spectral coefficients. The
bound decreases as $K$ grows, so retaining more coefficients tightens
the approximation.
\end{proof}
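
As a reading of the bound, dividing both sides by $\|X\|_F$ (assuming
$X \neq 0$) gives a relative error of at most $L\sigma_{\max}/K$. A
target tolerance $\varepsilon > 0$, a symbol introduced here for
illustration, is therefore guaranteed whenever
\[
  K \;\ge\; \frac{L\,\sigma_{\max}}{\varepsilon},
\]
which is attainable only if $L\sigma_{\max}/\varepsilon \le n$, since at
most $n$ coefficients exist. Doubling $K$ halves the bound.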

\section{Additional Experimental Details}\label{app:details}
For ImageNet experiments, we use standard data augmentation (random
crop, horizontal flip, color jitter) and train for 300 epochs. For
WikiText-103, we use a context length of 512 tokens and evaluate with
a stride of 256. Per-dataset hyperparameters are listed in
Table~\ref{tab:hparams}.

\begin{table}[h]
\centering
\caption{Hyperparameters for all experiments.}\label{tab:hparams}
\begin{tabular}{lcccc}
\toprule
 & WikiText-103 & C4 & CIFAR-100 & ImageNet-1K \\
\midrule
Layers        & 12  & 12  & 12  & 12  \\
Hidden dim    & 768 & 768 & 384 & 768 \\
Learning rate & $3\times10^{-4}$ & $3\times10^{-4}$ & $10^{-3}$ & $10^{-3}$ \\
Batch size   & 256 & 256 & 128 & 256 \\
Epochs       & 100 & 50  & 200 & 300 \\
\bottomrule
\end{tabular}
\end{table}

\bibliographystyle{plainnat}
\begin{thebibliography}{9}
\bibitem[Goodfellow et~al.(2014)]{goodfellow2014generative}
Goodfellow, I., et~al.
\newblock Generative adversarial nets.
\newblock In \emph{NeurIPS}, 2014.

\bibitem[Kingma and Welling(2014)]{kingma2014auto}
Kingma, D.~P. and Welling, M.
\newblock Auto-encoding variational bayes.
\newblock In \emph{ICLR}, 2014.

\bibitem[Vaswani et~al.(2017)]{vaswani2017attention}
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.~N., Kaiser, {\L}., and Polosukhin, I.
\newblock Attention is all you need.
\newblock In \emph{NeurIPS}, 2017.

\bibitem[He et~al.(2016)]{he2016deep}
He, K., Zhang, X., Ren, S., and Sun, J.
\newblock Deep residual learning for image recognition.
\newblock In \emph{CVPR}, 2016.
\end{thebibliography}

\end{document}