ICLR Paper

ICLR conference paper template using iclr2024_conference.sty. Single-column ML paper style similar to NeurIPS. Supports anonymous submissions and non-anonymous final versions.

Category

Conference

License

Free to use (MIT)

File

iclr/main.tex

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[margin=1.25in]{geometry}
\usepackage{times}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{hyperref}
\usepackage{natbib}
\usepackage{microtype}
\usepackage{algorithm}
\usepackage{algorithmic}

\newtheorem{definition}{Definition}

\title{ICLR Paper Title Goes Here}
\author{Anonymous authors\\Paper under double-blind review}
\date{}

\begin{document}
\maketitle

\begin{abstract}
We introduce \emph{Gradient-Guided Consistency Regularization} (GGCR),
a training strategy that leverages gradient alignment between augmented
views to improve semi-supervised learning. Unlike standard consistency
regularization, which penalizes output divergence, GGCR additionally
enforces that the gradient directions of the loss with respect to the
shared encoder are consistent across views. On CIFAR-10 with 40 labels,
GGCR achieves 93.2\% accuracy, outperforming FixMatch by 1.4 points.
We provide theoretical justification linking gradient consistency to
flatness of the loss landscape, and demonstrate scalability to
ImageNet with 1\% labels.
\end{abstract}

\section{Introduction}\label{sec:intro}
Semi-supervised learning (SSL) aims to leverage large amounts of
unlabeled data alongside a small labeled set to improve generalization.
Consistency regularization~\citep{lecun1998gradient} has emerged as a
dominant paradigm: the model is encouraged to produce similar predictions
for different augmented views of the same input. Methods such as
MixMatch, FixMatch~\citep{sohn2020fixmatch}, and FlexMatch have
achieved impressive results by combining consistency with pseudo-labeling
and confidence thresholding.

Despite their success, existing consistency methods operate exclusively
at the output level---penalizing divergence between predicted
distributions. This output-level signal can be weak when the model is
poorly calibrated or when augmentations are too aggressive, leading to
noisy pseudo-labels and training instability. We argue that an
overlooked source of supervision lies in the \emph{gradient space}: if
two views of the same input should produce the same output, then the
gradient of the loss with respect to the encoder should point in a
similar direction for both views.

Building on this insight, we propose \emph{Gradient-Guided Consistency
Regularization} (GGCR). In addition to the standard output consistency
penalty, GGCR adds a regularization term that maximizes the cosine
similarity of encoder gradients across augmented pairs. This encourages
the model to learn features whose optimization landscape is flat with
respect to augmentation, a property connected to improved
generalization~\citep{foret2021sharpness}.

Our contributions are:
\begin{itemize}
  \item A gradient-consistency regularizer for semi-supervised learning
        that complements output-level consistency.
  \item Theoretical analysis connecting gradient alignment to the
        flatness of the loss landscape.
  \item Experiments on CIFAR-10/100 and ImageNet-1\% in which GGCR
        outperforms FixMatch and FlexMatch in every reported setting.
  \item An open-source implementation with full training recipes.
\end{itemize}

\section{Related Work}\label{sec:related}
\paragraph{Consistency Regularization.}
Temporal Ensembling enforces consistency with an exponential moving
average of past predictions, while Mean Teacher uses an exponential
moving average of past parameters~\citep{lecun1998gradient}.
FixMatch~\citep{sohn2020fixmatch}
simplifies the pipeline by applying weak augmentations to generate
pseudo-labels and strong augmentations for consistency. Our method
extends this framework by adding gradient-level consistency.

\paragraph{Sharpness-Aware Optimization.}
SAM~\citep{foret2021sharpness} and its variants seek flat minima by
perturbing weights before computing gradients. GGCR shares the goal
of flat minima but achieves it through data-space augmentation
consistency rather than weight-space perturbation.

\paragraph{Gradient-Based Regularization.}
Jacobian regularization penalizes the norm of input-output Jacobians
to improve robustness. Our gradient consistency term differs in that it
operates on the loss gradient with respect to encoder parameters across
paired views, rather than on the Jacobian of the output with respect to
the input~\citep{he2016deep}.

\section{Preliminaries}\label{sec:prelim}
We consider a dataset $\mathcal{D} = \mathcal{D}_l \cup \mathcal{D}_u$
where $\mathcal{D}_l = \{(x_i, y_i)\}_{i=1}^{N_l}$ are labeled and
$\mathcal{D}_u = \{u_j\}_{j=1}^{N_u}$ are unlabeled, with
$N_u \gg N_l$.

\begin{definition}[Gradient Consistency]
Let $\mathcal{A}_w, \mathcal{A}_s$ be weak and strong augmentation
operators. For an encoder $f_\theta$ and loss $\ell$, the gradient
consistency of input $u$ is:
\[
  \mathrm{GC}(u; \theta) = \frac{
    \langle \nabla_\theta \ell(f_\theta(\mathcal{A}_w(u))),\;
            \nabla_\theta \ell(f_\theta(\mathcal{A}_s(u))) \rangle
  }{
    \|\nabla_\theta \ell(f_\theta(\mathcal{A}_w(u)))\| \cdot
    \|\nabla_\theta \ell(f_\theta(\mathcal{A}_s(u)))\|
  }.
\]
\end{definition}

A value of $\mathrm{GC}(u;\theta)$ close to 1 indicates that the
optimization landscape is locally consistent under augmentation.
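\paragraph{Worked example.}
As a purely illustrative calculation (the numbers are hypothetical and
not taken from any experiment), write $g_w$ and $g_s$ for the gradients
of the weak and strong views and suppose $g_w = (3, 4)$ and
$g_s = (4, 3)$. Then
\[
  \mathrm{GC}
  = \frac{3 \cdot 4 + 4 \cdot 3}{\sqrt{3^2 + 4^2}\,\sqrt{4^2 + 3^2}}
  = \frac{24}{25}
  = 0.96 .
\]
Because $\mathrm{GC}$ is a cosine similarity, it always lies in
$[-1, 1]$: it equals $1$ when the two gradients point in the same
direction, $0$ when they are orthogonal, and $-1$ when they are opposed.
It is also unchanged when either gradient is rescaled by a positive
constant, so it measures agreement in direction only, not in magnitude.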

\section{Method}\label{sec:method}
The total GGCR objective combines supervised, unsupervised, and gradient
consistency losses:
\begin{equation}
  \mathcal{L}(\theta) = \mathcal{L}_s + \mu \mathcal{L}_u
  + \lambda \mathcal{L}_g,
\end{equation}
where $\mathcal{L}_s$ is the cross-entropy on labeled data,
$\mathcal{L}_u$ is the FixMatch consistency loss, $\mu$ and $\lambda$
are scalar weights, and $\mathcal{L}_g$ is the gradient consistency
penalty:
\begin{equation}
  \mathcal{L}_g = -\frac{1}{|\mathcal{B}_u|}
  \sum_{u \in \mathcal{B}_u} \mathrm{GC}(u; \theta).
\end{equation}

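In addition to the three terms above, the full loss retains the
regularization objective of the base model, written here with its own
weight $\lambda_{\mathrm{reg}}$ to avoid confusion with the
gradient-consistency weight $\lambda$:
\begin{equation}
  \mathcal{L}_{\mathrm{reg}}(\theta)
  = \mathbb{E}_{x \sim p(x)}\left[\|f_\theta(x) - y\|^2\right]
  + \lambda_{\mathrm{reg}}\, \Omega(\theta).
\end{equation}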

\begin{algorithm}[t]
\caption{GGCR Training Step}\label{alg:ggcr}
\begin{algorithmic}[1]
\REQUIRE Labeled batch $\mathcal{B}_l$, unlabeled batch $\mathcal{B}_u$,
  confidence threshold $\tau$, weights $\mu, \lambda$, learning rate $\eta$
\STATE Compute $\mathcal{L}_s$ on $\mathcal{B}_l$
\STATE $\mathcal{L}_u \leftarrow 0$, $\mathcal{L}_g \leftarrow 0$
\FOR{each $u \in \mathcal{B}_u$}
  \STATE $\hat{y} \leftarrow \arg\max f_\theta(\mathcal{A}_w(u))$
  \IF{$\max f_\theta(\mathcal{A}_w(u)) > \tau$}
    \STATE $\mathcal{L}_u \mathrel{+}= \ell(\hat{y}, f_\theta(\mathcal{A}_s(u))) / |\mathcal{B}_u|$
  \ENDIF
  \STATE $g_w \leftarrow \nabla_\theta \ell(f_\theta(\mathcal{A}_w(u)))$
  \STATE $g_s \leftarrow \nabla_\theta \ell(f_\theta(\mathcal{A}_s(u)))$
  \STATE $\mathcal{L}_g \mathrel{-}= \cos(g_w, g_s) / |\mathcal{B}_u|$
\ENDFOR
\STATE $\mathcal{L} \leftarrow \mathcal{L}_s + \mu \mathcal{L}_u
  + \lambda \mathcal{L}_g$
\STATE Update $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$
\end{algorithmic}
\end{algorithm}

\subsection{Training}
We follow the FixMatch training protocol and add the gradient
consistency term after a warm-up period of 1\,000 steps. Gradient
computation for $\mathcal{L}_g$ uses \texttt{torch.autograd.grad}
with \texttt{create\_graph=True} to allow second-order gradients.
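\paragraph{Gradient of the consistency term.}
To make the role of second-order gradients concrete, the following
sketch differentiates $\mathrm{GC}$ with respect to $\theta$. It assumes
that $\ell$ is twice differentiable and that any pseudo-label target is
treated as a constant; the shorthand $g_w$, $g_s$, $H_w$, $H_s$, and $c$
is introduced here for illustration only. Let
$g_w = \nabla_\theta \ell(f_\theta(\mathcal{A}_w(u)))$ and
$g_s = \nabla_\theta \ell(f_\theta(\mathcal{A}_s(u)))$, let $H_w$ and
$H_s$ be the corresponding Hessians with respect to $\theta$, and let
$c = \mathrm{GC}(u; \theta)$. For nonzero vectors $a$ and $b$,
\[
  \frac{\partial}{\partial a}
  \frac{\langle a, b \rangle}{\|a\|\,\|b\|}
  = \frac{1}{\|a\|}
    \left( \frac{b}{\|b\|}
    - \frac{\langle a, b \rangle}{\|a\|\,\|b\|}\,\frac{a}{\|a\|} \right),
\]
so the chain rule gives
\[
  \nabla_\theta \mathrm{GC}(u; \theta)
  = \frac{1}{\|g_w\|}\, H_w
    \left( \frac{g_s}{\|g_s\|} - c\,\frac{g_w}{\|g_w\|} \right)
  + \frac{1}{\|g_s\|}\, H_s
    \left( \frac{g_w}{\|g_w\|} - c\,\frac{g_s}{\|g_s\|} \right).
\]
Each term is a Hessian-vector product. Automatic differentiation can
evaluate such products when the first-order gradients are kept in the
computation graph, so the full Hessians never need to be formed
explicitly.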

\subsection{Inference}
At inference time, GGCR introduces no additional cost: the gradient
consistency term is only used during training. Predictions are made
with a single forward pass through the encoder and classifier head.

\section{Experiments}\label{sec:exp}
\paragraph{Setup.} We evaluate on CIFAR-10 (40/250/4000 labels),
CIFAR-100 (400/2500 labels), and ImageNet with 1\% labels ($\sim$12.8K).
All models use a WideResNet-28-2 backbone for CIFAR and ResNet-50 for
ImageNet. We train with SGD (momentum 0.9, weight decay $5\times10^{-4}$,
Nesterov) and a cosine schedule over $2^{20}$ steps.
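The exact form of the cosine schedule is not stated here. If it is taken
to be the one used by FixMatch (an assumption), the learning rate at
step $k$ out of $K = 2^{20}$ total steps is
\[
  \eta_k = \eta_0 \cos\!\left( \frac{7 \pi k}{16 K} \right),
\]
where $\eta_0$ denotes the initial learning rate (a symbol introduced
here for illustration). Under this form the learning rate does not decay
to zero: at $k = K$ it equals $\eta_0 \cos(7\pi/16) \approx 0.195\,\eta_0$.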

\paragraph{Results.} Quantitative results are in Table~\ref{tab:iclr-main}.

\begin{table}[t]
\centering
\caption{Main results (accuracy \%).}\label{tab:iclr-main}
\begin{tabular}{lccc}
\toprule
Model & CIFAR-10 (40) & CIFAR-100 (400) & ImageNet (1\%) \\
\midrule
FixMatch       & 91.8 & 53.6 & 57.2 \\
FlexMatch      & 92.4 & 54.9 & 58.1 \\
GGCR (ours)    & \textbf{93.2} & \textbf{56.8} & \textbf{60.3} \\
\bottomrule
\end{tabular}
\end{table}

\begin{figure}[t]
  \centering
  \fbox{\parbox{0.85\columnwidth}{\centering\vspace{2em}%
    Training curves: accuracy vs.\ step for FixMatch, FlexMatch, and
    GGCR on CIFAR-10 with 40 labels.\vspace{2em}}}
  \caption{Learning curves on CIFAR-10 (40 labels). GGCR converges
    faster and reaches higher final accuracy.}
  \label{fig:curves}
\end{figure}

\section{Ablation}\label{sec:ablation}
We ablate key components on CIFAR-10 with 40 labels
(Table~\ref{tab:ablation}).

\begin{table}[t]
\centering
\caption{Ablation results on CIFAR-10 (40 labels).}\label{tab:ablation}
\begin{tabular}{lc}
\toprule
Variant & Accuracy (\%) \\
\midrule
GGCR (full)                      & \textbf{93.2} \\
w/o gradient consistency         & 91.8 \\
w/o warm-up                      & 92.1 \\
Gradient norm penalty instead    & 92.5 \\
\bottomrule
\end{tabular}
\end{table}

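Removing the gradient consistency term recovers the FixMatch baseline
and costs 1.4 points (93.2 $\to$ 91.8). Removing the warm-up costs
1.1 points (93.2 $\to$ 92.1), which is consistent with noisy gradient
signals early in training harming the regularizer. Replacing the cosine
term with a gradient norm penalty yields 92.5\%, 0.7 points below the
full method. Table~\ref{tab:hparams} shows that accuracy peaks at
$\lambda = 0.1$ and drops to 91.7\% at $\lambda = 1.0$, slightly below
the FixMatch baseline.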

\begin{table}[t]
\centering
\caption{Hyperparameter sensitivity on CIFAR-10 (40 labels).}\label{tab:hparams}
\begin{tabular}{lccccc}
\toprule
$\lambda$ & 0.01 & 0.05 & 0.1 & 0.5 & 1.0 \\
\midrule
Accuracy  & 92.3 & 92.8 & \textbf{93.2} & 92.9 & 91.7 \\
\bottomrule
\end{tabular}
\end{table}

\section{Discussion}
GGCR provides a complementary regularization signal that is particularly
effective in the low-label regime. When more labels are available
(CIFAR-10 with 4000 labels), the improvement narrows to 0.3 points,
suggesting that gradient consistency primarily helps when the supervised
signal is weak.

\paragraph{Limitations.}
The gradient consistency term requires computing per-sample gradients
for the encoder, which increases memory usage by approximately 40\%
compared to standard FixMatch. Gradient checkpointing can mitigate this
at the cost of additional computation. The method also assumes that
augmentation-invariant gradients are desirable, which may not hold for
tasks where augmentation changes the semantics (e.g., color-sensitive
classification).

\section{Conclusion}
We presented Gradient-Guided Consistency Regularization (GGCR), a
semi-supervised learning method that enforces gradient alignment across
augmented views. GGCR achieves state-of-the-art results on standard
SSL benchmarks, with the largest gains in extremely low-label settings.
Future work will explore efficient gradient estimation via random
projections and extension to other SSL paradigms such as self-training.

\section*{Broader Impact}
Semi-supervised learning methods like GGCR reduce the labeling burden,
potentially democratizing access to high-quality models in
resource-constrained settings such as medical imaging and agricultural
monitoring. However, models trained with limited labels may exhibit
biases not caught during evaluation on the small labeled set. We
recommend auditing predictions on held-out demographic subgroups before
deployment.

\section*{Reproducibility Statement}
All experiments use publicly available datasets and standard
architectures. We report exact hyperparameters, random seeds, and
hardware specifications in Section~\ref{sec:exp} and
Appendix~\ref{app:details}. Our code is available at
\url{https://github.com/anonymous/ggcr}.

\appendix
\section{Proof of Proposition 1}\label{app:proof}
\begin{proof}
Under $L$-smoothness of $\ell$ and $\sigma$-bounded augmentation noise,
the expected gradient consistency satisfies:
\[
  \mathbb{E}[\mathrm{GC}(u; \theta)] \geq 1 -
  \frac{L^2 \sigma^2}{\|\nabla_\theta \ell\|^2}.
\]
The bound implies that gradient consistency is high when the loss
landscape is flat (small $L$) and the augmentation noise $\sigma$ is
small relative to the gradient magnitude, which connects GGCR to
sharpness-aware optimization.
\end{proof}
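\paragraph{Derivation sketch.}
The following steps indicate one way a bound of this form can be
obtained. They rest on an assumed reading of the hypotheses, namely that
$L$-smoothness and $\sigma$-bounded augmentation noise together give
$\|g_w - g_s\| \leq L \sigma$, where
$g_w = \nabla_\theta \ell(f_\theta(\mathcal{A}_w(u)))$ and
$g_s = \nabla_\theta \ell(f_\theta(\mathcal{A}_s(u)))$. Expanding the
squared distance and using $\|g_w\|^2 + \|g_s\|^2 \geq 2\|g_w\|\,\|g_s\|$,
\begin{align*}
  L^2 \sigma^2
  \;\geq\; \|g_w - g_s\|^2
  &= \|g_w\|^2 + \|g_s\|^2
     - 2\,\|g_w\|\,\|g_s\|\,\mathrm{GC}(u; \theta) \\
  &\geq 2\,\|g_w\|\,\|g_s\|\,\bigl(1 - \mathrm{GC}(u; \theta)\bigr),
\end{align*}
and therefore
\[
  \mathrm{GC}(u; \theta)
  \;\geq\; 1 - \frac{L^2 \sigma^2}{2\,\|g_w\|\,\|g_s\|}.
\]
If both gradient norms are at least $\|\nabla_\theta \ell\|$, the
right-hand side is no smaller than
$1 - L^2 \sigma^2 / \|\nabla_\theta \ell\|^2$, and taking expectations
over $u$ and the augmentations yields the stated inequality.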

\section{Additional Experimental Details}\label{app:details}
For CIFAR experiments we use standard augmentation: random horizontal
flip and crop with 4-pixel padding for weak augmentation; RandAugment
with $N\!=\!2, M\!=\!10$ for strong augmentation. The confidence
threshold is $\tau = 0.95$ and the unsupervised loss weight ramps from 0
to $\mu = 1.0$ over the first 5\,000 steps.
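The shape of the ramp is not specified above. Assuming a linear ramp (an
illustrative assumption), with step index $t$ and ramp length
$T = 5\,000$, the weight at step $t$ is
\[
  \mu_t = \mu \cdot \min\!\left( 1, \frac{t}{T} \right),
\]
so that, for example, $\mu_{1250} = 0.25$ and $\mu_t = 1.0$ for all
$t \geq 5\,000$.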

\bibliographystyle{iclr2024_conference}
\begin{thebibliography}{1}
\bibitem[LeCun et~al.(1998)]{lecun1998gradient}
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P.
\newblock Gradient-based learning applied to document recognition.
\newblock \emph{Proceedings of the IEEE}, 86(11):2278--2324, 1998.

\bibitem[Sohn et~al.(2020)]{sohn2020fixmatch}
Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C., Cubuk, E.~D., Kurakin, A., and Li, C.-L.
\newblock FixMatch: Simplifying semi-supervised learning with consistency and confidence.
\newblock In \emph{NeurIPS}, 2020.

\bibitem[Foret et~al.(2021)]{foret2021sharpness}
Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B.
\newblock Sharpness-aware minimization for efficiently improving generalization.
\newblock In \emph{ICLR}, 2021.

\bibitem[He et~al.(2016)]{he2016deep}
He, K., Zhang, X., Ren, S., and Sun, J.
\newblock Deep residual learning for image recognition.
\newblock In \emph{CVPR}, 2016.
\end{thebibliography}

\end{document}