% ICCV
%
% ICCV paper using the official CVPR/ICCV shared style: double-column,
% 8-page + references limit, blind-review mode toggle, figure/subfigure
% handling, and a full main-conference paper structure (intro, related
% work, method, experiments, ablation, references).
%
% Category: Conference
% License: Free to use (MIT)
% File: iccv/main.tex

\documentclass[10pt,twocolumn,letterpaper]{article}

% ICCV style (shared with CVPR). For final (camera-ready) use \iccvfinalcopy.
\usepackage{iccv}
% \iccvfinalcopy % *** Uncomment for camera-ready submission

\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{booktabs}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{multirow}
\usepackage{algorithm}
\usepackage{algorithmic}

% Include other packages here, before hyperref.
\usepackage[pagebackref=true,breaklinks=true,letterpaper=true,colorlinks,bookmarks=false]{hyperref}

\iccvPaperID{1234} % *** Enter the ICCV Paper ID here
\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}

% Pages are numbered in submission mode, and un-numbered in camera-ready
\ificcvfinal\pagestyle{empty}\fi

\begin{document}

%%%%%%%%% TITLE
\title{Mask-Guided Diffusion for Compositional Scene Generation}

\author{First Last\\
University of Example\\
City, Country\\
{\tt\small [email protected]}
% For a paper whose authors are all at the same institution,
% omit the following lines up until the closing ``}''.
% Additional authors and addresses can be added with ``\and'',
% just like the second author.
\and
Jane Doe\\
Example Research Labs\\
City, Country\\
{\tt\small [email protected]}
\and
John Smith\\
University of Example\\
{\tt\small [email protected]}
}

\maketitle

%%%%%%%%% ABSTRACT
\begin{abstract}
We propose a mask-guided diffusion model for compositional scene
generation that produces images from paired layout masks and captions.
Existing controllable-generation methods handle either layout or caption
cleanly but struggle with joint conditioning, producing artefacts at
object boundaries or ignoring the layout on complex prompts. Our method
decouples the two signals through separate cross-attention branches,
enabling fine-grained control over both spatial composition and
semantic content. On COCO-Stuff, our method lowers FID from 17.2 to 11.8
relative to the strongest prior baseline while raising CLIP similarity
from 0.30 to 0.32. Code and
pretrained weights are available at \url{https://example.com/maskdiff}.
\end{abstract}

%%%%%%%%% BODY TEXT
\section{Introduction}
Generating images from text alone offers little control over spatial
composition, yet users frequently have a layout in mind. Prior work on
controllable generation~\cite{zhang2023controlnet,li2023gligen} addresses
this via spatial conditioning but often at the cost of caption fidelity.

We introduce a method that accepts both a caption and a layout mask,
generating scenes that respect both simultaneously. The key insight is
that layout and caption should not share the same cross-attention
pathway: doing so creates a race for attention budget and produces the
failure modes we observe in prior work.

\paragraph{Contributions.}
(1) A dual-branch cross-attention architecture for mask-and-caption
conditioning; (2) a layout-dropout training regimen that improves
mask adherence at inference time; (3) state-of-the-art FID and CLIP
scores on the COCO-Stuff layout-to-image benchmark.

\section{Related Work}
\paragraph{Diffusion models.}
DDPMs~\cite{ho2020ddpm} and latent diffusion~\cite{rombach2022ldm} have
become the default for generative modeling of images.

\paragraph{Controllable generation.}
ControlNet~\cite{zhang2023controlnet}, T2I-Adapter, and
GLIGEN~\cite{li2023gligen} add spatial control pathways on top of a
pretrained diffusion model. Our work differs in routing the layout
through its own cross-attention branch, separate from the caption.

\paragraph{Layout-to-image.}
Prior layout-to-image methods include LAMA, LostGAN, and LDM-S.
They typically omit caption conditioning entirely.

\section{Method}
\subsection{Architecture}
We keep the base UNet of a pretrained latent diffusion model and add a
second set of cross-attention layers that process layout tokens
(Figure~\ref{fig:pipeline}). Each layout token carries both an
object-class embedding and a spatial coordinate encoding.
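
One way to write the dual-branch block is sketched below. It rests on
assumptions the text leaves open: the additive combination of the two
branches, the separate projection matrices, and the symbols $h$, $C$,
$L$, $k_i$, $b_i$, $E$, and $\gamma$ are introduced here purely for
illustration.
\begin{align*}
\ell_i &= E(k_i) + \gamma(b_i),\\
h' &= h + \mathrm{Attn}\big(hW_Q^{c},\, CW_K^{c},\, CW_V^{c}\big)\\
   &\quad + \mathrm{Attn}\big(hW_Q^{m},\, LW_K^{m},\, LW_V^{m}\big),
\end{align*}
where $\mathrm{Attn}(Q,K,V)=\mathrm{softmax}(QK^\top/\sqrt{d})\,V$, $h$
denotes UNet features, $C$ the caption token embeddings, and
$L=[\ell_1;\dots;\ell_N]$ the layout tokens; $k_i$ is the class of
object $i$, $b_i$ its spatial coordinates, $E$ a class embedding table,
and $\gamma$ a coordinate encoding. Because each branch has its own
softmax, caption and layout tokens are normalized separately and do not
compete for the same attention mass.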

\subsection{Training Objective}
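We train with independent condition dropout so that classifier-free
guidance over both conditions can be applied at sampling time:
\begin{equation}
\begin{split}
\hat\epsilon ={}& \epsilon_\theta(z_t, t, \varnothing, \varnothing) \\
&+ s \cdot \big(\epsilon_\theta(z_t, t, c, m) - \epsilon_\theta(z_t, t, \varnothing, \varnothing)\big),
\end{split}
\label{eq:cfg}
\end{equation}
where $c$ is the caption, $m$ the layout mask, $s$ the guidance scale,
and $\varnothing$ a dropped (null) condition. During training, caption
and layout are dropped independently with probability $p_c=p_m=0.1$.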
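
To make the dropout schedule concrete: because the two conditions are
dropped independently with $p_c=p_m=0.1$, the four conditioning patterns
occur with probabilities
\begin{align*}
P(c, m) &= (1-p_c)(1-p_m) = 0.81,\\
P(\varnothing, m) &= p_c\,(1-p_m) = 0.09,\\
P(c, \varnothing) &= (1-p_c)\,p_m = 0.09,\\
P(\varnothing, \varnothing) &= p_c\,p_m = 0.01.
\end{align*}
Only about 1\% of training samples are therefore fully unconditional,
which is the case the first term of Eq.~\eqref{eq:cfg} relies on.
Writing $\epsilon_\varnothing$ and $\epsilon_{c,m}$ as shorthand (ours,
for illustration) for the unconditional and jointly conditioned
predictions, Eq.~\eqref{eq:cfg} rearranges to
\begin{equation*}
\hat\epsilon = (1-s)\,\epsilon_\varnothing + s\,\epsilon_{c,m},
\end{equation*}
so $s=0$ recovers unconditional sampling, $s=1$ the plain conditional
model, and $s>1$ extrapolates away from the unconditional prediction.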

\begin{figure}[t]
\centering
\rule{0.9\linewidth}{3cm} % placeholder
\caption{Overview of the mask-guided diffusion pipeline: a pretrained
UNet is augmented with a second cross-attention branch receiving layout
tokens derived from the mask.}
\label{fig:pipeline}
\end{figure}
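
Algorithm~\ref{alg:dropout} sketches one training step with the
independent condition dropout described above. It is illustrative
rather than the exact recipe: the noise-prediction loss and the forward
process with cumulative schedule $\bar\alpha_t$ follow the standard DDPM
formulation~\cite{ho2020ddpm} and are assumed here, since the text
specifies only the dropout probabilities.

\begin{algorithm}[t]
\caption{One training step with independent condition dropout
(illustrative sketch; loss and noise schedule are assumed).}
\label{alg:dropout}
\begin{algorithmic}[1]
\REQUIRE latent $z_0$, caption $c$, layout mask $m$, dropout rates $p_c = p_m = 0.1$
\STATE Sample $t \sim \mathcal{U}\{1,\dots,T\}$ and $\epsilon \sim \mathcal{N}(0, I)$
\STATE $z_t \leftarrow \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$
\STATE Sample $u_c, u_m \sim \mathcal{U}[0,1]$ independently
\IF{$u_c < p_c$}
\STATE $c \leftarrow \varnothing$ \COMMENT{drop the caption}
\ENDIF
\IF{$u_m < p_m$}
\STATE $m \leftarrow \varnothing$ \COMMENT{drop the layout}
\ENDIF
\STATE Take a gradient step on $\lVert \epsilon - \epsilon_\theta(z_t, t, c, m) \rVert_2^2$
\end{algorithmic}
\end{algorithm}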

\section{Experiments}
\subsection{Setup}
We fine-tune Stable Diffusion XL (SDXL) on COCO-Stuff with mask
annotations. All baselines use the same backbone weights.
Table~\ref{tab:main} reports FID, CLIP similarity, and IoU for each
method.

\begin{table}[t]
\centering
\small
\begin{tabular}{lccc}
\toprule
Method & FID$\downarrow$ & CLIP-sim$\uparrow$ & IoU$\uparrow$ \\
\midrule
SDXL                  & 23.4 & 0.29 & 0.41 \\
ControlNet~\cite{zhang2023controlnet}  & 18.9 & 0.28 & 0.64 \\
GLIGEN~\cite{li2023gligen}             & 17.2 & 0.30 & 0.68 \\
\textbf{Ours}         & \textbf{11.8} & \textbf{0.32} & \textbf{0.76} \\
\bottomrule
\end{tabular}
\caption{Quantitative results on COCO-Stuff. Best results are in bold;
arrows indicate whether lower or higher is better.}
\label{tab:main}
\end{table}
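
As a reading aid, the gaps to GLIGEN, the strongest baseline in
Table~\ref{tab:main}, follow directly from the reported values. This is
plain arithmetic on the table, not an additional measurement:
\begin{align*}
\Delta\mathrm{FID} &= 17.2 - 11.8 = 5.4 \quad (\approx 31\%\ \text{relative}),\\
\Delta\text{CLIP-sim} &= 0.32 - 0.30 = 0.02,\\
\Delta\mathrm{IoU} &= 0.76 - 0.68 = 0.08.
\end{align*}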

\subsection{Ablations}
Disabling the dual-branch cross-attention worsens FID by 4.3 points.
Removing layout dropout during training lowers IoU by 6.1 percentage
points at inference.
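
For orientation, the ablated scores can be estimated under two
assumptions that the text does not state explicitly: both deltas are
measured against the full model in Table~\ref{tab:main}, and the IoU
drop is expressed in percentage points. Under these assumptions,
\begin{align*}
\text{FID (no dual branch)} &\approx 11.8 + 4.3 = 16.1,\\
\text{IoU (no layout dropout)} &\approx 0.76 - 0.061 = 0.699.
\end{align*}
Both values would remain ahead of GLIGEN's 17.2 FID and 0.68 IoU. These
are derived figures, not separately reported measurements.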

\section{Limitations and Discussion}
Our method requires mask annotations at training time. Performance
degrades on classes unseen in COCO-Stuff, suggesting that broader
training corpora with paired layouts and captions are needed.

\section{Conclusion}
Decoupled conditioning is a simple and effective recipe for controllable
compositional generation, delivering both layout fidelity and caption
adherence.

{\small
\bibliographystyle{ieee_fullname}
\bibliography{refs}
}

\end{document}