IJCAI

IJCAI proceedings paper using the official ijcai style: two-column layout, author-year citations via the named bibliography style, abstract, main sections, references, and an optional ethical-considerations section. Strict 6+1 page limit (six pages of content plus a seventh for references only), enforced by IJCAI.

Category: Conference
License: Free to use (MIT)
File: ijcai/main.tex

\documentclass{article}
\pdfpagewidth=8.5in
\pdfpageheight=11in

% IJCAI 2024 style file. Swap for most recent year (ijcai25.sty, ijcai26.sty, etc.).
\usepackage{ijcai24}

% Use the postscript times font!
\usepackage{times}
\usepackage{soul}
\usepackage{url}
\usepackage[hidelinks]{hyperref}
\usepackage[small]{caption}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{booktabs}
\usepackage{algorithm}
\usepackage{algorithmic}

\urlstyle{same}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}

\title{Planning in Partially Observable Environments\\
  via Learned State Abstractions}

% Single author version:
% \author{First Last \\ University of Example \\ [email protected]}

% Multi-author / multi-affiliation version:
\author{
First Last$^1$
\and
Jane Doe$^1$
\and
John Smith$^2$\\
\affiliations
$^1$Department of Computer Science, University of Example\\
$^2$Example Research Labs\\
\emails
\{you, jane\}@example.com,
[email protected]
}

\begin{document}
\maketitle

\begin{abstract}
We study planning in partially observable environments where an agent
must jointly learn state abstractions and policies from experience.
Classical POMDP planners scale poorly in the number of hidden states;
end-to-end deep RL requires orders of magnitude more samples. We propose
\textsc{LatentPlan}, a method that learns latent state representations
through variational inference and uses them for efficient model-based
planning. \textsc{LatentPlan} achieves an 18\% average improvement over
prior belief-space planners on six standard POMDP benchmarks with
$10\times$ fewer environment interactions.
\end{abstract}

\section{Introduction}
Partially observable Markov decision processes (POMDPs) are the standard
model for decision-making under uncertainty, but their computational
intractability has long been the central obstacle to their use. Recent
progress in representation learning~\cite{ha2018worldmodels} suggests that
learned abstractions can dramatically reduce the effective state space, yet
integrating them with planning has proven difficult.

\paragraph{Contributions.}
\begin{itemize}
\item A variational objective that jointly trains a latent state
  encoder, a transition model, and a reward model suitable for
  model-based planning.
\item A belief-space planner operating in the learned latent space,
  with theoretical guarantees when the abstraction is Markov-sufficient.
\item An empirical evaluation on six POMDP benchmarks, showing consistent
  improvements over both belief-space planners and deep RL baselines.
\end{itemize}

\section{Related Work}
Belief-space planners such as POMCP~\cite{silver2010pomcp} and DESPOT
plan effectively when the model is known. World models~\cite{ha2018worldmodels}
and Dreamer variants learn dynamics from experience but typically rely on
reactive policies. Our method combines the two lines: it learns the model
from experience and plans in the learned latent space.

\section{Preliminaries}
A POMDP is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, T, O, R, \gamma)$
with states $s$, actions $a$, observations $o$, transition $T(s'|s,a)$,
observation $O(o|s)$, and reward $R(s,a)$. The agent maintains a belief
$b(s) = \Pr[s \mid h]$ from observation history $h$.
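For reference, the standard recursive form of this belief (textbook POMDP
background rather than a contribution of this paper) follows from Bayes'
rule. After taking action $a$ and receiving observation $o$, the updated
belief $b'$ is
\begin{equation*}
b'(s') = \frac{O(o|s') \sum_{s \in \mathcal{S}} T(s'|s,a)\, b(s)}
{\sum_{s'' \in \mathcal{S}} O(o|s'') \sum_{s \in \mathcal{S}} T(s''|s,a)\, b(s)} .
\end{equation*}
Computing $b'$ for all $s'$ takes time quadratic in $|\mathcal{S}|$, which
illustrates why exact belief tracking becomes expensive as the number of
hidden states grows.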

\section{Method}
\subsection{Latent Model}
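We learn an encoder $q_\phi(z|h)$ that maps histories to latent states, a
transition model $p_\theta(z'|z,a)$, and an observation decoder
$p_\theta(o|z)$. Writing $h_{-1}$ for the history up to the previous step,
we train all three by maximizing the evidence lower bound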

\begin{equation}
\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi}\!\left[\log p_\theta(o|z)\right]
- \text{KL}(q_\phi(z|h)\,\|\,p(z|h_{-1})).
\end{equation}
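As a hedged illustration of how the KL term might be evaluated, suppose both
distributions are diagonal Gaussians over $z \in \mathbb{R}^d$, with
$q_\phi(z|h) = \mathcal{N}(\mu_q, \operatorname{diag}(\sigma_q^2))$ and
$p(z|h_{-1}) = \mathcal{N}(\mu_p, \operatorname{diag}(\sigma_p^2))$. This
parameterization and the symbols $\mu_q$, $\sigma_q$, $\mu_p$, $\sigma_p$,
and $d$ are assumptions made only for this example. The KL term then has the
closed form
\begin{equation*}
\text{KL}(q_\phi\,\|\,p) = \sum_{i=1}^{d} \left[
\log\frac{\sigma_{p,i}}{\sigma_{q,i}}
+ \frac{\sigma_{q,i}^2 + (\mu_{q,i} - \mu_{p,i})^2}{2\sigma_{p,i}^2}
- \frac{1}{2} \right],
\end{equation*}
so under this assumption only the reconstruction term requires a Monte Carlo
estimate.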

\subsection{Planning}
Given a belief over latent states, we choose the action that maximizes the
expected action value $Q_\pi(z, a)$ using the cross-entropy method (CEM):
\begin{equation}
  a^* = \arg\max_a \mathbb{E}_{z \sim q_\phi(z|h)} [Q_\pi(z, a)].
  \label{eq:plan}
\end{equation}
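One common instantiation of the cross-entropy method can be sketched as
follows. The sketch assumes a continuous action space, a diagonal Gaussian
sampling distribution, a population of $M$ samples, and an elite set
$\mathcal{E}_k$ of the $K$ best samples; none of these choices is fixed by
the text, and the symbols $M$, $K$, $\mathcal{E}_k$, $\mu_k$, $\sigma_k$, and
$J$ are introduced only for illustration. Let $J(a)$ denote the objective
maximized in~\eqref{eq:plan}. For $k = 0, 1, \dots$, the method iterates
\begin{align*}
a^{(i)} &\sim \mathcal{N}(\mu_k, \operatorname{diag}(\sigma_k^2)),
  \quad i = 1, \dots, M, \\
\mu_{k+1} &= \frac{1}{K} \sum_{i \in \mathcal{E}_k} a^{(i)}, \\
\sigma_{k+1}^2 &= \frac{1}{K} \sum_{i \in \mathcal{E}_k}
  \bigl(a^{(i)} - \mu_{k+1}\bigr)^2,
\end{align*}
where $\mathcal{E}_k$ indexes the $K$ samples with the largest $J(a^{(i)})$
and the square is taken elementwise. The final mean serves as the
approximate maximizer $a^*$.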

\begin{theorem}[Consistency]
If $q_\phi$ is Markov-sufficient, then \textsc{LatentPlan} is asymptotically
$\epsilon$-optimal in the original POMDP.
\end{theorem}

\begin{algorithm}[t]
\caption{\textsc{LatentPlan}}
\begin{algorithmic}[1]
\STATE Collect initial trajectories $\mathcal{D}$ via exploration
\FOR{each iteration}
  \STATE Train $(q_\phi, p_\theta)$ on $\mathcal{D}$ by maximizing $\mathcal{L}_{\text{ELBO}}$
  \STATE Deploy CEM planner in latent space for $N$ episodes, selecting actions via~\eqref{eq:plan}
  \STATE Add new trajectories to $\mathcal{D}$
\ENDFOR
\end{algorithmic}
\end{algorithm}

\section{Experiments}
\begin{table}[t]
\centering
\small
\begin{tabular}{lccc}
\toprule
Environment & POMCP & Dreamer & \textbf{Ours} \\
\midrule
Tag          & 82 &  88 & \textbf{97} \\
RockSample   & 68 &  79 & \textbf{92} \\
PocMan       & 54 &  71 & \textbf{88} \\
LightDark    & 71 &  74 & \textbf{90} \\
LaserTag     & 59 &  66 & \textbf{84} \\
Minigrid-4R  & 43 &  55 & \textbf{81} \\
\bottomrule
\end{tabular}
\caption{Success rate (\%) across POMDP benchmarks.}
\end{table}

\subsection{Ablations}
Disabling the latent-consistency term reduces the success rate by 14 points
on average, which suggests that a consistent latent space is the primary
source of the gains.

\section{Conclusion}
We have presented a scalable approach to POMDP planning with learned
abstractions. Future work includes hierarchical extensions and
application to richer observation modalities.

\section*{Ethical Statement}
Our work addresses planning under uncertainty. We do not foresee direct
ethical concerns beyond those standard to reinforcement learning research.

\section*{Acknowledgments}
We thank the reviewers and collaborators at Example Labs.

\bibliographystyle{named}
\bibliography{refs}

\end{document}