% SOSP
%
% ACM SOSP paper using acmart sigconf. Double-column, 12-page body plus
% unlimited references; supports anonymous review mode. Includes a full
% systems-paper structure: motivation measurements, detailed design,
% implementation notes, evaluation, and an artifact availability statement.
%
% Category: Conference
% License: Free to use (MIT)
% File: sosp/main.tex
%
% SOSP paper template using acmart sigconf. Remove "review,anonymous"
% options for camera-ready submission, and set appropriate metadata.
\documentclass[sigconf,screen,review,anonymous]{acmart}

\usepackage{graphicx}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{listings}
\usepackage{xcolor}

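\lstset{
  basicstyle=\ttfamily\footnotesize,
  columns=fullflexible,
  breaklines=true,
  showstringspaces=false,
  keywordstyle=\color{blue},
  commentstyle=\color{gray}\itshape,
  stringstyle=\color{red!60!black},
}

% Some listings releases ship no Rust definition; declare a minimal one so
% that language=Rust resolves in the listings below.
\lstdefinelanguage{Rust}{
  sensitive=true,
  morekeywords={as,break,const,continue,else,enum,fn,for,if,impl,in,let,
    loop,match,mod,mut,pub,ref,return,struct,trait,type,use,where,while},
  morecomment=[l]{//},
  morecomment=[s]{/*}{*/},
  morestring=[b]",
}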

\AtBeginDocument{%
  \providecommand\BibTeX{{%
    \normalfont B\kern-0.5em{\scshape i\kern-0.25em b}\kern-0.8em\TeX}}}

\acmConference[SOSP '26]{ACM Symposium on Operating Systems Principles}{October 12--16, 2026}{City, Country}
\acmISBN{978-1-4503-XXXX-X/26/10}
\acmDOI{10.1145/XXXXXXX.XXXXXXX}

\setcopyright{acmlicensed}
\copyrightyear{2026}
\acmYear{2026}

\begin{document}

\title{Persephone: A Low-Overhead Rollback Recovery System\\
       for Microservice Architectures}

\author{First Last}
\affiliation{\institution{University of Example}\city{City}\country{Country}}
\email{[email protected]}

\author{Jane Doe}
\affiliation{\institution{Example Research Labs}\city{City}\country{Country}}
\email{[email protected]}

\author{John Smith}
\affiliation{\institution{University of Example}\city{City}\country{Country}}
\email{[email protected]}

\renewcommand{\shortauthors}{First Last, Jane Doe, and John Smith}

\begin{abstract}
Microservice architectures make failures easy to localize but make
consistent recovery hard. We present Persephone, a rollback recovery
system that uses deterministic replay with a novel low-overhead logging
scheme tailored to RPC-based microservice meshes. Persephone achieves
1.6\% median runtime overhead and recovers a 200-node system in under
30 seconds, versus 8.2\% and 90 seconds for the next-best prior work.
Persephone has been running in two production environments for 11 months
and has performed recovery for three real incidents without data loss.
\end{abstract}

\begin{CCSXML}
<ccs2012>
<concept><concept_id>10011007.10010940</concept_id>
<concept_desc>Software and its engineering~Software fault tolerance</concept_desc>
<concept_significance>500</concept_significance></concept>
<concept><concept_id>10011007.10011074.10011134</concept_id>
<concept_desc>Software and its engineering~Distributed systems organizing principles</concept_desc>
<concept_significance>300</concept_significance></concept>
</ccs2012>
\end{CCSXML}
\ccsdesc[500]{Software and its engineering~Software fault tolerance}
\ccsdesc[300]{Software and its engineering~Distributed systems organizing principles}

\keywords{rollback recovery, microservices, deterministic replay, distributed systems}

\maketitle

\section{Introduction}
Microservice systems today rely on ad-hoc recovery strategies---retry,
manual replay, or eventual consistency---which trade off correctness,
latency, and developer burden. At scale, these strategies produce the
long tail of incidents that dominate postmortem reports.

Deterministic replay promises a principled alternative, but it composes
poorly across services because non-determinism enters at every RPC
boundary. Persephone builds on this observation: by moving determinism
enforcement into a sidecar that intercepts RPC traffic, it enables
cross-service replay without instrumenting every application.

\paragraph{Contributions.}
\begin{itemize}
\item A sidecar architecture that enforces per-service determinism with
  minimal logging overhead.
\item A causal log-pruning algorithm that safely discards logs once
  downstream effects are confirmed.
\item An evaluation showing \textless 2\% median overhead and
  sub-30-second recovery at 200-node scale.
\item Production deployment results over 11 months covering three real
  incident recoveries.
\end{itemize}

\section{Background and Motivation}
Deterministic replay is well-studied for single processes but rarely
crosses service boundaries. We surveyed 34 postmortem reports from two
large organizations and found that 71\% of SEV-1 incidents involved
cross-service state inconsistency that would, in principle, be
recoverable via replay.

\section{Design}
\subsection{Architecture}
Each service runs a Persephone sidecar that appends causal records to a
per-service log kept on shared storage. When recovery is triggered, a
coordinator orchestrates rollback across services.

\subsection{Low-Overhead Logging}
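We define the runtime overhead as the relative slowdown
\begin{equation}
  \text{OH} = \frac{t_{\text{Persephone}} - t_{\text{vanilla}}}{t_{\text{vanilla}}},
\end{equation}
where $t_{\text{Persephone}}$ and $t_{\text{vanilla}}$ are the running times
of the same workload with and without Persephone.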
Persephone logs only causally necessary events: inbound RPC non-determinism
and timer reads. Pure function results are recomputed at replay time.
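
As a rough illustration of the definitions above, the following sketch
computes the overhead ratio and classifies events by whether they must be
logged. The \texttt{Event} enum and the functions \texttt{must\_log} and
\texttt{overhead} are hypothetical names introduced for this example, not
part of the Persephone code base, and the timings in \texttt{main} are
made-up values rather than measurements.

\begin{lstlisting}[language=Rust,caption={Overhead ratio and event filter (illustrative sketch).}]
/// Hypothetical event kinds a sidecar might observe (assumed names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    InboundRpc, // non-deterministic input arriving over RPC
    TimerRead,  // reading a clock or timer
    PureResult, // result of a pure function; recomputed at replay
}

/// Returns true for events that cannot be recomputed during replay.
fn must_log(e: Event) -> bool {
    matches!(e, Event::InboundRpc | Event::TimerRead)
}

/// Relative overhead: (t_persephone - t_vanilla) / t_vanilla.
fn overhead(t_persephone: f64, t_vanilla: f64) -> f64 {
    (t_persephone - t_vanilla) / t_vanilla
}

fn main() {
    let events = [Event::InboundRpc, Event::PureResult, Event::TimerRead];
    let logged: Vec<Event> =
        events.iter().copied().filter(|e| must_log(*e)).collect();
    println!("logged {} of {} events: {:?}",
             logged.len(), events.len(), logged);
    // Illustrative timings only (seconds), not measured values.
    println!("overhead = {:.1}%", 100.0 * overhead(101.6, 100.0));
}
\end{lstlisting}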

\subsection{Log Pruning}
\begin{algorithm}[t]
\caption{Causal log pruning (background)}
\begin{algorithmic}[1]
\State $C \gets \text{latest epoch confirmed by all dependents}$
\For{each log segment $s$ with $\text{epoch}(s) < C$}
  \State delete $s$
\EndFor
\end{algorithmic}
\end{algorithm}
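
A minimal Rust sketch of the pruning loop, under stated assumptions:
segments are keyed by epoch in an ordered map, and each dependent reports
the latest epoch it has confirmed. The sketch takes the pruning bound to be
the minimum over dependents, so that no segment a dependent may still need
is deleted. \texttt{Log}, \texttt{prune}, and the \texttt{confirmed} slice
are hypothetical names, not Persephone's API.

\begin{lstlisting}[language=Rust,caption={Epoch-based log pruning (illustrative sketch).}]
use std::collections::BTreeMap;

/// Hypothetical per-service log: epoch -> serialized segment bytes.
struct Log {
    segments: BTreeMap<u64, Vec<u8>>,
}

impl Log {
    /// Deletes every segment whose epoch is below the bound confirmed
    /// by all dependents; returns how many segments were removed.
    fn prune(&mut self, confirmed: &[u64]) -> usize {
        // With no dependents reporting, keep everything (conservative).
        let Some(&bound) = confirmed.iter().min() else {
            return 0;
        };
        let before = self.segments.len();
        // split_off returns keys >= bound; the older part is dropped.
        self.segments = self.segments.split_off(&bound);
        before - self.segments.len()
    }
}

fn main() {
    let mut log = Log { segments: BTreeMap::new() };
    for epoch in 1..=5u64 {
        log.segments.insert(epoch, vec![0u8; 4]);
    }
    // Dependents confirmed epochs 4, 3, and 5, so the bound is 3.
    let removed = log.prune(&[4, 3, 5]);
    let kept: Vec<_> = log.segments.keys().collect();
    println!("removed {removed}, kept {kept:?}");
}
\end{lstlisting}

Keeping segments at the bound itself mirrors the strict inequality in the
algorithm; only epochs strictly below $C$ are discarded.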

\section{Implementation}
Persephone is 14{,}200 lines of Rust plus a 3{,}800-line Go coordinator
and a 1{,}900-line eBPF program for sidecar interception. The
implementation is open-sourced at \url{https://github.com/example/persephone}.

\begin{lstlisting}[language=Rust,caption={Sidecar log append (simplified).},label={lst:log}]
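// Ctx, RpcCall, Record, and Result are assumed to be defined elsewhere
// in the sidecar crate.
fn log_rpc(ctx: &mut Ctx, call: &RpcCall) -> Result<()> {
    // Tag the record with the current epoch.
    let rec = Record::from_call(call, ctx.epoch);
    // Propagate append failures to the caller.
    ctx.writer.append(rec)?;
    Ok(())
}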
\end{lstlisting}
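
To make the listing above concrete, the following self-contained sketch
fills in the surrounding types with placeholder definitions. Every type
here (\texttt{Ctx}, \texttt{RpcCall}, \texttt{Record}, \texttt{Writer}) is a
stand-in whose fields are assumptions made for illustration; in particular,
the in-memory \texttt{Writer} only approximates a writer backed by log
storage.

\begin{lstlisting}[language=Rust,caption={Self-contained log append with placeholder types (illustrative sketch).}]
use std::io;

/// Placeholder RPC description (fields are assumptions).
struct RpcCall {
    method: String,
    payload: Vec<u8>,
}

/// Placeholder log record tagged with the epoch it was written in.
#[derive(Debug, PartialEq)]
struct Record {
    epoch: u64,
    method: String,
    payload_len: usize,
}

impl Record {
    fn from_call(call: &RpcCall, epoch: u64) -> Self {
        Record {
            epoch,
            method: call.method.clone(),
            payload_len: call.payload.len(),
        }
    }
}

/// In-memory stand-in for the log writer.
#[derive(Default)]
struct Writer {
    records: Vec<Record>,
}

impl Writer {
    fn append(&mut self, rec: Record) -> io::Result<()> {
        self.records.push(rec);
        Ok(())
    }
}

struct Ctx {
    epoch: u64,
    writer: Writer,
}

fn log_rpc(ctx: &mut Ctx, call: &RpcCall) -> io::Result<()> {
    let rec = Record::from_call(call, ctx.epoch);
    ctx.writer.append(rec)
}

fn main() -> io::Result<()> {
    let mut ctx = Ctx { epoch: 7, writer: Writer::default() };
    let call = RpcCall {
        method: "Reserve".into(),
        payload: vec![1, 2, 3],
    };
    log_rpc(&mut ctx, &call)?;
    println!("{:?}", ctx.writer.records);
    Ok(())
}
\end{lstlisting}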

\section{Evaluation}
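We evaluate on a 200-node benchmark cluster running four representative
microservice workloads: HotelReservation, SocialNetwork, and
MediaMicroservices from DeathStarBench, plus an internal trace replay.
Table~\ref{tab:main} reports runtime overhead and recovery time for each
system.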

\begin{table}[t]
\centering
\caption{Runtime overhead (OH) and recovery time on 200-node benchmark workloads.}
\label{tab:main}
\begin{tabular}{lcc}
\toprule
System & Runtime OH (\%) & Recovery time (s) \\
\midrule
MonkeyDB            & 11.8 & 140 \\
RollbackKit         & 8.2  & 90 \\
\textbf{Persephone} & \textbf{1.6} & \textbf{28} \\
\bottomrule
\end{tabular}
\end{table}
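
The improvement factors can be read directly off Table~\ref{tab:main}.
Relative to RollbackKit, the next-best system in both columns, simple
division of the reported values gives
\begin{align*}
  \frac{8.2}{1.6} &\approx 5.1\times \text{ lower runtime overhead}, \\
  \frac{90}{28}   &\approx 3.2\times \text{ faster recovery}.
\end{align*}
These ratios are derived from the table entries and are not separately
measured results.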

\subsection{Production Experience}
Persephone has been running in two production environments for 11
months. Over this period, three SEV-1 incidents triggered recoveries,
each of which completed within 60 seconds and restored consistent state
with no manual intervention.

\section{Related Work}
Persephone builds on prior work in deterministic replay (rr, R2, Castor),
microservice orchestration frameworks, and log-structured storage.

\section{Conclusion}
Low-overhead deterministic replay is tractable for modern microservice
systems if the boundaries are well-chosen. Persephone demonstrates a
sidecar-based architecture that delivers both low overhead and fast
recovery at production scale.

\section*{Artifact Availability}
Source code, evaluation scripts, and datasets are available at
\url{https://github.com/example/persephone}. The artifact has been
evaluated under the SOSP artifact evaluation process.

\begin{acks}
We thank our shepherd and the anonymous SOSP reviewers. This work was
supported in part by NSF CNS-XXXXXXX and a gift from Example Research.
\end{acks}

\bibliographystyle{ACM-Reference-Format}
\bibliography{refs}

\end{document}