% ACM SOSP paper using acmart sigconf. Double-column, 12-page body + unlimited
% references, supports anonymous review mode. Includes full systems-paper
% structure: motivation measurements, detailed design, implementation notes,
% evaluation, and artifact availability statement.
% File: sosp/main.tex
% SOSP paper template using acmart sigconf. Remove "review,anonymous"
% options for camera-ready submission, and set appropriate metadata.
\documentclass[sigconf,screen,review,anonymous]{acmart}
\usepackage{graphicx}
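% acmart's math fonts already provide the amssymb symbols; loading amssymb
% on top of them can raise a "\Bbbk already defined" error.
\usepackage{amsmath}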
\usepackage{booktabs}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{listings}
\usepackage{xcolor}
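% Minimal Rust definition: the listings package may not ship one, and
% language=Rust fails to load without it.
\lstdefinelanguage{Rust}{
morekeywords={as,break,const,continue,else,enum,fn,for,if,impl,in,let,loop,
match,mod,mut,pub,ref,return,self,struct,trait,type,use,where,while},
sensitive=true,
morecomment=[l]{//},
morecomment=[s]{/*}{*/},
morestring=[b]",
}
\lstset{
basicstyle=\ttfamily\footnotesize,
columns=fullflexible,
breaklines=true,
showstringspaces=false,
keywordstyle=\color{blue},
commentstyle=\color{gray}\itshape,
stringstyle=\color{red!60!black},
}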
\AtBeginDocument{%
\providecommand\BibTeX{{%
\normalfont B\kern-0.5em{\scshape i\kern-0.25em b}\kern-0.8em\TeX}}}
\acmConference[SOSP '26]{ACM Symposium on Operating Systems Principles}{October 12--16, 2026}{City, Country}
\acmISBN{978-1-4503-XXXX-X/26/10}
\acmDOI{10.1145/XXXXXXX.XXXXXXX}
\setcopyright{acmlicensed}
\copyrightyear{2026}
\acmYear{2026}
\begin{document}
\title{Persephone: A Low-Overhead Rollback Recovery System\\
for Microservice Architectures}
\author{First Last}
\affiliation{\institution{University of Example}\city{City}\country{Country}}
\email{[email protected]}
\author{Jane Doe}
\affiliation{\institution{Example Research Labs}\country{Country}}
\email{[email protected]}
\author{John Smith}
\affiliation{\institution{University of Example}\country{Country}}
\email{[email protected]}
\renewcommand{\shortauthors}{First Last, Jane Doe, and John Smith}
\begin{abstract}
Microservice architectures make failures easy to localize but make
consistent recovery hard. We present Persephone, a rollback recovery
system that uses deterministic replay with a novel low-overhead logging
scheme tailored to RPC-based microservice meshes. Persephone achieves
1.6\% median runtime overhead and recovers a 200-node system in under
30 seconds, versus 8.2\% and 90 seconds for the next-best prior work.
Persephone has been running in two production environments for 11 months
and has performed recovery for three real incidents without data loss.
\end{abstract}
\begin{CCSXML}
<ccs2012>
<concept><concept_id>10011007.10010940</concept_id>
<concept_desc>Software and its engineering~Software fault tolerance</concept_desc>
<concept_significance>500</concept_significance></concept>
<concept><concept_id>10011007.10011074.10011134</concept_id>
<concept_desc>Software and its engineering~Distributed systems organizing principles</concept_desc>
<concept_significance>300</concept_significance></concept>
</ccs2012>
\end{CCSXML}
\ccsdesc[500]{Software and its engineering~Software fault tolerance}
\ccsdesc[300]{Software and its engineering~Distributed systems organizing principles}
\keywords{rollback recovery, microservices, deterministic replay, distributed systems}
\maketitle
\section{Introduction}
Microservice systems today rely on ad-hoc recovery strategies---retry,
manual replay, or eventual consistency---which trade off correctness,
latency, and developer burden. At scale, these strategies produce the
long tail of incidents that dominate postmortem reports.
Deterministic replay promises a principled alternative, but it composes
poorly across services because non-determinism arises at every RPC
boundary. We designed Persephone around this observation: by moving
determinism enforcement into a sidecar that intercepts RPC traffic, we
enable cross-service replay without instrumenting every application.
\paragraph{Contributions.}
\begin{itemize}
\item A sidecar architecture that enforces per-service determinism with
minimal logging overhead.
\item A causal log-pruning algorithm that safely discards logs once
downstream effects are confirmed.
\item An evaluation showing less than 2\% median overhead and
sub-30-second recovery at 200-node scale.
\item Production deployment results over 11 months covering three real
incident recoveries.
\end{itemize}
\section{Background and Motivation}
Deterministic replay is well-studied for single processes but rarely
crosses service boundaries. We surveyed 34 postmortem reports from two
large organizations and found that 71\% of SEV-1 incidents involved
cross-service state inconsistency that would, in principle, be
recoverable via replay.
\section{Design}
\subsection{Architecture}
Each service runs a Persephone sidecar that appends causal records to a
per-service log kept on shared storage. When recovery is triggered, a
coordinator orchestrates rollback across services.
\subsection{Low-Overhead Logging}
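We define the runtime overhead as
\begin{equation}
\text{OH} = \frac{t_{\text{Persephone}} - t_{\text{vanilla}}}{t_{\text{vanilla}}},
\end{equation}
where $t_{\text{Persephone}}$ and $t_{\text{vanilla}}$ are the running
times of the same workload with and without Persephone, respectively.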
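For example, with hypothetical timings chosen only for illustration,
$t_{\text{vanilla}} = 100\,\text{s}$ and
$t_{\text{Persephone}} = 101.6\,\text{s}$, the overhead evaluates to
\[
\frac{101.6 - 100}{100} = 0.016 = 1.6\%.
\]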
Persephone logs only causally necessary events: inbound RPC non-determinism
and timer reads. Pure function results are recomputed at replay time.
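
To make the distinction concrete, the following sketch logs a timer read
during normal execution and returns the logged value during replay, while
a pure function is simply recomputed. It is a minimal illustration under
our own assumptions, not Persephone's implementation: the names
\texttt{Mode}, \texttt{EventLog}, \texttt{read\_timer}, and
\texttt{deadline} are hypothetical.

\begin{lstlisting}[language=Rust,caption={Recording and replaying a timer read (illustrative sketch).},label={lst:replay-sketch}]
use std::collections::VecDeque;

/// Whether the sidecar is recording live execution or replaying a log.
pub enum Mode {
    Record,
    Replay,
}

/// Hypothetical in-memory log of non-deterministic values (timer reads).
pub struct EventLog {
    pub mode: Mode,
    pub entries: VecDeque<u64>,
}

impl EventLog {
    /// In Record mode, call `source` and log its result.
    /// In Replay mode, ignore `source` and return the next logged value
    /// (None once the log is exhausted).
    pub fn read_timer(&mut self, source: impl FnOnce() -> u64) -> Option<u64> {
        match self.mode {
            Mode::Record => {
                let now = source();
                self.entries.push_back(now);
                Some(now)
            }
            Mode::Replay => self.entries.pop_front(),
        }
    }
}

/// A pure function: never logged, simply recomputed during replay.
pub fn deadline(now: u64, timeout: u64) -> u64 {
    now + timeout
}
\end{lstlisting}

In this sketch only the timer value enters the log; the result of
\texttt{deadline} is derived from it again at replay time.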
\subsection{Log Pruning}
\begin{algorithm}[t]
\caption{Causal log pruning (background)}
\begin{algorithmic}[1]
\State $C \gets \text{latest epoch confirmed by all dependents}$
\For{each log segment $s$ with $\text{epoch}(s) < C$}
\State delete $s$
\EndFor
\end{algorithmic}
\end{algorithm}
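
The causal log pruning loop can be sketched in Rust as follows. The
sketch assumes that $C$ is the smallest of the dependents' latest
confirmed epochs, so that no dependent still needs a deleted segment.
The \texttt{Segment} type, the \texttt{prune} function, and the integer
epoch representation are hypothetical and serve only as illustration.

\begin{lstlisting}[language=Rust,caption={Causal log pruning (illustrative sketch).},label={lst:prune-sketch}]
/// Hypothetical log segment tagged with the epoch it belongs to.
#[derive(Debug, PartialEq)]
pub struct Segment {
    pub epoch: u64,
}

/// Deletes every segment older than the epoch that all dependents have
/// confirmed and returns that epoch (None if no dependent has reported).
/// Assumption: `confirmed` holds each dependent's latest confirmed epoch.
pub fn prune(log: &mut Vec<Segment>, confirmed: &[u64]) -> Option<u64> {
    // C is the minimum across dependents, so every dependent confirmed it.
    let c = confirmed.iter().copied().min()?;
    // Keep segments with epoch >= C; drop those with epoch < C.
    log.retain(|s| s.epoch >= c);
    Some(c)
}
\end{lstlisting}

Taking the minimum is the conservative choice: with dependents at epochs
4, 3, and 5, only segments older than epoch 3 are deleted.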
\section{Implementation}
Persephone is 14{,}200 lines of Rust plus a 3{,}800-line Go coordinator
and a 1{,}900-line eBPF program for sidecar interception. The
implementation is open-sourced at \url{https://github.com/example/persephone}.
\begin{lstlisting}[language=Rust,caption={Sidecar log append (simplified).},label={lst:log}]
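fn log_rpc(ctx: &mut Ctx, call: &RpcCall) -> Result<()> {
    // Build a log record from the call and the current epoch.
    let rec = Record::from_call(call, ctx.epoch);
    // Append to the per-service log; propagate write errors to the caller.
    ctx.writer.append(rec)?;
    Ok(())
}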
\end{lstlisting}
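
Listing~\ref{lst:log} omits the surrounding types. The following
self-contained sketch fills them in so that the function compiles on its
own. All definitions other than the body of \texttt{log\_rpc} are our
assumptions for illustration: the fields of \texttt{RpcCall},
\texttt{Record}, and \texttt{Ctx}, the in-memory \texttt{Writer}, and the
\texttt{LogError} type are hypothetical and need not match the Persephone
code base.

\begin{lstlisting}[language=Rust,caption={Self-contained variant of the log append (illustrative sketch).},label={lst:log-sketch}]
/// Hypothetical error type for a failed log write.
#[derive(Debug, PartialEq)]
pub struct LogError;

/// Hypothetical RPC descriptor.
pub struct RpcCall {
    pub method: String,
    pub payload: Vec<u8>,
}

/// Hypothetical causal record: the call plus the epoch it was seen in.
#[derive(Debug, PartialEq)]
pub struct Record {
    pub epoch: u64,
    pub method: String,
    pub payload: Vec<u8>,
}

impl Record {
    pub fn from_call(call: &RpcCall, epoch: u64) -> Self {
        Record {
            epoch,
            method: call.method.clone(),
            payload: call.payload.clone(),
        }
    }
}

/// In-memory stand-in for the log storage, with a capacity limit so
/// that the error path can be exercised.
pub struct Writer {
    pub records: Vec<Record>,
    pub capacity: usize,
}

impl Writer {
    pub fn append(&mut self, rec: Record) -> Result<(), LogError> {
        if self.records.len() >= self.capacity {
            return Err(LogError);
        }
        self.records.push(rec);
        Ok(())
    }
}

pub struct Ctx {
    pub epoch: u64,
    pub writer: Writer,
}

/// Same body as the simplified listing, with the error type spelled out.
pub fn log_rpc(ctx: &mut Ctx, call: &RpcCall) -> Result<(), LogError> {
    let rec = Record::from_call(call, ctx.epoch);
    ctx.writer.append(rec)?;
    Ok(())
}
\end{lstlisting}

The \texttt{?} operator returns the writer's error to the caller
unchanged, so a failed append never reports success.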
\section{Evaluation}
We evaluate on a 200-node benchmark cluster running representative
microservice workloads: the HotelReservation and SocialNetwork
applications from DeathStarBench and an internal trace replay.
\begin{table}[t]
\centering
\caption{Median runtime overhead (OH) and recovery time on the 200-node benchmark workloads.}
\label{tab:main}
\begin{tabular}{lcc}
\toprule
System & Runtime OH (\%) & Recovery time (s) \\
\midrule
MonkeyDB & 11.8 & 140 \\
RollbackKit & 8.2 & 90 \\
\textbf{Persephone} & \textbf{1.6} & \textbf{28} \\
\bottomrule
\end{tabular}
\end{table}
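
Read as ratios, the entries of Table~\ref{tab:main} give
\[
\frac{8.2}{1.6} \approx 5.1, \qquad \frac{90}{28} \approx 3.2
\]
relative to RollbackKit, and
\[
\frac{11.8}{1.6} \approx 7.4, \qquad \frac{140}{28} = 5
\]
relative to MonkeyDB. Persephone's overhead is thus roughly 5 to 7 times
lower and its recovery roughly 3 to 5 times faster than the baselines.
These ratios are derived only from the table entries, not from additional
measurements.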
\subsection{Production Experience}
Persephone has been running in two production environments for 11
months. Over this period, three SEV-1 incidents triggered recoveries,
each of which completed within 60 seconds and restored consistent state
with no manual intervention.
\section{Related Work}
Persephone relates to prior work on deterministic replay (rr, R2,
Castor), microservice orchestration frameworks, and log-structured
storage.
\section{Conclusion}
Low-overhead deterministic replay is tractable for modern microservice
systems if the boundaries are well-chosen. Persephone demonstrates a
sidecar-based architecture that delivers both low overhead and fast
recovery at production scale.
\section*{Artifact Availability}
Source code, evaluation scripts, and datasets are available at
\url{https://github.com/example/persephone}. The artifact has been
evaluated under the SOSP artifact evaluation process.
\begin{acks}
We thank our shepherd and the anonymous SOSP reviewers. This work was
supported in part by NSF CNS-XXXXXXX and a gift from Example Research.
\end{acks}
\bibliographystyle{ACM-Reference-Format}
\bibliography{refs}
\end{document}
