ACL conference long paper using the official `acl` style. Two-column layout with an 8-page body plus unlimited references, and ACL Anthology citation format. Includes the mandatory Limitations section and an Ethics Statement, per ACL 2023+ policy. The same template works for Findings with minor tweaks.
acl/main.tex
\documentclass[11pt]{article}
% ACL 2024+ uses the shared `acl` style. Pass the `review` option for
% anonymous submission (adds line numbers). For camera-ready, pass `final`
% in recent versions of the style, or no option in older versions.
% The `preprint` option gives a non-anonymous version with page numbers.
\usepackage[]{acl}
\usepackage{times}
\usepackage{latexsym}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{microtype}
\usepackage{inconsolata}
\usepackage{graphicx}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{natbib}
\usepackage{multirow}
\usepackage{url}
% If the title and author information does not fit in the allotted area,
% enlarge the title box, e.g. \setlength\titlebox{6cm} (use 5cm or larger).
\title{Calibrated Confidence Estimation for\\
Large Language Model Responses}
\author{First Last \\
University of Example \\
\texttt{email@domain} \\\And
Jane Doe \\
Example Research Labs \\
\texttt{email@domain} \\\AND
John Smith \\
University of Example \\
\texttt{email@domain} \\}
\begin{document}
\maketitle
\begin{abstract}
Large language models produce fluent but often overconfident answers,
limiting their reliability for deployment in high-stakes domains. We
introduce a lightweight calibration method that computes response-level
confidence from internal token statistics, without any additional
training. Across seven QA benchmarks and six models spanning 7B--70B
parameters, our method reduces expected calibration error (ECE) by 48\%
on average relative to temperature scaling, while preserving task
accuracy. Unlike learned approaches, our method needs no held-out
labeled data in its default configuration and generalizes zero-shot
across domains.
\end{abstract}
\section{Introduction}
Reliable confidence estimates matter for safe deployment of large
language models. Existing calibration approaches either require
substantial labeled data~\citep{guo2017calibration} or degrade
downstream accuracy as a side effect of confidence adjustment.
In this work we show that an effective calibration signal is already
present in the token-level probabilities produced by any LLM:
responses that are internally consistent in their probability
assignments tend to be factually correct, while responses with
high-entropy or bimodal assignments tend to be incorrect.
\paragraph{Contributions.}
\begin{itemize}
\item We introduce a training-free calibration method based on a
closed-form aggregate of token log-probabilities.
\item We prove that the resulting scores are monotonic in the log-odds
of correctness under a mild monotonicity assumption on the model's
log-probability distribution.
\item We evaluate across seven QA benchmarks and six models,
showing a 48\% average reduction in ECE with no accuracy loss.
\end{itemize}
\section{Related Work}
Classical temperature scaling and Platt scaling~\citep{guo2017calibration}
require held-out labeled data. Self-consistency~\citep{wang2023selfconsistency}
improves accuracy but leaves calibration uneven. Verbalized confidence
prompts are sensitive to prompt wording and model scale.
\section{Method}
Let $p_{i,t}$ be the probability the model assigns to the token at
position $t$ of response $i$, which has length $T_i$. We compute the
response confidence as
\begin{equation}
c_i = \exp\!\left( \frac{1}{T_i}\sum_{t=1}^{T_i} \log p_{i,t} \right) \cdot \sigma(\alpha).
\label{eq:conf}
\end{equation}
The scalar $\alpha$ is fit on a small held-out set with as few as 100
examples. When no calibration set is available we default to $\alpha = 0$,
which we show retains most of the gain.
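For intuition, the exponential of the mean log-probability in
Eq.~\ref{eq:conf} is the geometric mean of the token probabilities:
\begin{equation*}
\exp\!\left( \frac{1}{T_i}\sum_{t=1}^{T_i} \log p_{i,t} \right)
= \Bigl( \prod_{t=1}^{T_i} p_{i,t} \Bigr)^{1/T_i}.
\end{equation*}
As a purely illustrative example with hypothetical values (not taken
from our experiments), a two-token response with $p_{i,1} = 0.9$ and
$p_{i,2} = 0.4$ has geometric mean $\sqrt{0.9 \cdot 0.4} = 0.6$, which is
below the arithmetic mean of $0.65$ because a single low-probability
token is penalized more heavily. Assuming $\sigma$ denotes the logistic
sigmoid, the default $\alpha = 0$ gives $\sigma(0) = 0.5$ and hence
$c_i = 0.6 \cdot 0.5 = 0.3$.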
\subsection{Theoretical Justification}
Under a mild monotonicity assumption on the model's log-probability
distribution, the aggregate in Eq.~\ref{eq:conf} is monotonic in the
log-odds of correctness. A full proof appears in Appendix~\ref{app:proof}.
\section{Experiments}
\subsection{Setup}
We evaluate on Natural Questions, TriviaQA, StrategyQA, 2WikiMQA,
HotpotQA, WebQuestions, and MMLU. The six models are Llama-3 (8B, 70B),
Mistral (7B), Mixtral (8$\times$7B), and two proprietary APIs.
Table~\ref{tab:main} reports results averaged over all benchmarks and
models.
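ECE is commonly estimated by binning responses by confidence. A sketch
of that standard estimator, with the number of bins $M$ left as an
unspecified parameter, partitions the $n$ evaluated responses into
confidence bins $B_1, \dots, B_M$ and computes
\begin{equation*}
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,
\bigl|\, \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \,\bigr|,
\end{equation*}
where $\mathrm{acc}(B_m)$ is the fraction of correct responses in
$B_m$ and $\mathrm{conf}(B_m)$ is their mean confidence. A perfectly
calibrated model has $\mathrm{ECE} = 0$.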
\begin{table}[t]
\centering
\small
\begin{tabular}{lcc}
\toprule
Method & ECE $\downarrow$ & Acc.\ (\%) $\uparrow$ \\
\midrule
Greedy & 0.18 & 62.4 \\
Temperature scaling & 0.12 & 62.4 \\
Self-consistency & 0.11 & 64.1 \\
\textbf{Ours} & \textbf{0.06} & \textbf{64.3} \\
\bottomrule
\end{tabular}
\caption{ECE and accuracy (\%), averaged across seven QA benchmarks and six models. Best results are in bold.}
\label{tab:main}
\end{table}
\subsection{Analysis}
Token-level entropy alone correlates only weakly with correctness,
which indicates that the geometric-mean aggregation in
Eq.~\ref{eq:conf} matters. See Appendix~\ref{app:analysis} for
per-model breakdowns.
\section{Conclusion}
A simple, training-free aggregation of token log-probabilities
substantially improves LLM calibration across models, domains, and
task formats.
\section*{Limitations}
Our method assumes access to per-token log-probabilities. Some deployed
APIs do not expose them, which precludes applying our method. We do not
evaluate on free-form generation tasks where \emph{correctness} is not
well-defined. Fitting $\alpha$ on a calibration set, even a small one,
introduces mild domain sensitivity.
\section*{Ethics Statement}
Improved confidence calibration can support safer LLM deployment by
allowing systems to defer to humans on uncertain responses. However,
well-calibrated confidence may also be misused to justify over-reliance
on LLM outputs. We emphasize that calibration is a necessary but not
sufficient condition for safe deployment.
\section*{Acknowledgments}
We thank the anonymous ACL reviewers and members of Example Research
Labs for valuable feedback.
\bibliography{refs}
\bibliographystyle{acl_natbib}
\appendix
\section{Proof Details}
\label{app:proof}
Complete proof of monotonicity under assumption (M1), extending the
sketch in Section 3.
\section{Additional Analyses}
\label{app:analysis}
Full per-model, per-benchmark ECE breakdowns and learning curves.
\end{document}
