Where can I find free LaTeX templates?

Bibby offers 200+ free LaTeX templates at trybibby.com/latex-templates — conference papers, lab reports, project reports, resumes, theses, and more. Preview and edit in your browser.

What are the best templates for LaTeX?

For research papers use NeurIPS or ICML templates; for student work use lab report or project report templates; for job applications use academic CV or resume templates. All are free on Bibby.

Can I download LaTeX templates for free?

Yes. Every template on Bibby is free to preview, download as .tex, and edit online. No installation or credit card required.

ACL

Category

Conference

ACL conference long-paper template using the official acl style. The layout is two-column with an 8-page body plus unlimited references, and citations follow the ACL Anthology format. It includes the mandatory Limitations section and an ethics statement per ACL 2023+ policy. The same template works for Findings with minor tweaks.

License

Free to use (MIT)

File

acl/main.tex

\documentclass[11pt]{article}

% ACL 2024+ uses the shared `acl` style. Pass the `review` option for
% anonymous submission; no option for camera-ready. Pass `final` to hide
% line numbers in older versions of the style.
\usepackage[]{acl}
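% Sketch of the option variants described above. This assumes a recent
% acl.sty that declares `review`, `final`, and `preprint`; check which
% options your copy of the style file actually provides before relying
% on them. Keep exactly one \usepackage{acl} line active.
% \usepackage[review]{acl}    % anonymous submission with line numbers
% \usepackage[final]{acl}     % camera-ready; expected to match no option
% \usepackage[preprint]{acl}  % non-anonymous preprint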

\usepackage{times}
\usepackage{latexsym}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{microtype}
\usepackage{inconsolata}
\usepackage{graphicx}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{natbib}
\usepackage{multirow}
\usepackage{url}

% If the title and author information does not fit in the allocated area,
% enlarge the title box, e.g. \setlength\titlebox{5cm} (or larger).
\title{Calibrated Confidence Estimation for\\
       Large Language Model Responses}

\author{First Last \\
  University of Example \\
  \texttt{email@domain} \\\And
  Jane Doe \\
  Example Research Labs \\
  \texttt{email@domain} \\\AND
  John Smith \\
  University of Example \\
  \texttt{email@domain} \\}

\begin{document}
\maketitle

\begin{abstract}
Large language models produce fluent but often overconfident answers,
limiting their reliability for deployment in high-stakes domains. We
introduce a lightweight calibration method that computes response-level
confidence from internal token statistics, without any additional
training or calibration data. Across seven QA benchmarks and six model
families spanning 7B--70B parameters, our method reduces expected
calibration error (ECE) by 48\% on average relative to temperature
scaling, while preserving task accuracy. Unlike learned approaches, our
method requires no held-out labeled data and generalizes zero-shot
across domains.
\end{abstract}

\section{Introduction}
Reliable confidence estimates matter for safe deployment of large
language models. Existing calibration approaches either require
substantial labeled data~\citep{guo2017calibration} or degrade
downstream accuracy as a side effect of confidence adjustment.

In this work we show that an effective calibration signal is already
present in the token-level probabilities produced by any LLM:
responses that are internally consistent in their probability
assignments tend to be factually correct, while responses with
high-entropy or bimodal assignments tend to be incorrect.

\paragraph{Contributions.}
\begin{itemize}
\item We introduce a training-free calibration method based on a
  closed-form aggregate of token log-probabilities.
\item We prove that the resulting scores are monotonic in the log-odds
  of correctness under a weak statistical assumption on the model.
\item We evaluate across seven QA benchmarks and six model families,
  showing a 48\% average reduction in ECE with no accuracy loss.
\end{itemize}

\section{Related Work}
Classical temperature scaling and Platt scaling~\citep{guo2017calibration}
require held-out labeled data. Self-consistency~\citep{wang2023selfconsistency}
improves accuracy but leaves calibration uneven. Verbalized confidence
prompts are sensitive to prompt wording and model scale.

\section{Method}
Let $p_{i,t}$ be the probability the model assigns to the token at
position $t$ of response $i$, which has length $T_i$. We compute the
response confidence as
\begin{equation}
  c_i = \exp\!\left( \frac{1}{T_i}\sum_{t=1}^{T_i} \log p_{i,t} \right) \cdot \sigma(\alpha).
  \label{eq:conf}
\end{equation}
The scalar $\alpha$ is an optional rescaling parameter. When a small
held-out set is available (as few as 100 examples suffice), we fit
$\alpha$ on it; otherwise we default to $\alpha = 0$, which we show
retains most of the gain.
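
% Added worked illustration (a sketch, not part of the reported results):
% the three token probabilities are hypothetical, and \sigma is ASSUMED to
% be the logistic sigmoid, which the text above does not state explicitly.
As a quick sanity check of Eq.~\ref{eq:conf}, note that the exponential
of the mean log-probability is the geometric mean of the token
probabilities. For a hypothetical three-token response with
$p_{i,1} = 0.9$, $p_{i,2} = 0.8$, $p_{i,3} = 0.7$, and assuming $\sigma$
denotes the logistic sigmoid,
\begin{align*}
  \exp\!\left( \frac{1}{3}\sum_{t=1}^{3} \log p_{i,t} \right)
    &= (0.9 \cdot 0.8 \cdot 0.7)^{1/3} = 0.504^{1/3} \approx 0.796, \\
  c_i &\approx 0.796 \cdot \sigma(0) = 0.796 \cdot 0.5 \approx 0.398
    \quad \text{for the default } \alpha = 0.
\end{align*}
Under this reading, a single low-probability token lowers the score
multiplicatively, as expected of a geometric rather than arithmetic mean.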

\subsection{Theoretical Justification}
Under a mild monotonicity assumption on the model's log-probability
distribution, the aggregate in Eq.~\ref{eq:conf} is monotonic in the
log-odds of correctness. A full proof appears in Appendix~\ref{app:proof}.

\section{Experiments}
\subsection{Setup}
We evaluate on Natural Questions, TriviaQA, StrategyQA, 2WikiMQA,
HotpotQA, WebQuestions, and MMLU. Models include Llama-3 (8B, 70B),
Mistral (7B), Mixtral (8$\times$7B), and two proprietary APIs.

\begin{table}[t]
\centering
\small
\begin{tabular}{lcc}
\toprule
Method & ECE $\downarrow$ & Acc.\ $\uparrow$ \\
\midrule
Greedy                   & 0.18 & 62.4 \\
Temperature scaling      & 0.12 & 62.4 \\
Self-consistency         & 0.11 & 64.1 \\
\textbf{Ours}            & \textbf{0.06} & \textbf{64.3} \\
\bottomrule
\end{tabular}
\caption{Expected calibration error (ECE, lower is better) and accuracy
(\%, higher is better), averaged across seven QA benchmarks and six models.}
\label{tab:main}
\end{table}
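
% Added sketch: the binned estimator below is the commonly used definition
% of ECE. The bin count M, the bins B_m, and the symbols acc/conf are
% notational ASSUMPTIONS, since the text does not specify the binning.
For reference, a common way to estimate ECE partitions the $N$ evaluated
responses into $M$ confidence bins $B_1, \dots, B_M$ and averages the gap
between accuracy and mean confidence within each bin:
\begin{equation*}
  \mathrm{ECE} = \sum_{m=1}^{M} \frac{\lvert B_m \rvert}{N}
    \, \bigl\lvert \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr\rvert .
\end{equation*}
Reading the rounded entries of Table~\ref{tab:main}, the reduction
relative to temperature scaling is $(0.12 - 0.06)/0.12 = 50\%$, which is
consistent with the 48\% average reported in the abstract if the
difference is attributed to rounding of the table values.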

\subsection{Analysis}
Token-level entropy alone correlates only weakly with correctness; the
geometric-mean aggregation in Eq.~\ref{eq:conf} is what makes the signal
useful. See Appendix~\ref{app:analysis} for per-model breakdowns.

\section{Conclusion}
A simple, training-free aggregation of token log-probabilities
substantially improves LLM calibration across models, domains, and
task formats.

\section*{Limitations}
Our method assumes access to per-token log-probabilities. Some deployed
APIs do not expose these, precluding application. We do not evaluate
on free-form generation tasks where \emph{correctness} is not
well-defined. Our calibration set fitting, while small, still
introduces mild domain sensitivity.

\section*{Ethics Statement}
Improved confidence calibration can support safer LLM deployment by
allowing systems to defer to humans on uncertain responses. However,
well-calibrated confidence may also be misused to justify over-reliance
on LLM outputs. We emphasize that calibration is a necessary but not
sufficient condition for safe deployment.

\section*{Acknowledgments}
We thank the anonymous ACL reviewers and members of Example Research
Labs for valuable feedback.

\bibliography{refs}
\bibliographystyle{acl_natbib}

\appendix

\section{Proof Details}
\label{app:proof}
Complete proof of monotonicity under assumption (M1), extending the
sketch in Section 3.

\section{Additional Analyses}
\label{app:analysis}
Full per-model, per-benchmark ECE breakdowns and learning curves.

\end{document}
