NAACL

NAACL long paper using the shared ACL style. Double-column, 8-page body + unlimited references, Limitations and Ethics Statement sections required by NAACL since 2022. Works for Findings with the same layout.

Category

Conference

License

Free to use (MIT)

File

naacl/main.tex

\documentclass[11pt]{article}

\usepackage[]{acl}

\usepackage{times}
\usepackage{latexsym}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{microtype}
\usepackage{inconsolata}
\usepackage{graphicx}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{natbib}
\usepackage{multirow}
\usepackage{url}

\title{Cross-Lingual Transfer of Discourse Structure\\
       via Parallel Corpus Mining}

\author{First Last \\
  University of Example \\
  \texttt{[email protected]} \\\And
  Jane Doe \\
  University of Example \\
  \texttt{[email protected]} \\}

\begin{document}
\maketitle

\begin{abstract}
We present a method for transferring discourse parsers across languages
by projecting structure from high-resource (English) to low-resource
languages using noisy parallel corpora. Our approach word-aligns
sentence pairs with an off-the-shelf aligner, projects discourse
relations across the
alignment, filters by confidence, and trains a target-language parser
on the resulting silver trees. On five target languages spanning three
language families, we achieve an average 6.1 F1 improvement over
zero-shot transfer without using any target-language discourse
annotation. We release the projected silver corpora for all five target
languages.
\end{abstract}

\section{Introduction}
Discourse parsing resources are scarce outside English. RST-DT and PDTB
remain the primary training corpora. For most world languages, no
annotated discourse treebank exists, and creating one is prohibitively
expensive.

Parallel corpora are abundant but noisy. We show how to exploit them via
alignment-aware structure projection, producing silver training data for
target-language parsers that substantially outperform zero-shot
cross-lingual transfer.

\paragraph{Contributions.} (1) A projection-and-filter pipeline for
cross-lingual discourse parser transfer; (2) empirical results on five
typologically diverse target languages; (3) open release of the silver
corpora.

\section{Related Work}
Our work builds on three lines of prior research: cross-lingual
parsing~\citep{mcdonald2011parser}, RST parsing, and annotation
projection for dependency parsing and semantic role labeling.

\section{Method}
\subsection{Projection}
For each source--target sentence pair, we induce a word alignment with
an off-the-shelf aligner (awesome-align). The source discourse tree is
then projected node by node through the alignment.
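
\paragraph{Illustrative formalization.} As a minimal sketch, and under
notation that we introduce here purely for illustration, let
$A \subseteq \{1,\dots,m\} \times \{1,\dots,n\}$ denote the set of
aligned word-index pairs between a source sentence of length $m$ and a
target sentence of length $n$. One common span-projection heuristic,
which we assume here rather than claim as the exact rule, maps a source
span $[i,j]$ to the smallest target span that covers all of its aligned
words:
\begin{equation*}
\pi([i,j]) = \Bigl[\, \min_{(s,t) \in A,\; i \le s \le j} t,\;\;
                     \max_{(s,t) \in A,\; i \le s \le j} t \,\Bigr],
\end{equation*}
which is defined only when at least one word in $[i,j]$ is aligned.
For example, with $A = \{(1,2),(2,1),(3,4)\}$, the source span $[1,2]$
projects to the target span $[1,2]$, and $[2,3]$ projects to $[1,4]$.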

\subsection{Filtering}
We filter projected trees by alignment confidence and constituent
coverage. Trees that fall below the confidence threshold are discarded,
leaving a silver corpus roughly 60\% of the size of the raw projections.
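
\paragraph{Illustrative criterion.} The filter can be sketched as
follows; the symbols and the exact form of both scores are assumptions
made for illustration, and the threshold values are not specified here.
Let $A_T$ be the set of alignment links used to project a tree $T$, let
$p(a) \in [0,1]$ be the aligner's score for link $a$, let $N_T$ be the
set of source constituents of $T$, and let $N_T^{+} \subseteq N_T$ be
those constituents whose projection is non-empty:
\begin{align*}
\mathrm{conf}(T) &= \frac{1}{|A_T|} \sum_{a \in A_T} p(a), &
\mathrm{cov}(T)  &= \frac{|N_T^{+}|}{|N_T|}.
\end{align*}
Under this sketch, a projected tree is kept only if
$\mathrm{conf}(T) \ge \tau_{\mathrm{conf}}$ and
$\mathrm{cov}(T) \ge \tau_{\mathrm{cov}}$, where both thresholds are
hypothetical hyperparameters.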

\subsection{Training}
We train a target-language parser on the silver trees, using the same
architecture as our English reference parser and initializing from XLM-R.

\section{Experiments}
Table~\ref{tab:main} compares our method with zero-shot transfer and a
translation-based baseline (+Translate) on the five target languages.
\begin{table}[t]
\centering
\small
\begin{tabular}{lccccc}
\toprule
Target & De & Es & Ru & Zh & Ar \\
\midrule
Zero-shot    & 52.3 & 58.1 & 44.6 & 41.8 & 39.2 \\
+Translate   & 55.7 & 60.4 & 47.1 & 44.3 & 41.9 \\
\textbf{Ours}& \textbf{58.8} & \textbf{63.7} & \textbf{51.2} & \textbf{47.6} & \textbf{45.1} \\
\bottomrule
\end{tabular}
\caption{Micro-F1 on RST-DT-style discourse trees across target languages.}
\label{tab:main}
\end{table}
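
\paragraph{Metric.} For reference, micro-averaged F1 pools counts over
all test documents before computing precision and recall. Assuming, as
is standard but not stated in the caption, that predicted and gold
trees are compared as sets of labeled spans $S_d$ and $G_d$ for each
document $d$:
\begin{align*}
P &= \frac{\sum_d |S_d \cap G_d|}{\sum_d |S_d|}, &
R &= \frac{\sum_d |S_d \cap G_d|}{\sum_d |G_d|}, \\
F_1 &= \frac{2PR}{P + R}.
\end{align*}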

\subsection{Analysis}
Alignment quality is the dominant factor in projection success.
Sentence-pair filtering matters more than the choice of downstream
parser architecture, suggesting that cleaning the silver data pays off
more than model tuning.

\section{Conclusion}
Noisy parallel corpora are a practical resource for cross-lingual
discourse parsing. We hope the release of silver corpora for five
languages will seed further work.

\section*{Limitations}
Absolute scores on the morphologically rich languages in our sample
(Ar, Ru) remain below those on De and Es (Table~\ref{tab:main}), which
we attribute in part to alignment errors on morphologically complex
word forms. We do not address scope-of-attachment ambiguities, which
remain open. Our evaluation relies on RST-DT-style trees, which may
not perfectly capture discourse structure in non-Indo-European languages.

\section*{Ethics Statement}
The parallel corpora we use (OPUS, ParaCrawl) are publicly released.
Discourse parsing itself is a low-risk NLP task; we see no direct ethical
concerns. However, downstream NLP systems built on discourse parsers
may inherit biases from the underlying parallel corpora.

\section*{Acknowledgments}
We thank the NAACL reviewers and collaborators at the University of
Example.

\bibliography{refs}
\bibliographystyle{acl_natbib}

\end{document}