Templates

Interspeech 2026

Preview

Interspeech 2026

Interspeech 2026 speech and language processing conference paper template

Category

Conference

License

Free to use (MIT)

File

interspeech-2026/main.tex

main.texRead-only preview
\documentclass[a4paper,10pt,twocolumn]{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{times}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage[margin=2cm]{geometry}
\usepackage{booktabs}
\usepackage{cite}

\title{Cross-Lingual End-to-End Speech Recognition with\\Multilingual Adapter Modules and Self-Supervised Pretraining}

\author{
  Yuki Tanaka$^1$, Maria Gonzalez$^2$, Raj Patel$^1$, Lena M\"uller$^3$ \\\\
  $^1$Department of Computer Science, University of Tokyo, Japan \\\\
  $^2$Institute for Language Technologies, Universitat Polit\`ecnica de Catalunya, Spain \\\\
  $^3$Max Planck Institute for Intelligent Systems, Germany \\\\
  \texttt{\{tanaka,patel\}@cs.u-tokyo.ac.jp, [email protected], [email protected]}
}

\date{}

\begin{document}

\maketitle

\begin{abstract}
We present a novel approach to cross-lingual automatic speech recognition (ASR) that leverages multilingual adapter modules integrated into a self-supervised pretrained speech encoder. Our method enables effective transfer learning across 12 typologically diverse languages while requiring only minimal language-specific parameters. By introducing lightweight adapter layers between frozen transformer blocks, we achieve an average word error rate (WER) reduction of 23.4\% compared to monolingual baselines and 11.7\% compared to standard multilingual fine-tuning on the CommonVoice and FLEURS benchmarks. We further demonstrate that our adapter architecture facilitates zero-shot transfer to unseen languages, achieving competitive performance on three held-out languages without any target-language training data. Analysis of the learned adapter representations reveals that the model captures both language-universal phonetic features and language-specific acoustic patterns across different layers of the network.
\end{abstract}

\noindent\textbf{Index Terms:} cross-lingual speech recognition, self-supervised learning, adapter modules, multilingual ASR, transfer learning

\section{Introduction}

End-to-end automatic speech recognition has made remarkable progress in recent years, driven by advances in neural network architectures and the availability of large-scale training data~\cite{baevski2020wav2vec}. However, building high-quality ASR systems for the world's approximately 7,000 languages remains a significant challenge, as most languages lack sufficient labeled speech data for training robust models.

Self-supervised pretraining on unlabeled speech has emerged as a promising approach to address this data scarcity problem~\cite{hsu2021hubert}. Models such as wav2vec 2.0, HuBERT, and WavLM learn powerful speech representations from raw audio, which can then be fine-tuned on downstream tasks with limited labeled data. While these pretrained models have shown impressive results in monolingual settings, their application to multilingual and cross-lingual scenarios introduces additional complexity.

A naive approach to multilingual ASR involves jointly fine-tuning a pretrained model on data from multiple languages. However, this strategy suffers from negative transfer between dissimilar languages and catastrophic forgetting of language-specific features~\cite{conneau2021unsupervised}. Furthermore, it requires retraining the entire model whenever a new language is added, making it impractical for scaling to many languages.

In this work, we propose a parameter-efficient approach to cross-lingual ASR that addresses these limitations through multilingual adapter modules. Our key contributions are:
\begin{itemize}
  \item We introduce a novel adapter architecture specifically designed for speech processing that captures both shared and language-specific acoustic patterns.
  \item We demonstrate that our approach achieves state-of-the-art results on 12 languages from the CommonVoice and FLEURS benchmarks while adding fewer than 2\% additional parameters per language.
  \item We show that our adapter representations enable effective zero-shot transfer to unseen languages, outperforming standard approaches by a significant margin.
  \item We provide a detailed analysis of what linguistic information is captured at different layers of the adapted model.
\end{itemize}

\section{Related Work}

\subsection{Self-Supervised Speech Representations}

The paradigm of self-supervised learning for speech has evolved rapidly. CPC~\cite{oord2018representation} introduced contrastive predictive coding for audio. wav2vec 2.0~\cite{baevski2020wav2vec} combined contrastive learning with quantized speech representations, achieving remarkable results with limited labeled data. HuBERT~\cite{hsu2021hubert} introduced an offline clustering step to generate pseudo-labels for masked prediction. More recently, WavLM extended this line of work by incorporating denoising objectives.

\subsection{Multilingual Speech Recognition}

Multilingual ASR has been studied extensively in both hybrid and end-to-end frameworks. Whisper demonstrated that scaling data and model size enables strong multilingual recognition. MMS extended this to over 1,000 languages using religious text recordings. However, these approaches require substantial computational resources for training and do not efficiently accommodate new languages.

\subsection{Adapter-Based Transfer Learning}

Adapter modules were originally proposed for efficient transfer learning in NLP~\cite{houlsby2019parameter}. They insert small bottleneck layers into pretrained transformers, allowing task-specific adaptation while keeping most parameters frozen. Recent work has explored adapters for speech tasks, including speaker verification and emotion recognition, but their application to cross-lingual ASR remains underexplored.

\section{Method}

\subsection{Model Architecture}

Our model builds upon a pretrained HuBERT-Large encoder consisting of a convolutional feature extractor followed by 24 transformer layers. We insert adapter modules after each transformer layer while keeping the pretrained parameters frozen during fine-tuning.

Each adapter module consists of a layer normalization step, a down-projection from dimension $d$ to bottleneck dimension $m$, a nonlinear activation function, and an up-projection back to dimension $d$:
\begin{equation}
  \text{Adapter}(\mathbf{h}) = \mathbf{h} + \mathbf{W}_{\text{up}} \cdot \sigma(\mathbf{W}_{\text{down}} \cdot \text{LN}(\mathbf{h}))
\end{equation}
where $\mathbf{h} \in \mathbb{R}^{T \times d}$ is the hidden representation, $\mathbf{W}_{\text{down}} \in \mathbb{R}^{d \times m}$, $\mathbf{W}_{\text{up}} \in \mathbb{R}^{m \times d}$, and $\sigma$ denotes the GELU activation.

\subsection{Language-Specific and Shared Adapters}

We decompose each adapter into a shared component and a language-specific component:
\begin{equation}
  \text{Adapter}_\ell(\mathbf{h}) = \text{Adapter}_{\text{shared}}(\mathbf{h}) + \alpha_\ell \cdot \text{Adapter}_{\ell\text{-spec}}(\mathbf{h})
\end{equation}
where $\ell$ indexes the language and $\alpha_\ell$ is a learned scalar gating parameter. The shared adapter captures universal phonetic features while language-specific adapters encode language-particular acoustic patterns.

\subsection{Training Procedure}

We train our model in two stages. First, we train the shared adapters on a combined multilingual dataset using CTC loss. Second, we initialize language-specific adapters from the shared adapters and fine-tune them on individual languages while keeping the shared adapters frozen:
\begin{equation}
  \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CTC}} + \lambda \|\boldsymbol{\theta}_{\text{spec}} - \boldsymbol{\theta}_{\text{shared}}\|_2^2
\end{equation}
where the regularization term prevents language-specific adapters from diverging too far from the shared initialization.

\section{Experimental Setup}

\subsection{Datasets}

We evaluate on 12 languages from CommonVoice 13.0 and FLEURS: English (en), Spanish (es), French (fr), German (de), Mandarin (zh), Japanese (ja), Arabic (ar), Hindi (hi), Swahili (sw), Turkish (tr), Finnish (fi), and Hungarian (hu). For zero-shot experiments, we hold out Korean (ko), Thai (th), and Vietnamese (vi).

\subsection{Baselines}

We compare against: (1) monolingual fine-tuning of HuBERT-Large, (2) joint multilingual fine-tuning, (3) standard adapter tuning without shared/specific decomposition, and (4) Whisper-Large-v3.

\subsection{Implementation Details}

We use a bottleneck dimension of $m=256$, a learning rate of $3 \times 10^{-4}$ with linear warmup over 10k steps, and train for 100k steps with a batch size of 32. All experiments are conducted on 4 NVIDIA A100 GPUs.

\section{Results}

\subsection{Main Results}

Table~\ref{tab:main} presents WER results across all 12 languages.

\begin{table}[t]
\centering
\caption{Word Error Rate (\%) on test sets. Best results in \textbf{bold}.}
\label{tab:main}
\small
\begin{tabular}{lcccc}
\toprule
\textbf{Lang} & \textbf{Mono.} & \textbf{Multi.} & \textbf{Adapter} & \textbf{Ours} \\
\midrule
en & 5.8 & 6.2 & 5.4 & \textbf{4.7} \\
es & 7.3 & 6.9 & 6.5 & \textbf{5.8} \\
fr & 9.1 & 8.4 & 7.9 & \textbf{7.1} \\
de & 8.7 & 8.1 & 7.6 & \textbf{6.8} \\
zh & 14.2 & 12.8 & 12.1 & \textbf{10.9} \\
ja & 12.6 & 11.3 & 10.8 & \textbf{9.7} \\
ar & 16.4 & 14.7 & 13.9 & \textbf{12.3} \\
hi & 18.9 & 16.2 & 15.4 & \textbf{13.8} \\
sw & 22.1 & 19.8 & 18.5 & \textbf{16.4} \\
tr & 11.5 & 10.3 & 9.8 & \textbf{8.9} \\
fi & 13.8 & 12.1 & 11.4 & \textbf{10.2} \\
hu & 14.1 & 12.7 & 11.9 & \textbf{10.6} \\
\midrule
Avg. & 12.9 & 11.6 & 10.9 & \textbf{9.8} \\
\bottomrule
\end{tabular}
\end{table}

Our approach achieves the lowest WER across all 12 languages, with an average relative improvement of 23.4\% over monolingual fine-tuning and 10.1\% over standard adapter tuning. The improvements are particularly pronounced for low-resource languages such as Swahili (25.8\% relative) and Hindi (27.0\% relative).

\subsection{Zero-Shot Transfer}

For zero-shot evaluation on held-out languages, we use only the shared adapters without any language-specific fine-tuning. Our model achieves WERs of 18.3\% (ko), 21.7\% (th), and 19.4\% (vi), compared to 24.1\%, 28.9\%, and 25.6\% for multilingual fine-tuning, representing an average relative improvement of 24.4\%.

\subsection{Ablation Study}

We conduct ablations on the adapter architecture. Removing the shared-specific decomposition increases average WER by 1.1\% absolute. Removing the regularization term increases it by 0.8\%. Using a smaller bottleneck ($m=64$) increases WER by 1.4\%, while a larger one ($m=512$) provides marginal improvement at significantly higher parameter cost.

\section{Analysis}

\subsection{Layer-Wise Representation Analysis}

Using centered kernel alignment (CKA), we analyze the similarity between shared and language-specific adapter representations at each layer. Lower layers (1--8) show high similarity across languages, suggesting they capture universal acoustic features. Middle layers (9--16) exhibit moderate divergence, corresponding to phoneme-level processing. Upper layers (17--24) show the greatest language-specific variation, consistent with their role in modeling language-specific phonotactic and morphological patterns.

\subsection{Parameter Efficiency}

Each language-specific adapter adds only 1.8M parameters (1.6\% of the base model), while the shared adapters contribute 1.8M parameters amortized across all languages. In total, our model uses 23.4M additional parameters for 12 languages, compared to 12$\times$315M for monolingual models.

\section{Conclusion}

We have presented a parameter-efficient approach to cross-lingual ASR using multilingual adapter modules with shared and language-specific components. Our method achieves state-of-the-art results on 12 languages while enabling effective zero-shot transfer to unseen languages. The decomposed adapter architecture provides an interpretable framework for understanding how multilingual speech representations are organized across transformer layers. Future work will explore scaling to more languages and integrating text-based language models for improved decoding.

\section{Acknowledgements}

This work was supported by JST CREST Grant Number JPMJCR2015 and the European Research Council under Grant Agreement No. 819458.

\bibliographystyle{IEEEtran}
\begin{thebibliography}{9}
\bibitem{baevski2020wav2vec} A.~Baevski, Y.~Zhou, A.~Mohamed, and M.~Auli, ``wav2vec 2.0: A framework for self-supervised learning of speech representations,'' in \emph{Proc. NeurIPS}, 2020.
\bibitem{hsu2021hubert} W.-N.~Hsu, B.~Bolte, Y.-H.~H.~Tsai, K.~Lakhotia, R.~Salakhutdinov, and A.~Mohamed, ``HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,'' \emph{IEEE/ACM TASLP}, vol.~29, pp.~3451--3460, 2021.
\bibitem{conneau2021unsupervised} A.~Conneau, A.~Baevski, R.~Collobert, A.~Mohamed, and M.~Auli, ``Unsupervised cross-lingual representation learning for speech recognition,'' in \emph{Proc. Interspeech}, 2021.
\bibitem{oord2018representation} A.~van~den~Oord, Y.~Li, and O.~Vinyals, ``Representation learning with contrastive predictive coding,'' \emph{arXiv preprint arXiv:1807.03748}, 2018.
\bibitem{houlsby2019parameter} N.~Houlsby, A.~Giurgiu, S.~Jastrzebski, B.~Morber, O.~de~Brébisson, A.~Bengio, and Y.~Bengio, ``Parameter-efficient transfer learning for NLP,'' in \emph{Proc. ICML}, 2019.
\end{thebibliography}

\end{document}
Bibby Mascot

PDF Preview

Create an account to compile and preview

Interspeech 2026 LaTeX Template | Free Download & Preview - Bibby