
Zap. Nauchn. Sem. POMI, 2025 Volume 546, Pages 223–245 (Mi znsl7639)

Clarispeech: LLM-enhanced speech recognition post-correction

A. Iudin (a,b), M. Skripkin (b,c), O. Y. Rogov (b,c), D. Korzh (b,c)

a Moscow Technical University of Communications and Informatics
b AIRI
c Skoltech

Abstract: Recent advances in Automatic Speech Recognition (ASR) have made these systems widely applicable, including in virtual assistants and web-based interfaces. However, even cutting-edge ASR models often produce errors, particularly when adapting to new speech domains. Conventional solutions involve fine-tuning ASR models on target-domain data or integrating language models (LMs) to rescore predictions. However, joint fine-tuning of ASR and LM models can be unstable, demands substantial training data, and suffers from alignment issues. Using more sophisticated language models for shallow fusion, especially large language models (LLMs), is impractical, as it leads to significant computational overhead. In this paper, we address these challenges by focusing on post-transcription correction, using parameter-efficient fine-tuning of external language models while leaving the ASR system frozen. Our experiments show that this approach significantly improves accuracy and computational efficiency. Compared to the baseline ASR system, the ASR+LLM configuration reduces the word error rate from 12% to 10%, while increasing computational cost by less than 50%, despite an eightfold rise in the number of parameters.
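The pipeline described above can be sketched as follows: a frozen ASR model produces a hypothesis, an external LM rewrites it, and quality is measured by word error rate (WER), the metric quoted in the abstract. This is a minimal illustration, not the paper's implementation; the LM is stubbed here by a dictionary of domain-specific confusions, and all names are illustrative.

```python
# Sketch of post-transcription correction: the ASR system stays frozen and
# its hypotheses are rewritten by a separately (parameter-efficiently)
# fine-tuned language model. The LM is stubbed by a confusion dictionary.

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming table over word sequences.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def post_correct(hyp: str, lm_rewrite) -> str:
    """Apply the external LM to the frozen ASR output."""
    return lm_rewrite(hyp)

# Toy "LM": fixes a recurring domain-specific confusion of the frozen ASR.
confusions = {"nucular": "nuclear"}
lm = lambda s: " ".join(confusions.get(w, w) for w in s.split())

ref = "the nuclear plant is safe"
hyp = "the nucular plant is safe"   # frozen ASR hypothesis
print(wer(ref, hyp))                    # WER before correction
print(wer(ref, post_correct(hyp, lm)))  # WER after correction
```

In the paper's setting the rewrite function would be an LLM adapted with parameter-efficient fine-tuning (the specific PEFT method is not named in this abstract), while the WER computation is standard.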

Key words and phrases: Deep Learning, Automatic Speech Recognition, Large Language Models, Natural Language Processing, Artificial Intelligence, Speech Recognition Methods.

UDC: 004.89

Received: 28.02.2025

Language: English



© Steklov Math. Inst. of RAS, 2026