O. V. Goncharova, “Deep learning for an automatic transcription system development”, Proceedings of ISP RAS, 2025, Volume 37, Issue 1, Pages 145

Deep learning for an automatic transcription system development

O. V. Goncharova^abc

^a Pyatigorsk State University
^b Peoples' Friendship University of Russia named after Patrice Lumumba
^c Ivannikov Institute for System Programming of the RAS

Abstract: This paper presents a deep neural network architecture for automatic phoneme recognition in speech signals. The proposed model combines convolutional and recurrent layers with an attention mechanism enriched with reference values of vowel formant frequencies. This combination allows the model to extract both the local and the global acoustic features needed for accurate phoneme sequence recognition. Particular attention is paid to the imbalance of phoneme frequencies in the training dataset and to ways of mitigating it, such as data augmentation and a weighted loss function. The reported results demonstrate the viability of the proposed approach but also indicate that the model needs further refinement to achieve higher accuracy and recall in the speech recognition task.
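
As an illustration of the weighted loss function mentioned in the abstract, the following minimal sketch shows one common way to counter class imbalance: inverse-frequency class weights applied to a cross-entropy loss. The abstract does not specify the weighting scheme used in the paper, so the formula N / (K * count), the function names `inverse_frequency_weights` and `weighted_cross_entropy`, and the toy phoneme labels are all assumptions made for illustration only.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight each class by N / (K * count).

    N is the number of samples and K the number of distinct classes,
    so rare phonemes receive larger weights than frequent ones.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(target), normalized by the sum of weights.

    probs   -- list of dicts mapping class -> predicted probability
    targets -- list of true class labels, same length as probs
    weights -- dict mapping class -> weight
    """
    total = 0.0
    weight_sum = 0.0
    for p, t in zip(probs, targets):
        w = weights[t]
        total += -w * math.log(p[t])
        weight_sum += w
    return total / weight_sum


# Toy, made-up label sequence (not data from the paper).
labels = ["a"] * 6 + ["t"] * 3 + ["zh"]
weights = inverse_frequency_weights(labels)
```

With these toy labels the rare class `"zh"` receives the largest weight and the frequent class `"a"` the smallest, so errors on rare phonemes contribute more to the loss. In a deep learning framework the same idea is usually expressed by passing a per-class weight vector to the built-in cross-entropy loss.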

Keywords: Automatic speech recognition, phonetic transcription, deep neural networks, formants

DOI: 10.15514/ISPRAS-2025-37(1)-9