Computer Optics, 2022 Volume 46, Issue 6, Pages 955–962 (Mi co1091)

IMAGE PROCESSING, PATTERN RECOGNITION

Method for visual analysis of driver's face for automatic lip-reading in the wild

A. A. Axyonov, D. A. Ryumin, A. M. Kashevnik, D. V. Ivanko, A. A. Karpov

St. Petersburg Federal Research Center of the Russian Academy of Sciences

Abstract: The paper proposes a method of visual analysis for automatic speech recognition of a vehicle driver. Speech recognition in acoustically noisy conditions is one of the big challenges of artificial intelligence. The problem of effective automatic lip-reading in a vehicle environment has not yet been solved because of various kinds of interference (frequent turns of the driver's head, vibration, varying lighting conditions, etc.). The problem is aggravated by the lack of available databases on this topic. MediaPipe Face Mesh is used to find and extract the region of interest (ROI). We have developed an End-to-End neural network architecture for the analysis of visual speech. Visual features are extracted from a single image using a convolutional neural network (CNN) in conjunction with a fully connected layer. The extracted features are fed to a Long Short-Term Memory (LSTM) neural network. Because of the small amount of training data, we propose applying Transfer Learning. Experiments on visual analysis and speech recognition show considerable promise for solving the problem of automatic lip-reading. The experiments were performed on the in-house multi-speaker audio-visual dataset RUSAVIC. The maximum recognition accuracy for 62 commands is 64.09%. The results can be used in various automatic speech recognition systems, especially in acoustically noisy conditions on the road (high speed, open windows or a sunroof in a vehicle, background music, poor noise insulation, etc.).
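
The abstract does not say how the ROI is cropped once the face landmarks are found. As a minimal sketch only, and not the authors' procedure, the step could look like the following. It assumes the mouth landmarks arrive as normalized (x, y) pairs in [0, 1], which is the form MediaPipe Face Mesh reports. The function name `mouth_roi` and the `margin` parameter are hypothetical and do not come from the paper.

```python
from typing import Iterable, Tuple


def mouth_roi(
    landmarks: Iterable[Tuple[float, float]],
    image_width: int,
    image_height: int,
    margin: float = 0.25,
) -> Tuple[int, int, int, int]:
    """Return a padded pixel box (left, top, right, bottom) around landmarks.

    `landmarks` are normalized (x, y) pairs in [0, 1]; `margin` is an
    illustrative padding factor relative to the box size (an assumption,
    not a value taken from the paper).
    """
    points = list(landmarks)
    if not points:
        raise ValueError("at least one landmark is required")

    # Convert normalized coordinates to pixel coordinates.
    xs = [x * image_width for x, _ in points]
    ys = [y * image_height for _, y in points]

    # Pad the tight bounding box so the crop tolerates small head movements.
    pad_x = (max(xs) - min(xs)) * margin
    pad_y = (max(ys) - min(ys)) * margin

    # Clamp to the image so the crop never leaves the frame.
    left = max(0, int(min(xs) - pad_x))
    top = max(0, int(min(ys) - pad_y))
    right = min(image_width, int(max(xs) + pad_x) + 1)
    bottom = min(image_height, int(max(ys) + pad_y) + 1)
    return left, top, right, bottom
```

The resulting box can be used to slice each video frame before it is passed to a feature extractor. Right and bottom are exclusive bounds, matching array slicing conventions.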

Keywords: vehicle, driver, visual speech recognition, automated lip-reading, machine learning, End-to-End, CNN, LSTM

Received: 25.12.2021
Accepted: 30.04.2022

DOI: 10.18287/2412-6179-CO-1092


