Authors
Juergen Luettin, Gerasimos Potamianos, Chalapathy Neti
Publication date
2001/5/7
Conference
2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221)
Volume
1
Pages
169-172
Publisher
IEEE
Description
Addresses the problem of audio-visual information fusion for highly robust speech recognition. We investigate methods that make different assumptions about asynchrony and conditional dependence across streams, and we propose a technique based on composite HMMs that can account for stream asynchrony and different levels of information integration. We show how these models can be trained jointly by maximum likelihood estimation. Experiments on a speaker-independent large-vocabulary continuous speech recognition task, comparing different integration methods, show that the best performance is obtained with asynchronous stream integration. At an 8.5 dB SNR with additive "babble" speech noise, this system reduces the error rate by 27% relative over audio-only models and by 12% relative over traditional audio-visual models using concatenative feature fusion.
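To make the contrast concrete, here is a minimal sketch of the two scoring schemes compared in the abstract; the notation (stream exponents lambda_A, lambda_V and per-stream observations o^A, o^V) is assumed for illustration and is not quoted from the paper. In state-synchronous fusion both streams share a single state sequence, whereas a composite (product) HMM lets each stream occupy its own state within a unit, re-synchronizing only at unit boundaries.

    % State-synchronous multi-stream scoring: one state j emits both streams,
    % with stream exponents commonly constrained to lambda_A + lambda_V = 1
    % (a convention assumed here):
    \[
      b_j(\mathbf{o}_t) = \bigl[ b_j^{A}(\mathbf{o}_t^{A}) \bigr]^{\lambda_A}
                          \bigl[ b_j^{V}(\mathbf{o}_t^{V}) \bigr]^{\lambda_V}
    \]
    % Composite (product) HMM: states are audio-visual pairs (j, k), so the
    % two streams may sit in different states between unit boundaries:
    \[
      b_{(j,k)}(\mathbf{o}_t) = \bigl[ b_j^{A}(\mathbf{o}_t^{A}) \bigr]^{\lambda_A}
                                \bigl[ b_k^{V}(\mathbf{o}_t^{V}) \bigr]^{\lambda_V}
    \]

Concatenative feature fusion, the baseline mentioned above, instead stacks the audio and visual feature vectors into one observation scored by a single-stream HMM, which forces frame-level synchrony between the modalities.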
Total citations
[Yearly citation histogram, 2001–2024; per-year counts not recoverable]
Scholar articles
J Luettin, G Potamianos, C Neti - 2001 IEEE International Conference on Acoustics …, 2001