Patch-Based Representation of Visual Speech
Lucey, P. and Sridharan, S.
Visual information from a speaker's mouth region is
known to improve automatic speech recognition robustness, especially in the presence of acoustic noise.
To date, the vast majority of work in this field has
viewed these visual features in a holistic manner,
which may not take into account the various changes
that occur within articulation (the process of changing the shape of the vocal tract using the articulators, i.e., the lips and jaw). Motivated by the work being conducted
in fields of audio-visual automatic speech recognition
(AVASR) and face recognition using articulatory features (AFs) and patches, respectively, we present a proof-of-concept paper in which the mouth region is represented as an ensemble of image patches. Our experiments show that by dealing with the mouth region
in this manner, we are able to extract more speech
information from the visual domain. For the task of
visual-only, speaker-independent isolated digit recognition, we achieved a relative word error rate improvement of more than 23% on the CUAVE audio-visual corpus.
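
To make the core idea concrete, the sketch below shows one plausible way to decompose a mouth region of interest (ROI) into a grid of patches. This is only an illustration of the patch-based representation named in the abstract, not the authors' implementation: the `extract_patches` helper, the 2x3 grid, and the ROI size are all assumptions, and the per-patch feature transform is left as a comment.

```python
import numpy as np

def extract_patches(mouth_roi, grid=(2, 3)):
    """Split a mouth ROI (H x W grayscale image) into a grid of
    non-overlapping patches, returned as a list of 2-D arrays.
    The 2x3 grid is an illustrative choice, not the paper's setting."""
    h, w = mouth_roi.shape
    rows, cols = grid
    ph, pw = h // rows, w // cols
    patches = []
    for r in range(rows):
        for c in range(cols):
            patches.append(mouth_roi[r * ph:(r + 1) * ph,
                                     c * pw:(c + 1) * pw])
    return patches

# Usage: each patch would then be fed to its own feature extractor
# (e.g. an image transform such as the DCT), and the per-patch
# features treated as an ensemble, rather than transforming the
# whole ROI holistically.
roi = np.random.rand(32, 48)           # stand-in for a tracked mouth ROI
patches = extract_patches(roi)
print(len(patches), patches[0].shape)  # 6 patches of 16x16 pixels
```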
Cite as: Lucey, P. and Sridharan, S. (2006). Patch-Based Representation of Visual Speech. In Proc. HCSNet Workshop on the Use of Vision in Human-Computer Interaction (VisHCI 2006), Canberra, Australia. CRPIT, 56. Goecke, R., Robles-Kelly, A. and Caelli, T., Eds. ACS. 79-85.