Some of us might consider ourselves expert lip readers thanks to years of experience, but most of the time we can only pick up a word or two here and there.
But, in a way, that’s half the fun of it, right?
Researchers from the University of Oxford’s Computer Science Department have developed lip-reading software called LipNet that can predict what a person is saying by analyzing the movement of their lips.
In fact, LipNet proved 93.4 percent accurate in testing, which is extremely impressive compared with the 52 percent accuracy of an experienced human lip reader.
Here’s a more in-depth explanation from the researchers:
Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.
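To give a feel for the last piece the researchers mention, connectionist temporal classification (CTC), here is a rough sketch of how per-frame predictions can be turned into text. It is not LipNet's code: the alphabet, the blank index, and the helper names are assumptions made for illustration, and greedy decoding is just one simple option. The idea is that the network scores every character (plus a special "blank") for each video frame; the decoder picks the best symbol per frame, merges consecutive repeats, and drops the blanks.

```python
from typing import List, Optional, Sequence

# Assumption: index 0 is the CTC blank, shown here as "_" for readability.
ALPHABET = "_abcdefghijklmnopqrstuvwxyz "
BLANK = 0


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]]) -> str:
    """Collapse per-frame scores into text: argmax, merge repeats, drop blanks."""
    decoded: List[str] = []
    previous: Optional[int] = None
    for scores in frame_scores:
        # Pick the highest-scoring symbol for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit a character only when it differs from the previous frame's
        # symbol and is not the blank.
        if best != previous and best != BLANK:
            decoded.append(ALPHABET[best])
        previous = best
    return "".join(decoded)


def one_hot(symbol: str) -> List[float]:
    """Hypothetical helper: fake a confident network output for one frame."""
    scores = [0.0] * len(ALPHABET)
    scores[ALPHABET.index(symbol)] = 1.0
    return scores


# Seven frames in which the speaker "holds" each letter for a while.
frames = [one_hot(s) for s in "hh_ii__"]
print(ctc_greedy_decode(frames))
```

The blank is what lets the model spell doubled letters: frames reading "aa_a" decode to "aa", while "aaa" with no blank in between collapses to a single "a". That is how a variable number of video frames can map to a shorter piece of text without anyone labeling which frame belongs to which letter.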