Disney Research backs UEA revolution of animated characters’ speech
New research from the University of East Anglia (UK) could revolutionise the way that animated characters deliver their lines.
Animating the speech of family-favourite characters has been time-consuming and costly, but computer programmers have identified a way of creating natural-looking animated speech that can be generated in real time as voice actors deliver their lines.
The work, unveiled at a high-profile expo in LA, is a collaboration which includes UEA, Disney Research, Caltech and Carnegie Mellon University. Researchers show how a ‘deep learning’ approach – using artificial neural networks – can generate natural-looking real-time animated speech.
As well as automatically generating lip sync for English-speaking actors, the new software also animates singing and can be adapted for foreign languages.
The online video games industry could also benefit from the research – with characters delivering their lines on the fly with much more realism than is currently possible – and it could also be used to animate avatars in virtual reality.
A central focus for the work has been to develop software which can be seamlessly integrated into existing production pipelines, and which is easy to edit.
Lead researcher Dr Sarah Taylor, Associate Research Scientist, Disney Research Pittsburgh, said: “Realistic speech animation is essential for effective character animation. Done badly, it can be distracting and lead to a box office flop. Doing it well, however, is both time-consuming and costly as it has to be manually produced by a skilled animator.
“Our goal is to automatically generate production-quality animated speech for any style of character, given only audio speech as an input.”
The team’s approach involves ‘training’ a computer to take spoken words from a voice actor, predict the mouth shape needed, and animate a character to lip sync the speech.
This is done by first recording audio and video of a reference speaker reciting a collection of more than 2500 phonetically diverse sentences. Their face is tracked to create a ‘reference face’ animation model. The audio is then transcribed into speech sounds using off-the-shelf speech recognition software.
This collected information can then be used to generate a model that is able to animate the reference face from a frame-by-frame sequence of phonemes. This animation can then be transferred to a CG character in real time.
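As a rough illustration of what a ‘frame-by-frame sequence of phonemes’ might look like, the Python sketch below turns a timed phonetic transcription into one phoneme label per video frame. It is a minimal sketch and not the team’s software: the segment format, the function name `phonemes_to_frames`, the frame rate and the `sil` label for silence are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical transcript format: (phoneme, start_seconds, end_seconds).
Segment = Tuple[str, float, float]


def phonemes_to_frames(
    segments: List[Segment],
    fps: float = 30.0,      # assumed video frame rate
    silence: str = "sil",   # assumed label for frames with no speech
) -> List[str]:
    """Return one phoneme label for every video frame."""
    if not segments:
        return []

    # The clip length is taken from the end of the last phoneme.
    duration = max(end for _, _, end in segments)
    n_frames = int(round(duration * fps))

    # Start with silence everywhere, then fill in each phoneme's frames.
    frames = [silence] * n_frames
    for phoneme, start, end in segments:
        first = int(round(start * fps))
        last = min(int(round(end * fps)), n_frames)
        for i in range(first, last):
            frames[i] = phoneme
    return frames


if __name__ == "__main__":
    # Made-up timings for illustration only.
    transcript = [("HH", 0.0, 0.1), ("AY", 0.1, 0.3)]
    print(phonemes_to_frames(transcript, fps=10))
```

A per-frame label sequence of this kind is the sort of input a trained model could map to mouth shapes for each frame. How the published system represents phonemes internally is not described here.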
‘Training’ the model takes just a couple of hours. Dr Taylor said: “What we are doing is translating audio speech into a phonetic representation, and then into realistic animated speech.”
The method has so far been tested against sentences from a range of different speakers.