Improving audio-driven visual dubbing solutions using self-supervised generative adversarial networks

Ranchod, Mayur
2024-10-23
2023-09

Ranchod, Mayur. (2023). Improving audio-driven visual dubbing solutions using self-supervised generative adversarial networks. [Master's dissertation, University of the Witwatersrand, Johannesburg]. https://hdl.handle.net/10539/41906

A dissertation submitted in fulfilment of the requirements for the degree of Master of Science, to the Faculty of Science, School of Computer Science & Applied Mathematics, University of the Witwatersrand, Johannesburg, 2023.

Audio-driven visual dubbing (ADVD) is the process of accepting a talking-face video, along with a dubbing audio segment, as inputs and producing a dubbed video such that the speaker appears to be uttering the dubbing audio. ADVD aims to address the language barrier inherent in the consumption of video-based content caused by the various languages in which videos may be presented. Specifically, a video may only be consumed by the audience that is familiar with the spoken language. Traditional solutions, such as subtitles and audio-dubbing, hinder the viewer’s experience by either obstructing the on-screen content or introducing an unpleasant discrepancy between the speaker’s mouth movements and the input dubbing audio, respectively. In contrast, ADVD strives to achieve a natural viewing experience by synchronizing the speaker’s mouth movements with the dubbing audio. A comprehensive survey of several ADVD solutions revealed that most existing solutions achieve satisfactory visual quality and lip-sync accuracy but are limited to low-resolution videos with frontal or near frontal faces. Since this is in sharp contrast to real-world videos, which are high-resolution and contain arbitrary head poses, we present one of the first ADVD solutions trained with high-resolution data and also introduce the first pose-invariant ADVD solution. Our results show that the presented solution achieves superior visual quality while also achieving high measures of lip-sync accuracy, consequently enabling the solution to achieve significantly improved results when applied to real-world videos.

en

©2023 University of the Witwatersrand, Johannesburg. All rights reserved. The copyright in this work vests in the University of the Witwatersrand, Johannesburg. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of University of the Witwatersrand, Johannesburg.

Audio-driven visual dubbing
Self-supervised
Generative adversarial networks
UCTD
SDG-9: Industry, innovation and infrastructure

Dissertation
University of the Witwatersrand, Johannesburg