Improving audio-driven visual dubbing solutions using self-supervised generative adversarial networks
dc.contributor.author | Ranchod, Mayur
dc.contributor.supervisor | Klein, Richard
dc.date.accessioned | 2024-10-23T15:17:37Z
dc.date.available | 2024-10-23T15:17:37Z
dc.date.issued | 2023-09
dc.description | A dissertation submitted in fulfilment of the requirements for the degree of Master of Science to the Faculty of Science, School of Computer Science & Applied Mathematics, University of the Witwatersrand, Johannesburg, 2023.
dc.description.abstract | Audio-driven visual dubbing (ADVD) is the process of accepting a talking-face video and a dubbing audio segment as inputs and producing a dubbed video in which the speaker appears to be uttering the dubbing audio. ADVD aims to address the language barrier inherent in the consumption of video-based content, which arises because a video can only be consumed by an audience familiar with its spoken language. Traditional solutions hinder the viewer’s experience: subtitles obstruct the on-screen content, while audio dubbing introduces an unpleasant discrepancy between the speaker’s mouth movements and the dubbing audio. In contrast, ADVD strives to achieve a natural viewing experience by synchronizing the speaker’s mouth movements with the dubbing audio. A comprehensive survey of existing ADVD solutions revealed that most achieve satisfactory visual quality and lip-sync accuracy but are limited to low-resolution videos with frontal or near-frontal faces. Since this is in sharp contrast to real-world videos, which are high-resolution and contain arbitrary head poses, we present one of the first ADVD solutions trained on high-resolution data and introduce the first pose-invariant ADVD solution. Our results show that the presented solution achieves superior visual quality together with high lip-sync accuracy, enabling significantly improved results on real-world videos.
dc.description.submitter | MM2024
dc.faculty | Faculty of Science
dc.identifier | 0000-0001-6537-2727
dc.identifier.citation | Ranchod, Mayur. (2023). Improving audio-driven visual dubbing solutions using self-supervised generative adversarial networks. [Master's dissertation, University of the Witwatersrand, Johannesburg]. https://hdl.handle.net/10539/41906
dc.identifier.uri | https://hdl.handle.net/10539/41906
dc.language.iso | en
dc.publisher | University of the Witwatersrand, Johannesburg
dc.rights | ©2023 University of the Witwatersrand, Johannesburg. All rights reserved. The copyright in this work vests in the University of the Witwatersrand, Johannesburg. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of University of the Witwatersrand, Johannesburg.
dc.rights.holder | University of the Witwatersrand, Johannesburg
dc.school | School of Computer Science and Applied Mathematics
dc.subject | Audio-driven visual dubbing
dc.subject | Self-supervised
dc.subject | Generative adversarial networks
dc.subject | UCTD
dc.subject.other | SDG-9: Industry, innovation and infrastructure
dc.title | Improving audio-driven visual dubbing solutions using self-supervised generative adversarial networks
dc.type | Dissertation
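
The abstract above defines ADVD as a mapping from a talking-face video and a dubbing audio segment to a dubbed video. Purely as an illustrative sketch of that input/output contract, and not the dissertation's actual model or code, a hypothetical interface might look like the following; every name here (Frame, AudioSegment, DubbingModel, dub_video) is an assumption introduced for illustration.

```python
# Purely illustrative sketch of the ADVD input/output contract described in the
# abstract; NOT the dissertation's implementation. All names are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """Placeholder for a single frame of the talking-face video."""
    pixels: bytes


@dataclass
class AudioSegment:
    """Placeholder for the dubbing audio (waveform plus sample rate)."""
    samples: bytes
    sample_rate: int


class DubbingModel:
    """Stand-in for a trained GAN generator that would re-synthesize the
    speaker's mouth region to match a window of the dubbing audio."""

    def generate(self, frame: Frame, audio: AudioSegment) -> Frame:
        # A trained generator would return a modified frame here; this stub
        # simply passes the input frame through unchanged.
        return frame


def dub_video(frames: List[Frame], audio: AudioSegment, model: DubbingModel) -> List[Frame]:
    """ADVD as defined in the abstract: (talking-face video, dubbing audio) -> dubbed video."""
    return [model.generate(frame, audio) for frame in frames]


if __name__ == "__main__":
    video = [Frame(pixels=b"\x00") for _ in range(3)]
    dubbing_audio = AudioSegment(samples=b"\x00", sample_rate=16000)
    dubbed = dub_video(video, dubbing_audio, DubbingModel())
    print(f"Produced {len(dubbed)} dubbed frames")
```

The stub deliberately omits what the dissertation actually contributes (training on high-resolution data and pose invariance); it only shows the shape of the inputs and output that any ADVD solution, as described in the abstract, would accept and produce.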