Improving audio-driven visual dubbing solutions using self-supervised generative adversarial networks
dc.contributor.author | Ranchod, Mayur
dc.contributor.supervisor | Klein, Richard
dc.date.accessioned | 2024-10-23T15:17:37Z
dc.date.available | 2024-10-23T15:17:37Z
dc.date.issued | 2023-09
dc.description | A dissertation submitted in fulfilment of the requirements for the degree of Master of Science to the Faculty of Science, School of Computer Science & Applied Mathematics, University of the Witwatersrand, Johannesburg, 2023.
dc.description.abstract | Audio-driven visual dubbing (ADVD) is the process of accepting a talking-face video and a dubbing audio segment as inputs and producing a dubbed video in which the speaker appears to be uttering the dubbing audio. ADVD aims to address the language barrier inherent in the consumption of video-based content, which arises because a video can only be consumed by an audience familiar with its spoken language. Traditional solutions hinder the viewer’s experience: subtitles obstruct the on-screen content, while audio dubbing introduces an unpleasant discrepancy between the speaker’s mouth movements and the dubbing audio. In contrast, ADVD strives to achieve a natural viewing experience by synchronizing the speaker’s mouth movements with the dubbing audio. A comprehensive survey of existing ADVD solutions revealed that most achieve satisfactory visual quality and lip-sync accuracy but are limited to low-resolution videos with frontal or near-frontal faces. Since this is in sharp contrast to real-world videos, which are high-resolution and contain arbitrary head poses, we present one of the first ADVD solutions trained on high-resolution data and introduce the first pose-invariant ADVD solution. Our results show that the presented solution achieves superior visual quality together with high lip-sync accuracy, enabling significantly improved results on real-world videos.
dc.description.submitter | MM2024
dc.faculty | Faculty of Science
dc.identifier | 0000-0001-6537-2727
dc.identifier.citation | Ranchod, Mayur. (2023). Improving audio-driven visual dubbing solutions using self-supervised generative adversarial networks. [Master's dissertation, University of the Witwatersrand, Johannesburg]. https://hdl.handle.net/10539/41906
dc.identifier.uri | https://hdl.handle.net/10539/41906
dc.language.iso | en
dc.publisher | University of the Witwatersrand, Johannesburg
dc.rights | ©2023 University of the Witwatersrand, Johannesburg. All rights reserved. The copyright in this work vests in the University of the Witwatersrand, Johannesburg. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of University of the Witwatersrand, Johannesburg.
dc.rights.holder | University of the Witwatersrand, Johannesburg
dc.school | School of Computer Science and Applied Mathematics
dc.subject | Audio-driven visual dubbing
dc.subject | Self-supervised
dc.subject | Generative adversarial networks
dc.subject | UCTD
dc.subject.other | SDG-9: Industry, innovation and infrastructure
dc.title | Improving audio-driven visual dubbing solutions using self-supervised generative adversarial networks
dc.type | Dissertation
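
The abstract above defines ADVD as a mapping from a talking-face video and a dubbing audio segment to a dubbed video. Purely as an illustrative sketch of that input/output contract, and not the dissertation's actual model or code, a hypothetical interface might look like the following; every name here (Frame, AudioSegment, DubbingModel, dub_video) is an assumption introduced for illustration.

```python
# Purely illustrative sketch of the ADVD input/output contract described in the
# abstract; NOT the dissertation's implementation. All names are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """Placeholder for a single frame of the talking-face video."""
    pixels: bytes


@dataclass
class AudioSegment:
    """Placeholder for the dubbing audio (waveform plus sample rate)."""
    samples: bytes
    sample_rate: int


class DubbingModel:
    """Stand-in for a trained GAN generator that would re-synthesize the
    speaker's mouth region to match a window of the dubbing audio."""

    def generate(self, frame: Frame, audio: AudioSegment) -> Frame:
        # A trained generator would return a modified frame here; this stub
        # simply passes the input frame through unchanged.
        return frame


def dub_video(frames: List[Frame], audio: AudioSegment, model: DubbingModel) -> List[Frame]:
    """ADVD as defined in the abstract: (talking-face video, dubbing audio) -> dubbed video."""
    return [model.generate(frame, audio) for frame in frames]


if __name__ == "__main__":
    video = [Frame(pixels=b"\x00") for _ in range(3)]
    dubbing_audio = AudioSegment(samples=b"\x00", sample_rate=16000)
    dubbed = dub_video(video, dubbing_audio, DubbingModel())
    print(f"Produced {len(dubbed)} dubbed frames")
```

The stub deliberately omits what the dissertation actually contributes (training on high-resolution data and pose invariance); it only shows the shape of the inputs and output that any ADVD solution, as described in the abstract, would accept and produce.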