Improving audio-driven visual dubbing solutions using self-supervised generative adversarial networks

dc.contributor.author: Ranchod, Mayur
dc.contributor.supervisor: Klein, Richard
dc.date.accessioned: 2024-10-23T15:17:37Z
dc.date.available: 2024-10-23T15:17:37Z
dc.date.issued: 2023-09
dc.description: A dissertation submitted in fulfilment of the requirements for the degree of Master of Science, to the Faculty of Science, School of Computer Science & Applied Mathematics, University of the Witwatersrand, Johannesburg, 2023.
dc.description.abstract: Audio-driven visual dubbing (ADVD) is the process of accepting a talking-face video, along with a dubbing audio segment, as inputs and producing a dubbed video such that the speaker appears to be uttering the dubbing audio. ADVD aims to address the language barrier inherent in the consumption of video-based content, since a video can only be consumed by an audience familiar with the spoken language. Traditional solutions, such as subtitles and audio dubbing, hinder the viewer’s experience by obstructing the on-screen content or by introducing an unpleasant discrepancy between the speaker’s mouth movements and the dubbed audio. In contrast, ADVD strives to achieve a natural viewing experience by synchronizing the speaker’s mouth movements with the dubbing audio. A comprehensive survey of several ADVD solutions revealed that most existing solutions achieve satisfactory visual quality and lip-sync accuracy but are limited to low-resolution videos with frontal or near-frontal faces. Since this is in sharp contrast to real-world videos, which are high-resolution and contain arbitrary head poses, we present one of the first ADVD solutions trained with high-resolution data and also introduce the first pose-invariant ADVD solution. Our results show that the presented solution achieves superior visual quality while also achieving high measures of lip-sync accuracy, consequently enabling significantly improved results when applied to real-world videos.
dc.description.submitter: MM2024
dc.faculty: Faculty of Science
dc.identifier: 0000-0001-6537-2727
dc.identifier.citation: Ranchod, Mayur. (2023). Improving audio-driven visual dubbing solutions using self-supervised generative adversarial networks. [Master's dissertation, University of the Witwatersrand, Johannesburg]. https://hdl.handle.net/10539/41906
dc.identifier.uri: https://hdl.handle.net/10539/41906
dc.language.iso: en
dc.publisher: University of the Witwatersrand, Johannesburg
dc.rights: ©2023 University of the Witwatersrand, Johannesburg. All rights reserved. The copyright in this work vests in the University of the Witwatersrand, Johannesburg. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of the Witwatersrand, Johannesburg.
dc.rights.holder: University of the Witwatersrand, Johannesburg
dc.school: School of Computer Science and Applied Mathematics
dc.subject: Audio-driven visual dubbing
dc.subject: Self-supervised
dc.subject: Generative adversarial networks
dc.subject: UCTD
dc.subject.other: SDG-9: Industry, innovation and infrastructure
dc.title: Improving audio-driven visual dubbing solutions using self-supervised generative adversarial networks
dc.type: Dissertation
Files
Original bundle
Name: Ranchod_Improving_2023.pdf
Size: 3.46 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.43 KB
Description: Item-specific license agreed upon at submission