The impact of encoded features from action classifiers on dense video captioning performance


Date

2021

Authors

Bhugwan, Dhruv


Abstract

Action classification is the process of identifying what action is being performed in a video. Dense video captioning is the process of identifying the time segments within a video where individual events occur and describing each event with a caption. 3D Convolutional Neural Networks (3D CNNs) pretrained on an action classification dataset are used to encode a video from a video captioning dataset into a set of features. These features are then passed to temporal action proposal and captioning modules, which identify the time segments and generate their respective captions. In this work, we consider only the 3D CNNs used to generate features on the ActivityNet Captions dataset, in order to investigate their impact on dense video captioning performance. Features generated with the C3D architecture pretrained on the Sports1M dataset are commonly used because of their availability. To improve feature performance, we utilize the Kinetics-368 dataset, which, unlike the Sports1M dataset, contains not only sporting action classes but everyday action classes as well. We propose a 3D version of the MobileNetV3 architecture to investigate the performance of resource-efficient action classifiers as feature extractors. Furthermore, we propose a 3D attention estimator module that can be integrated into existing 3D CNN architectures. We integrate the 3D attention estimator module into the C3D and 3D MobileNetV3 architectures and train C3D, 3D MobileNetV3 and their attention-augmented counterparts on the Kinetics-368 dataset. Thereafter, we generate a set of features on the ActivityNet Captions dataset using each of these 3D CNNs. We show empirically that classification accuracy on the Kinetics-368 dataset improves when the 3D attention estimator module is integrated into the C3D architecture, where we found gains of 8.87%.
The 3D MobileNetV3 architecture obtained the best classification accuracy on the Kinetics-368 dataset, but the structure of the network bottlenecked the amount of information provided to the attention estimator module, resulting in a performance decrease of 2.63%; this indicates that the structure of the base architecture must be considered before integrating the 3D attention estimator module. We find that the features generated using 3D MobileNetV3 suffer the largest loss of information when PCA is applied to them, yet they achieve higher scores in half of our temporal action proposal and caption metrics. This suggests that classification performance is indicative of temporal proposal and captioning performance. Furthermore, we show that the scale of the dataset used to pretrain the 3D CNNs for feature extraction is a large contributing factor to dense video captioning performance: the Sports1M dataset has approximately 6 times more videos per action class than the Kinetics-368 dataset. Across all metrics, we find that our features obtained temporal action proposal and caption scores far lower than the features generated with the C3D architecture pretrained on the Sports1M dataset.
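The loss of information attributed to PCA above can be quantified as the fraction of feature variance the retained components explain. The sketch below illustrates this with a PCA implemented via singular value decomposition; the clip count (1,000) and feature dimensions (1,280 reduced to 500) are illustrative assumptions, not the dissertation's actual settings.

```python
import numpy as np

# Stand-in for a matrix of clip-level 3D CNN features:
# rows = video clips, columns = feature dimensions (sizes are illustrative).
rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 1280))

# PCA via SVD: centre the features, then project onto the top-k
# right singular vectors (the principal components).
centred = features - features.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centred, full_matrices=False)

k = 500
reduced = centred @ vt[:k].T  # shape: (1000, 500)

# Fraction of total variance retained by the k components;
# lower values mean more information is discarded by PCA.
retained = (singular_values[:k] ** 2).sum() / (singular_values ** 2).sum()
```

Comparing `retained` across feature sets extracted by different 3D CNNs gives one concrete way to compare how much information each loses under the same PCA reduction.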

Description

A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science in Computer Science, 2021
