Enhanced feature acquisition through the use of infrared imaging for human detection
Date
2020
Authors
Kunene, Dumisani
Abstract
This thesis describes the implementation and review of human detection approaches across different spectral imaging modalities. While significant progress on human detection has been made in the past, human detection in static images remains a challenging research problem. The performance of popular human detection systems remains inferior to the visual capability of people and animals [1, 2, 3, 4, 5, 6]. Most human detection methods are evaluated on visible-light images. However, visible-light images can contain limited information in poorly illuminated environments. Other complexities arise from random colour patterns in the background regions of the image and on the clothes of pedestrians. In most cases, this colour clutter degrades image representation methods that rely solely on edge information. With infrared imaging, the heat radiated from objects is often uniform and independent of the colour texture, resulting in less cluttered images. This work evaluates the significance of using imaging-infrared (IIR) footage instead of visible-light images for the human detection problem. The underlying supposition is that the choice of extracted information has a large impact on the robustness of statistical learning systems. To test this supposition, support vector machine (SVM), extreme learning machine (ELM) and convolutional neural network (CNN) classifiers were trained and tested with three different datasets. The datasets consisted of the newly created infrared-based pedestrian dataset named Significance of Near Infrared Dataset (SIGNI) [7], along with the popular National Institute for Research in Computer Science and Automation (INRIA) and National ICT Australia (NICTA) pedestrian colour datasets [8, 9]. The classifiers were first trained with colour images to determine the parameters that yield high classification rates on unseen samples.
Once satisfactory results were obtained, the same parameters were used for training the classifiers with infrared samples. This ensured unbiased classification comparisons across the different spectral images. The widely acclaimed histograms of oriented gradients (HOG) features were used as the human descriptor for the SVMs and ELMs and tested against the autonomously learned feature maps of the CNNs. This work therefore provides further findings on the application of shallow learning and deep learning models in human detection, and entails further experimental research on image processing methods and the classification of human beings in static images. The main rationale of this research is to address the lack of findings on the use of sufficiently large IIR training datasets for extracting better image features for human detection systems. Initially, the hypothesis was tested by examining the classification rate of the classifiers on six relatively small datasets of the same size, and thereafter on larger datasets. The SVMs obtained an average classification rate of 98.33% on the three infrared datasets, a performance gain of 0.6% over the average of 97.73% that was obtained with the visual datasets. More apparent results were observed with the ELM classifier, which achieved a gain of 8.4% with an average classification rate of 95.86% on infrared samples. The CNNs achieved the best overall classification rate of the three classifiers: a score of 99.5% was obtained with the infrared images, an average performance gain of 2.17% over the visual images. Performance evaluation on larger datasets showed a similar outcome, as all classifiers obtained performance gains with infrared samples. The SVMs obtained a 2.95% increase and the ELMs a 6.25% advantage. As in the previous experiments, the CNNs showed only a small gain of 1.66%, starting from classification rates that were already higher than those of the other two classifiers. Summing the average performance gain of each classifier on the small and the larger datasets and dividing it by two yields the overall performance gain of each classifier. The overall performance gain of the SVM classifier across all experiments was 1.78%, with the ELMs showing the largest gain of 7.20% and the CNNs obtaining a gain of 1.92%. The best performing classifier (the CNN) was selected for assessing the human detection problem over the different spectral images. The assessment was conducted by running a sliding-window detector over image pyramids built from natural-scene images with ground-truth information. The overall classification rate on the infrared testing scenarios was 4% higher than the average classification rate over the colour testing scenarios. Studying classification rates only when comparing classifiers can be misleading. Precision rates only reflect the accuracy of a classifier on the images it retrieved as positive, and thus neglect its accuracy over the entire ground-truth data. For instance, the classifiers that were trained with colour images (INRIA and NICTA) had higher precision rates, yet failed to retrieve a remarkably large number of positive ground-truth regions, achieving low recall rates of 0.043 and 0.077 respectively. This equates to only 5.98% of the positive ground-truth boxes being retrieved, versus 63% retrieved by the infrared classifier. The localisation experiments addressed an imbalanced binary classification problem, where one of the classes (negative background samples) held the overwhelming majority of the data samples. In such cases, precision rates are not a good measure, because high values can easily be obtained by classifiers that are biased towards the majority class. Instead, the recall rate becomes a better measure, as it shows the model's ability to find relevant samples in a testing scenario that has considerably more irrelevant samples. The ground-truth results show that the precision and recall rates of the infrared model were both fair, unlike the visual models, where the classifiers had higher precision rates and substantially poorer recall rates. Therefore, throughout all experiments, all classifiers obtained better results with infrared images than with visual images, and the CNNs performed better than the two shallow learning classifiers.
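The point about precision and recall under class imbalance can be made concrete with a small sketch. The counts below are hypothetical and chosen only to reproduce the pattern described in the abstract (they are not the thesis results), and the names `Counts`, `conservative` and `balanced` are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts for a binary detector (hypothetical values)."""
    tp: int  # ground-truth people that were detected
    fp: int  # background windows wrongly flagged as people
    fn: int  # ground-truth people that were missed
    tn: int  # background windows correctly rejected


def precision(c: Counts) -> float:
    """Fraction of retrieved positives that are correct."""
    retrieved = c.tp + c.fp
    return c.tp / retrieved if retrieved else 0.0


def recall(c: Counts) -> float:
    """Fraction of ground-truth positives that were retrieved."""
    relevant = c.tp + c.fn
    return c.tp / relevant if relevant else 0.0


def accuracy(c: Counts) -> float:
    """Classification rate over all windows, positives and negatives."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Hypothetical scene: 100 people and 10,000 background windows.
# A detector biased towards "background" rarely fires, so it makes few
# false positives but misses most people.
conservative = Counts(tp=6, fp=1, fn=94, tn=9999)
# A detector that fires more often finds most people at the cost of
# more false positives.
balanced = Counts(tp=63, fp=40, fn=37, tn=9960)

if __name__ == "__main__":
    for name, c in (("conservative", conservative), ("balanced", balanced)):
        print(f"{name}: precision={precision(c):.3f} "
              f"recall={recall(c):.3f} accuracy={accuracy(c):.4f}")
```

In this illustrative setting both detectors reach a classification rate above 99%, and the conservative one has the higher precision, yet it retrieves only 6% of the people in the scene. That is why recall is the more informative measure when background windows dominate.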
Description
A dissertation submitted in partial fulfilment of the requirements for the degree Master of Science, School of Computer Science and Applied Mathematics, Faculty of Science, University of the Witwatersrand, 2020