Chunked extendible arrays and its integration with the global array toolkit for parallel image processing

Abstract
Several meetings of the Extremely Large Databases Community for large scale scientific applications have advocated the use of multidimensional arrays as the appropriate model for representing scientific databases. Scientific databases gradually grow to massive sizes of the order of terabytes and petabytes. As such, the storage of such databases requires efficient dynamic storage schemes where the array is allowed to arbitrarily extend the bounds of the dimensions. Conventional multidimensional array representations in today’s programming environments do not extend or shrink their bounds without relocating elements of the data-set. In general extendibility of the bounds of the dimensions is limited to only one dimension. This thesis presents a technique for storing dense multidimensional arrays by chunks such that the array can be extended along any dimension without compromising the access time of an element. This is done with a computed access mapping function that maps the k-dimensional index onto a linear index of the storage locations. This concept forms the basis for the implementation of an array file of any number of dimensions, where the bounds of the array dimension can be extended arbitrarily. Such a feature currently exists in the Hierarchical Data Format version 5 (HDF5). However, extending the bound of a dimension in the HDF5 array file can be unusually expensive in time. Such extensions, in our storage scheme for dense array files, can be performed while still accessing elements of the array at orders of magnitude faster than in HDF5 or conventional array-files. We also present Parallel Chunked Extendible Dense Array (PEXTA), a new parallel I/O model for the Global Array Toolkit. PEXTA provides the necessary Application Programming Interface (API) for explicit data transfer between the memory resident global array and its secondary storage counterpart but also allows the persistent array to be extended on any dimension without compromising the access time of an element or sub-array elements. Such APIs provide a platform for high speed and parallel hyperspectral image processing without performance degradation, even when the imagery files undergo extensions.
Description
A thesis submitted to the Faculty of Engineering and the Built Environment in fulfilment of the requirements for the degree of Doctor of Philosophy, 2016
Online resource (xii, 151 leaves)
Keywords
Citation
Nimako, Gideon (2016) Chunked extendible arrays and its integration with the global array toolkit for parallel image processing, University of the Witwatersrand, <http://hdl.handle.net/10539/22332>
Collections