Sparse array representations and some selected array operations on GPUs
Date
2014-09-01
Authors
Wang, Hairong
Abstract
A multi-dimensional data model provides a good conceptual view of the data in data warehousing and On-Line
Analytical Processing (OLAP). Such a data model is typically represented as a multi-dimensional array,
which is well suited to dense data. If the array is sparse, i.e., has few non-zero elements
relative to the product of the cardinalities of the dimensions, using a multi-dimensional array to represent the
data set requires extremely large memory space while the actual data elements occupy a relatively small fraction
of the space. Existing storage schemes for Multi-Dimensional Sparse Arrays (MDSAs) of dimension
k > 2 focus on optimizing storage utilization and offer little flexibility in data access efficiency.
Most efficient storage schemes for sparse arrays are limited to matrices, i.e., two-dimensional arrays. In
this dissertation, we introduce four storage schemes for MDSAs that handle the sparsity of the array with two
primary goals: reducing the storage overhead and maintaining efficient data element access. These schemes,
along with a well-known method referred to as Bit Encoded Sparse Storage (BESS), were evaluated and
compared on four basic array operations, namely construction of a scheme, large-scale random element access,
sub-array retrieval, and multi-dimensional aggregation. The four proposed storage schemes, together
with their evaluation results, are: (i) the extended compressed row storage (xCRS), which extends the CRS method
for sparse matrix storage to sparse arrays of higher dimensions and achieves the best data element access
efficiency among the methods compared; (ii) the bit encoded xCRS (BxCRS), which optimizes the storage
utilization of xCRS by applying data compression with run-length encoding while maintaining its
data access efficiency; (iii) a hybrid approach (Hybrid), which combines xCRS and BESS and provides the best
control over the balance between storage utilization and data manipulation efficiency; and (iv) the PATRICIA
trie compressed storage (PTCS), which uses a PATRICIA trie to store the valid non-zero array elements. PTCS
supports efficient data access and has the unique property of supporting update operations conveniently.
BESS performs best for multi-dimensional aggregation, closely followed by the other schemes.
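
As an illustration of the bit-encoding idea behind schemes such as BESS, the following is a minimal sketch that packs the indices of a non-zero element into one integer key and looks elements up in a sorted key list. It is not the dissertation's implementation: the field widths, the sorted-list layout, and the function names (bits_needed, encode, decode, build, lookup) are assumptions made for this example.

```python
import bisect

def bits_needed(cardinality):
    # Bits required to represent indices 0..cardinality-1 (at least 1 bit).
    return max(1, (cardinality - 1).bit_length())

def encode(index, shape):
    # Pack one index per dimension into a single integer key,
    # giving each dimension its own fixed-width bit field (assumed layout).
    key = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError("index out of range")
        key = (key << bits_needed(n)) | i
    return key

def decode(key, shape):
    # Recover the per-dimension indices from a packed key.
    index = []
    for n in reversed(shape):
        b = bits_needed(n)
        index.append(key & ((1 << b) - 1))
        key >>= b
    return tuple(reversed(index))

def build(entries, shape):
    # Store only the non-zero elements as parallel sorted key/value lists.
    pairs = sorted((encode(idx, shape), v) for idx, v in entries.items())
    return [k for k, _ in pairs], [v for _, v in pairs]

def lookup(keys, vals, index, shape, default=0):
    # Random element access by binary search on the packed key.
    k = encode(index, shape)
    pos = bisect.bisect_left(keys, k)
    if pos < len(keys) and keys[pos] == k:
        return vals[pos]
    return default
```

With this layout, storage grows with the number of non-zero elements rather than with the product of the dimension cardinalities, and an element absent from the key list is read as zero.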
We also addressed the problem of accelerating selected array operations using General-Purpose Computing
on Graphics Processing Units (GPGPU). The experimental results showed speedups
ranging from 2 to over 20 times on large-scale random element access and sub-array retrieval. In particular, we
used GPUs to compute the cube operator, a special case of multi-dimensional aggregation, using
BESS. This resulted in a speedup of 5 to 8 times compared with our CPU-only implementation. The main
contributions of this dissertation are the development, implementation, and evaluation of four efficient
schemes for storing multi-dimensional sparse arrays, and the use of the massive parallelism of GPUs for some
data warehousing operations.
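
For readers unfamiliar with the cube operator mentioned above, the following is a minimal CPU-only sketch. It assumes a sparse array held as a Python dictionary from index tuples to values and uses summation as the aggregate; the function name cube and this representation are illustrative assumptions, not the BESS-based or GPU implementation evaluated in the dissertation.

```python
from collections import defaultdict
from itertools import combinations

def cube(cells, ndim):
    # Aggregate (here: sum) the non-zero cells over every subset of dimensions.
    # result[kept] maps the indices of the kept dimensions to the aggregate.
    result = {}
    for r in range(ndim + 1):
        for kept in combinations(range(ndim), r):
            agg = defaultdict(int)
            for index, value in cells.items():
                agg[tuple(index[d] for d in kept)] += value
            result[kept] = dict(agg)
    return result
```

For a k-dimensional array this produces 2^k group-bys, from the grand total (no dimension kept) to the original cells (all dimensions kept), which is why the operation is a natural candidate for parallel hardware.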
Description
A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science. Johannesburg, 2014.