Abstract:
We consider a parallel implementation of matrix-vector
multiplication (GEMV, Level 2 of the BLAS) for graphics processing units (GPUs)
using multiple-precision arithmetic based on the residue number system. In our
GEMV implementation, element-wise operations with multiple-precision vectors
and matrices consist of several parts, each of which is calculated by a separate
CUDA kernel. This design eliminates branch divergence when performing the
sequential parts of multiple-precision operations and allows the GPU's
resources to be fully utilized. An efficient data structure for storing arrays with
multiple-precision entries provides a coalesced access pattern to the GPU global
memory. We have performed a rounding error analysis and derived error bounds
for the proposed GEMV implementation. Experimental results show the high
efficiency of the proposed solution compared to existing high-precision packages
deployed on GPUs.
Key words and phrases: multiple-precision computations, BLAS, GEMV, parallel algorithms,
CUDA, GPU, residue number system.
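
To make the two central ideas of the abstract concrete, the following host-side C++ sketch shows (i) how a residue number system lets a large product be computed digit by digit with no carries between digits, and (ii) one possible structure-of-arrays layout in which the residues of consecutive array elements for the same modulus are adjacent in memory, which is the kind of layout that permits coalesced access on a GPU. Everything here is an illustrative assumption and not the implementation described above: the toy moduli set, the names RnsVector, encode, mul and decode, and the restriction to non-negative integers (the multiple-precision numbers of the paper also carry a sign and an exponent, which are omitted).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy moduli set: pairwise coprime, dynamic range M = 7*11*13*15 = 15015.
// A real multiple-precision package would use many more and much larger moduli.
constexpr std::size_t kNumModuli = 4;
constexpr std::array<std::uint64_t, kNumModuli> kModuli = {7, 11, 13, 15};

// Structure-of-arrays storage (an assumed layout): digits[i * n + k] holds the
// residue of element k modulo kModuli[i]. For a fixed modulus i, neighbouring
// elements k and k + 1 are neighbours in memory.
struct RnsVector {
    std::size_t n;
    std::vector<std::uint64_t> digits;
};

// Convert ordinary integers (each assumed to be < 15015) to residue form.
RnsVector encode(const std::vector<std::uint64_t>& values) {
    RnsVector v{values.size(),
                std::vector<std::uint64_t>(kNumModuli * values.size())};
    for (std::size_t i = 0; i < kNumModuli; ++i)
        for (std::size_t k = 0; k < v.n; ++k)
            v.digits[i * v.n + k] = values[k] % kModuli[i];
    return v;
}

// Element-wise product. Each (modulus, element) pair is independent of all
// others, so on a GPU every pair could be handled by its own thread.
RnsVector mul(const RnsVector& a, const RnsVector& b) {
    assert(a.n == b.n);
    RnsVector c{a.n, std::vector<std::uint64_t>(kNumModuli * a.n)};
    for (std::size_t i = 0; i < kNumModuli; ++i)
        for (std::size_t k = 0; k < a.n; ++k) {
            const std::size_t idx = i * a.n + k;
            c.digits[idx] = (a.digits[idx] * b.digits[idx]) % kModuli[i];
        }
    return c;
}

// Recover element k with the Chinese remainder theorem. The modular inverse
// is found by trial, which is acceptable only for toy-sized moduli.
std::uint64_t decode(const RnsVector& v, std::size_t k) {
    assert(k < v.n);
    std::uint64_t M = 1;
    for (std::uint64_t m : kModuli) M *= m;
    std::uint64_t x = 0;
    for (std::size_t i = 0; i < kNumModuli; ++i) {
        const std::uint64_t m = kModuli[i];
        const std::uint64_t Mi = M / m;
        std::uint64_t inv = 1;
        while (((Mi % m) * inv) % m != 1) ++inv;
        x = (x + (v.digits[i * v.n + k] * Mi % M) * inv) % M;
    }
    return x;
}
```

In this sketch the product is exact as long as the true result stays below the dynamic range 15015. The reconstruction step in decode is the sequential part of the computation; the element-wise step in mul is the part that parallelizes across moduli and elements.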