Abstract:
A parallel Cholesky factorization algorithm for sparse matrices has been implemented, based on the asynchronous task paradigm and accounting for the specifics of NUMA architecture. The execution of the numerical factorization stage and forward/backward substitution is represented as a directed acyclic graph (DAG), which removes synchronization barriers and enhances data access locality to improve the utilization efficiency of the computational device's memory hierarchy. Performance evaluation demonstrates good scalability compared to the highly optimized commercial package Intel MKL PARDISO, confirming the effectiveness of the proposed approach.