Abstract:
Algorithms are proposed for implementing the collective operations MPI_Bcast, MPI_Reduce, and MPI_Allreduce using the shared memory of multiprocessor servers. The algorithms create a shared memory segment containing a system of queues through which message blocks are transmitted. The software implementation is provided as an isolated coll/sharm component of the Open MPI library. Unlike existing algorithms, interaction with the queue system is organized with spinlocks and is designed to reduce the number of barrier synchronizations and atomic operations. In experiments on a server with the x86-64 architecture, the largest reduction in execution time compared to the coll/tuned component of the Open MPI library was a factor of 6.5 (85% less) for MPI_Bcast and a factor of 3.3 (70% less) for MPI_Reduce. Recommendations are given on which algorithms to use for different message sizes.