
Computer Research and Modeling, 2023 Volume 15, Issue 3, Pages 639–656 (Mi crm1080)


MODELS IN PHYSICS AND TECHNOLOGY

Reducing miss rate in a non-inclusive cache with inclusive directory of a chip multiprocessor

Yu. A. Nedbailo (a,b), A. V. Surchenko (a), I. N. Bychkov (a,b)

(a) MCST JSC, 108 Profsoyuznaya st., Moscow, 117437, Russia
(b) INEUM im. I. S. Bruka, 24 Vavilova st., Moscow, 119334, Russia

Abstract: Although the era of exponential performance growth in computer chips has ended, processor core counts have reached 16 or more even in general-purpose desktop CPUs. As DRAM throughput is unable to keep pace with this growth in computing power, CPU designers need to find ways of lowering memory traffic per instruction. The straightforward way to do this is to reduce the miss rate of the last-level cache. Assuming a “non-inclusive cache, inclusive directory” (NCID) scheme is already implemented, three ways of further reducing the cache miss rate were studied.
The first is to achieve more uniform usage of cache banks and sets by employing hash-based interleaving and indexing. In experiments on the SPEC CPU2017 refrate tests, even the simplest XOR-based hash functions demonstrated performance increases of 3.2%, 9.1%, and 8.2% for CPU configurations with 16, 32, and 64 cores and last-level cache banks, comparable to the results of more complex matrix-, division-, and CRC-based functions.
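As a rough illustration of hash-based indexing (not the particular hash functions evaluated in the paper), the C sketch below XOR-folds higher physical-address bits onto the bank- and set-selection bits so that strided access patterns spread more evenly. The 64-byte line, 64-bank, 8192-set geometry is an assumption made for the example.

#include <stdint.h>

/* Illustrative sketch only: one possible XOR-based hash for selecting a
 * last-level cache bank and set from a physical address.  The bit widths
 * below are assumptions for the example, not the configuration studied
 * in the paper. */
#define LINE_BITS 6   /* 64-byte cache line  */
#define BANK_BITS 6   /* 64 banks            */
#define SET_BITS  13  /* 8192 sets per bank  */

static inline unsigned bank_index(uint64_t paddr)
{
    uint64_t line = paddr >> LINE_BITS;
    /* XOR-fold higher address bits onto the bank-selection bits. */
    return (unsigned)((line ^ (line >> BANK_BITS) ^ (line >> (2 * BANK_BITS)))
                      & ((1u << BANK_BITS) - 1));
}

static inline unsigned set_index(uint64_t paddr)
{
    uint64_t s = paddr >> (LINE_BITS + BANK_BITS);
    /* Same idea for the set index within a bank. */
    return (unsigned)((s ^ (s >> SET_BITS)) & ((1u << SET_BITS) - 1));
}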
The second optimisation reduces data replication across cache levels by automatically switching to the exclusive scheme when it appears optimal. A known scheme of this type, FLEXclusion, was modified for use in NCID caches and showed average performance gains of 3.8%, 5.4%, and 7.9% for 16-, 32-, and 64-core configurations.
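FLEXclusion-style adaptation is usually described in terms of set dueling between the two fill policies; the C sketch below shows one plausible selector of that kind. The counter width, saturation limits, and the notion of a traffic "cost" are assumptions for illustration, and the NCID-specific modification introduced in the paper is not reproduced here.

#include <stdint.h>

/* Illustrative sketch only: a set-dueling style selector, in the spirit
 * of FLEXclusion, choosing between non-inclusive and exclusive fill
 * policies based on observed interconnect traffic. */
enum fill_policy { NON_INCLUSIVE, EXCLUSIVE };

static int32_t psel;            /* saturating policy-selection counter */
#define PSEL_MAX   1023
#define PSEL_MIN  (-1024)

/* Called when a sampled "leader" set generates extra traffic under its
 * dedicated policy; cost is an assumed per-event weight. */
static void report_traffic(enum fill_policy leader, int cost)
{
    if (leader == NON_INCLUSIVE) {
        psel += cost;
        if (psel > PSEL_MAX) psel = PSEL_MAX;
    } else {
        psel -= cost;
        if (psel < PSEL_MIN) psel = PSEL_MIN;
    }
}

/* Follower sets adopt whichever policy currently causes less traffic. */
static enum fill_policy current_policy(void)
{
    return (psel > 0) ? EXCLUSIVE : NON_INCLUSIVE;
}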
The third optimisation is to increase the effective cache capacity using compression. The compression rate of the inexpensive and fast BDI*-HL (Base-Delta-Immediate Modified, Half-Line) algorithm, designed for NCID, was measured, and the respective increase in cache capacity yielded an average performance increase of roughly 1%.
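The base-delta test at the heart of BDI-style compression can be sketched as follows. This shows only the generic 8-byte-base, 1-byte-delta case and omits the half-line handling that distinguishes the paper's BDI*-HL variant, as well as the other base/delta size combinations of the original BDI algorithm.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch only: a 64-byte line is treated as eight 64-bit
 * words; if every word differs from the first one (the base) by a delta
 * that fits in one signed byte, the line can be stored as an 8-byte base
 * plus eight 1-byte deltas (64 bytes -> 16 bytes). */
#define WORDS_PER_LINE 8

static bool compressible_base8_delta1(const uint64_t line[WORDS_PER_LINE])
{
    uint64_t base = line[0];
    for (int i = 1; i < WORDS_PER_LINE; i++) {
        int64_t delta = (int64_t)(line[i] - base);
        if (delta < INT8_MIN || delta > INT8_MAX)
            return false;   /* delta does not fit in one signed byte */
    }
    return true;
}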
All three optimisations can be combined; together, they demonstrated performance gains of 7.7%, 16%, and 19% for CPU configurations with 16, 32, and 64 cores and banks, respectively.

Keywords: multicore processor, memory subsystem, distributed shared cache, NCID, XOR-based hash function, data compression.

UDC: 004.318

Received: 14.04.2023
Accepted: 03.05.2023

Language: English

DOI: 10.20537/2076-7633-2023-15-3-639-656
