
I have been studying associative mapping in a cache. Every line in a cache has a unique address made up of the index and offset bits (which select the cache set and the position within the cache line) and the tag bits. But which algorithm actually decides where to put information from RAM into the cache?


Preliminaries

In the case of a fully-associative or set-associative cache, there are multiple places where a cache line could be placed, which usually involves evicting an existing cache line.

From the classic paper Design of CPU Cache Memories by Alan Jay Smith, the mean memory access time is

$$T = m \times T_m + T_h + E$$

where $m$ is the miss ratio, $T_m$ is the time to satisfy a cache miss, $T_h$ is the time to satisfy a cache hit, and $E$ is the overhead of everything else in the system.

This is the number that you would typically try to optimise for.
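
As a purely hypothetical worked example (the numbers below are illustrative, not taken from the paper): suppose $m = 0.03$, $T_m = 100$ cycles, $T_h = 4$ cycles, and $E \approx 0$. Then

$$T = 0.03 \times 100 + 4 + 0 = 7 \text{ cycles}$$

and halving the miss ratio to $0.015$ would bring the mean access time down to $5.5$ cycles, which is why the replacement policy's effect on $m$ matters so much.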

The only numbers that the cache replacement policy has any control over are $m$ and $T_m$. It can affect the miss ratio, but it can also affect the time to satisfy a cache miss, because some evictions are more expensive than others.

Fair warning: there is a huge design space to be explored here, and different algorithms tend to work better on different workloads. As a general rule, though, pretty much any cache-replacement policy works about as well as any other when the working set is smaller than the cache size.

A CPU has access to quite a lot of information beyond "what is currently sitting in the cache", so let's take a look at some questions you can ask.

Is there a store pending?

On modern CPUs, stores are often buffered to reduce latency and to support speculative execution. This is especially useful if sequential consistency is not needed. For example, an extremely common case is issuing multiple stores to adjacent locations in the same cache line. These can be buffered until the cache line has been read, an optimisation known as write-combining.
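
As a rough, software-level illustration of the idea (not how any particular CPU implements it; the names and the 64-byte line size are assumptions for the sketch), here is what a write-combining buffer does:

```python
# Sketch of a write-combining store buffer (illustrative only).
# Stores to the same 64-byte cache line are merged into one entry
# and applied together once the cache line is available.

LINE_SIZE = 64

class WriteCombiningBuffer:
    def __init__(self):
        # line address -> {offset within line: byte value}
        self.pending = {}

    def store(self, address, value):
        line = address // LINE_SIZE * LINE_SIZE
        offset = address % LINE_SIZE
        # Adjacent stores to the same line combine into one pending entry.
        self.pending.setdefault(line, {})[offset] = value

    def lines_with_pending_stores(self):
        # A replacement policy would avoid evicting these lines.
        return set(self.pending)

    def drain(self, line, cache_line_bytes):
        # Apply all combined stores once the cache line has been read.
        for offset, value in self.pending.pop(line, {}).items():
            cache_line_bytes[offset] = value
        return cache_line_bytes
```

A replacement policy could consult something like `lines_with_pending_stores()` when choosing a victim, which leads to the next point.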

If there is a pending store to a cache line, it would probably be a bad idea to evict that particular one.

Does it need to be flushed?

If a cache line is dirty, then it needs to be flushed to somewhere. This would typically be to RAM or a lower-level cache, but in a multiprocessing system, it's sometimes to another CPU's cache.

If the cache line is clean, then (to a first approximation) it just needs to be forgotten. This is much cheaper, so it may be worth prioritising clean cache lines.
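
A minimal sketch of that preference, assuming each candidate line carries a dirty bit (the structure here is hypothetical, purely for illustration):

```python
import random

def choose_victim(candidates):
    """Pick an eviction victim, preferring clean lines over dirty ones.

    `candidates` is a list of objects with a boolean `dirty` attribute;
    the attribute name is made up for this sketch.
    """
    clean = [line for line in candidates if not line.dirty]
    # Evicting a clean line just forgets it; a dirty line must be written
    # back first, so only fall back to dirty lines when there is no choice.
    pool = clean if clean else candidates
    return random.choice(pool)
```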

Was it used recently?

Part of the theory behind why caching works in the first place is that the working set of a program, at any particular point in time, is smaller than the program's total size. Therefore, the less recently a cache line has been used, the less likely it is to be needed soon.

Is there a detectable access pattern?

Modern CPUs, in particular, often have hardware to detect one specific case: scanning through an area of (virtual!) memory sequentially. Iterating through arrays, copying blocks of memory, and so on are such common operations that this is almost always worth doing.

There are some optimisations a CPU can do when it detects this (a sketch of a simple detector follows the list):

  • Prefetch the next likely cache line in the sequential scan, even before it's been requested.
  • Mark or record the previously used cache line in the scan as the most likely candidate for eviction. So, for example, the CPU might eagerly write-through such a cache line, even if this is not normally a write-through cache.
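
Here is a simplified sketch of that kind of sequential-stream detection (real prefetchers track multiple streams, arbitrary strides, and confidence counters; the 64-byte line size and the threshold are assumptions):

```python
LINE_SIZE = 64

class SequentialStreamDetector:
    """Detects accesses that walk through memory one cache line at a time.

    Returns (line to prefetch, line to mark as an eviction candidate),
    or (None, None) when no stream has been detected yet.
    """
    def __init__(self, threshold=2):
        self.last_line = None
        self.run_length = 0
        self.threshold = threshold

    def access(self, address):
        line = address // LINE_SIZE
        if line == self.last_line:
            return None, None        # still inside the same cache line
        if self.last_line is not None and line == self.last_line + 1:
            self.run_length += 1     # walked into the next line: extend the run
        else:
            self.run_length = 0      # anything else breaks the run
        self.last_line = line

        if self.run_length >= self.threshold:
            # Sequential stream detected: prefetch ahead, and hint that the
            # line behind us is a good candidate for eviction.
            return line + 1, line - 1
        return None, None
```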

Some example policies

With all that in mind, here are some specific policies. Bear in mind that all of the practical ones could be supplemented with some of the special-case optimisations mentioned above.

Bélády's algorithm

The optimal algorithm is Bélády's algorithm, also known as the clairvoyant algorithm. You simply evict the cache line that will not be needed for the longest time in the future.

This algorithm has the disadvantage that it's impossible to implement, because it is not an online algorithm: it needs information about the future, not just the past and the present. But it's a useful theoretical tool for comparing other algorithms. No algorithm can do better than this, so an algorithm which gets close is, by definition, pretty good.
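
Since it needs the complete future access sequence, Bélády's algorithm can only be run offline, e.g. in a simulator when evaluating other policies. A minimal sketch, assuming we are handed that (unknowable) future trace:

```python
def belady_victim(cache_lines, future_accesses):
    """Pick the resident line whose next use lies furthest in the future.

    `cache_lines` is the set of currently resident line addresses, and
    `future_accesses` is the (impossible-to-know) list of upcoming accesses.
    """
    victim, farthest = None, -1
    for line in cache_lines:
        try:
            next_use = future_accesses.index(line)
        except ValueError:
            # Never used again: evicting it can never cause a future miss.
            return line
        if next_use > farthest:
            victim, farthest = line, next_use
    return victim
```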

While we're on the topic, it would also be useful to know the worst case, and it turns out that, in the absence of information about the future, there is no deterministic online "worst" cache-replacement algorithm either. There are, however, some good (if that is the right word) approximations. For more on this, see Agrawal et al., The Worst Page-Replacement Policy.

Least recently used

The idea behind LRU is that you keep track of when each cache line is used, and if a cache line needs to be evicted, you evict the one that was used least recently. This need not be expensive; you could use a counter, or a queue in which an access moves that cache line to the front and victims are taken from the back.

This seems like a good algorithm, but it has one very large drawback.

If the working set is smaller than or equal to the cache size, this algorithm performs extremely well. But for the case where the working set size is only just larger than the cache size, this algorithm can perform worst of all.

Consider a simple case of a program that accesses addresses 1, 2, 3, 4, and 5 in order, over and over again, and further suppose that the cache will only fit four addresses. We start by loading 1, 2, 3, then 4. When it comes time to access 5, an LRU replacement policy will eject 1, since it was least recently used. This is, however, also the next address that will be needed. And then, when it comes time to access 1 again, 2 will be evicted, since it was least recently used, which also happens to be the next address that will be needed.

So in this case, even though almost all of the working set will fit in cache, LRU has a cache miss rate of close to 100%.
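
Here is a sketch of an LRU set using the queue idea from above, plus a simulation of that access pattern; with 5 addresses cycling through a 4-way set, every single access misses:

```python
from collections import OrderedDict

class LRUSet:
    """A single cache set managed as a move-to-front queue (illustrative)."""
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()           # most recently used at the end

    def access(self, address):
        if address in self.lines:
            self.lines.move_to_end(address)  # hit: mark as most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # miss: evict the least recently used
        self.lines[address] = None
        return False

# Working set of 5 addresses, cache of 4 ways: LRU thrashes.
cache = LRUSet(ways=4)
accesses = [1, 2, 3, 4, 5] * 100
misses = sum(0 if cache.access(a) else 1 for a in accesses)
print(f"miss rate: {misses / len(accesses):.0%}")   # prints: miss rate: 100%
```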

A better algorithm in this scenario would be...

Random Replacement

Just pick one at random. Well, pseudo-random, but you get the idea. On average, this algorithm works extremely well considering how simple it is, and many CPUs (e.g. ARM) use it. It also has the advantage that no history information is needed.
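
The sketch is about as short as a replacement policy can get (real hardware would use something like a small LFSR rather than a software PRNG):

```python
import random

def random_victim(ways):
    # No per-line history at all: just pick any way in the set.
    return random.randrange(ways)
```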

Not Recently Used

This is kind of the converse of LRU. Essentially, you record which cache line was most recently used, and eject some other one. If you choose the other one at random, this is known as Not Recently Used; if you approximate an LRU cache somehow, it might be called Pseudo-LRU.
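
A sketch of the simplest variant described above: remember only the most recently used way, and evict a random one of the others. (Pseudo-LRU variants typically replace the random choice with a small tree of bits, not shown here.)

```python
import random

class NRUSet:
    """Evict any way except the most recently used one (illustrative only)."""
    def __init__(self, ways):
        self.ways = ways
        self.mru = None          # index of the most recently used way

    def touch(self, way):
        # Called on every hit or fill to remember the most recent way.
        self.mru = way

    def victim(self):
        candidates = [w for w in range(self.ways) if w != self.mru]
        return random.choice(candidates)
```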

NRU policies are very popular for implementing page replacement in operating systems and databases. You can think of the BSD two-handed clock algorithm as a kind of NRU policy, for example.

Bélády approximations

One broad set of approaches is based on the idea of using the same techniques used to predict branches to also predict future cache needs. The theory here is that some instructions are more likely to generate cache-friendly accesses than others, so by using history and program-counter information, we might be able to use information about the past to predict the future.

Some example algorithms that use this include Hawkeye, Harmony, and Mockingjay.
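
As a very rough sketch of the flavour of these predictors (this is not how Hawkeye or Mockingjay actually work; the table size and counter width are invented for illustration), a program-counter-indexed reuse predictor might look like this:

```python
class ReusePredictor:
    """PC-indexed table of saturating counters (illustrative only).

    The idea: if lines loaded by a given instruction tend to be reused
    before eviction, predict reuse for its future loads and protect them;
    otherwise insert them as preferred eviction candidates.
    """
    def __init__(self, entries=256):
        self.entries = entries
        self.table = [1] * entries              # 2-bit counters (0..3)

    def _index(self, pc):
        return pc % self.entries

    def train(self, pc, was_reused):
        i = self._index(pc)
        if was_reused:
            self.table[i] = min(self.table[i] + 1, 3)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

    def predicts_reuse(self, pc):
        return self.table[self._index(pc)] >= 2
```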

Re-Reference Interval Prediction

RRIP is a flexible family of policies proposed by Intel in 2010. It uses an integer (called the re-reference prediction value, or RRPV) associated with each cache line, which correlates with when the cache line is expected to be reused. When a cache line is needed, the one with the highest RRPV is evicted.

There is a large design space here; here is one example (a code sketch follows after the discussion below):

  • When a cache line is loaded, its RRPV is set to the maximum value.
  • When a cache line is reused, its RRPV is set to zero.
  • When a cache line needs to be evicted, any cache line with the maximum RRPV is evicted. If none of them have the maximum value, then all RRPVs in the set are incremented until one does.

This policy performs extremely well when "scanning" (i.e. accessing a lot of memory consecutively, as with a memory copy), because it evicts cache lines that are used only once, while also allowing reused cache lines to stay in cache. However, it can cause thrashing if the working set is larger than cache, because it essentially degenerates to "most recently used".

An enhancement is to randomly set the RRPV of newly-loaded cache lines either to the maximum value or the maximum value minus one. This gives some cache lines a random chance to "stick" in cache.
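
Here is a sketch of the example policy above, with the random-insertion enhancement as an option (the 2-bit RRPV width and the insertion bias are common choices, not requirements):

```python
import random

RRPV_MAX = 3   # 2-bit re-reference prediction values

class RRIPSet:
    """One set of an RRIP-managed cache (illustrative sketch)."""
    def __init__(self, ways, random_insertion=False):
        self.lines = [None] * ways           # addresses (None = empty way)
        self.rrpv = [RRPV_MAX] * ways
        self.random_insertion = random_insertion

    def access(self, address):
        if address in self.lines:
            # Reuse: predict a near-immediate re-reference.
            self.rrpv[self.lines.index(address)] = 0
            return True                      # hit
        way = self._victim()
        self.lines[way] = address
        if self.random_insertion:
            # Enhancement: occasionally insert at RRPV_MAX - 1 so some
            # lines get a random chance to "stick" even under a scan.
            self.rrpv[way] = RRPV_MAX if random.random() < 0.97 else RRPV_MAX - 1
        else:
            self.rrpv[way] = RRPV_MAX
        return False                         # miss

    def _victim(self):
        # Evict a line with the maximum RRPV, ageing the set until one exists.
        while True:
            for way, value in enumerate(self.rrpv):
                if value == RRPV_MAX:
                    return way
            self.rrpv = [v + 1 for v in self.rrpv]
```

With random_insertion enabled, most new lines still arrive at the maximum RRPV, so a scan flows straight through the set, but the occasional line inserted one below the maximum survives long enough to be reused.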

Set dueling

Given that no single policy is likely to work best for all workloads, it can sometimes pay to try more than one.

In a multi-way set-associative cache, you could assign some of the sets to one policy and some of the sets to a different policy, along with a counter that monitors which policy is doing better at the moment, and adjust which sets use which policy accordingly. This is known as set dueling.
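
A sketch of the counter mechanism, assuming a fixed handful of "leader" sets per policy (the constants and the 10-bit counter are typical of descriptions of set dueling, not requirements):

```python
class SetDueling:
    """Follower sets adopt whichever policy the leader sets show is winning."""
    PSEL_MAX = 1023                      # 10-bit saturating selection counter

    def __init__(self, leaders_per_policy=32):
        # A few sets are permanently dedicated to each policy ("leaders");
        # all other sets follow the currently winning policy.
        self.leaders_a = set(range(leaders_per_policy))
        self.leaders_b = set(range(leaders_per_policy, 2 * leaders_per_policy))
        self.psel = self.PSEL_MAX // 2   # start undecided

    def record_miss(self, set_index):
        # A miss in a leader set counts against that leader's policy.
        if set_index in self.leaders_a:
            self.psel = min(self.psel + 1, self.PSEL_MAX)
        elif set_index in self.leaders_b:
            self.psel = max(self.psel - 1, 0)

    def policy_for(self, set_index):
        if set_index in self.leaders_a:
            return "A"
        if set_index in self.leaders_b:
            return "B"
        # Low counter => policy A's leaders miss less often, so followers use A.
        return "A" if self.psel < self.PSEL_MAX // 2 else "B"
```

Leader sets always run their own policy; each miss in a leader set nudges the counter against that policy, and the follower sets simply copy whichever policy the counter currently favours.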
