8

One of the requirements for a coherent memory system is write serialization - "two writes to address X by any two processors are observed in the same order by all processors". I am not sure how this condition would be met when the CPU cores have a store buffer into which stores are retired before updating the memory hierarchy.

Suppose cores A and B both retire a store to address X (which is cached by both A and B), and these stores are now sitting in the store buffers, not yet having updated the cache. Loads from address X on both cores could now retire after obtaining their value from the store buffer, i.e., loads from X on core A see the value written by A's store, and loads from X on core B see the value written by B's store. Say the store to X from core A now sends a read-upgrade (invalidate) message to core B. What happens next on core B? After A's coherence transaction has completed, the store to X on B updates the local cache, so once both stores have completed, the final value of X is the one written by B.

Above, the order of values seen by A for X is the write by A followed by the write by B, but that does not seem to be the order seen by B. So the question is: how is write serialization enforced when store buffers are used?
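To make the scenario concrete, here is a rough Java sketch of what I mean (the thread bodies stand in for cores A and B; everything is just for illustration, and the field is deliberately plain, i.e. non-volatile, so nothing forces the stores to become visible before the loads):

```java
// Both "cores" store to the same location and then load it. With a store
// buffer and no fences, each thread can legally observe its own store first,
// even though the memory system still picks a single final order for the
// two stores.
public class SameAddressStores {
    static int x = 0;          // plain field: no ordering or visibility guarantees

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> {
            x = 1;             // store by "core A" (may sit in A's store buffer)
            int seenByA = x;   // load by A: forwarded from A's store buffer -> 1
            System.out.println("A saw " + seenByA);
        });
        Thread b = new Thread(() -> {
            x = 2;             // store by "core B"
            int seenByB = x;   // load by B: forwarded from B's store buffer -> 2
            System.out.println("B saw " + seenByB);
        });
        a.start(); b.start();
        a.join();  b.join();
        // Coherence still serializes the two stores: the final value of x is
        // whichever store won the coherence order, and it is the same for
        // every observer that reads x afterwards.
        System.out.println("final value of x = " + x);
    }
}
```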

(This is based on my understanding of the various material I have read; I would appreciate it if anyone can point to sources that discuss these issues in detail.)

vln3

3 Answers

7

From a coherence perspective, I think your example is coherent. All processors believe that the write and read from A happened first, then the write and read from B happened later.

From a consistency standpoint you need to be more careful. (Consistency is the global ordering of memory operations to different addresses.) The way consistency is handled in practice is that the store buffer sits before retirement, not after. The store buffer acts as a little coherent cache for the speculative memory state. A store instruction is not permitted to retire until its processor has acquired write ownership of the appropriate cache line.

The store buffer needs to be aware of all the coherence traffic reaching the cache. If a speculative load gets its value from the store buffer, but the corresponding memory location is invalidated before the load retires, the load (and everything following it in the reorder-buffer) needs to be squashed and reexecuted.
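Here is a toy software model of that rule (the class and method names are invented for the sketch, and real hardware is far more involved): a store sits in the buffer until the core owns the line, and an invalidation squashes any loads that were forwarded from the buffer before they retired.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ToyCore {
    enum LineState { INVALID, SHARED, MODIFIED }   // heavily simplified MESI

    static class PendingStore {
        final long addr; final long value;
        PendingStore(long a, long v) { addr = a; value = v; }
    }

    LineState lineState = LineState.INVALID;
    final Deque<PendingStore> storeBuffer = new ArrayDeque<>();

    void executeStore(long addr, long value) {
        storeBuffer.addLast(new PendingStore(addr, value)); // speculative, not visible
    }

    boolean tryRetireOldestStore() {
        if (storeBuffer.isEmpty()) return false;
        if (lineState != LineState.MODIFIED) {
            requestOwnership();          // send read-for-ownership / upgrade request
            return false;                // the store cannot retire yet
        }
        PendingStore s = storeBuffer.removeFirst();
        writeToCache(s.addr, s.value);   // now globally visible
        return true;
    }

    // An invalidation from another core: any load that forwarded its value
    // from the store buffer but has not retired must be squashed and re-executed.
    void onInvalidate() {
        lineState = LineState.INVALID;
        squashUnretiredLoadsForwardedFromStoreBuffer();
    }

    void requestOwnership() { lineState = LineState.MODIFIED; } // stand-in for the protocol
    void writeToCache(long addr, long value) { /* omitted */ }
    void squashUnretiredLoadsForwardedFromStoreBuffer() { /* omitted */ }

    public static void main(String[] args) {
        ToyCore coreA = new ToyCore();
        coreA.executeStore(0x10, 1);                      // store enters the buffer
        System.out.println(coreA.tryRetireOldestStore()); // false: must gain ownership first
        System.out.println(coreA.tryRetireOldestStore()); // true: owned now, store retires
    }
}
```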

Wandering Logic
5

Situations like the one you describe are the reason why processor manuals for architectures with store buffers, such as Intel's, tend to state only that two stores by cores i and j are seen in the same order by all other cores (i.e., by observers other than i and j themselves, which may see their own stores early via store-to-load forwarding).

Common techniques for enforcing sequential consistency for certain memory locations include fence instructions and bus locking. How these techniques work exactly depends somewhat on the architecture. How they work in principle is described in modern textbooks such as Michael L. Scott's Shared-Memory Synchronization. Details for specific architectures are contained in the manufacturers' handbooks; for instance, Intel publishes 4,670-page software developer manuals that describe what they think they are doing. (These of course include much more, but I am not aware of any other comprehensive source.) Scientists who try to prove anything about the behaviour of their programs running on Intel's x86 or ARM nowadays are rather fond of the formalisations Peter Sewell et al. in Cambridge (UK) produced. These are still abstractions of what is really going on, but they do model store buffers and their effects, and they have been experimentally validated. For a start I'd recommend their Communications of the ACM 2010 paper; more can be found on his web page.
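As a concrete illustration, here is the classic store-buffering litmus test written in Java, with the fence expressed through VarHandle.fullFence(). This is only a sketch: the exact instruction the JVM emits for the fence is implementation-dependent, and the outcome commentary assumes a machine with store buffers.

```java
// With plain stores, each thread's store can sit in its store buffer while the
// subsequent load reads the old value of the other variable, so the outcome
// r1 == 0 && r2 == 0 is possible. A full fence between the store and the load
// on both sides is intended to rule that outcome out.
import java.lang.invoke.VarHandle;

public class StoreBufferLitmus {
    static int x = 0, y = 0;
    static int r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            x = 1;
            VarHandle.fullFence();  // order the store before the following load
            r1 = y;
        });
        Thread t2 = new Thread(() -> {
            y = 1;
            VarHandle.fullFence();
            r2 = x;
        });
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Without the fences, a run on a store-buffered machine may print
        // r1=0 r2=0; with them, at least one thread is expected to see the
        // other's store.
        System.out.println("r1=" + r1 + " r2=" + r2);
    }
}
```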

D.W.
Kai
2

The short answer is this: Store buffers make the memory system fast and inconsistent.

Consider a file system that always does the right thing: a write to a file is immediately visible to all other clients, and concurrent writes from multiple clients are handled correctly. Everyone sees the writes in one serialized order. It is a strongly consistent file system.

Now imagine that the programmer creates a convenience library that buffers all writes, and periodically delivers them to the file system. This works just fine if the file is not shared. There is no inconsistency if there is no other viewer. But it is clearly a problem if the file is shared, and it is not the file system's fault. Adding buffering messed it up.

If you treat the convenience library as the interface between the file system and your program (in effect, its API is the file system as far as you are concerned), then all you see is an inconsistent file system. On the other hand, if you know about the buffering, and if the library provides a sync() call that waits until the buffers are drained, then a write(X); sync() combination on one client and a write(Y); sync() combination on another client is equivalent to having made the calls directly to the underlying file system. We are back to regaining consistency, at the expense of waiting.

In your question, the cache coherent memory is the file system, the store buffer is the user-side buffer, and the sync() call is an mfence or equivalent "barrier" instruction.
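Here is a rough Java sketch of that analogy (the file-system classes, method names, and paths below are all invented for illustration): the buffered client forwards reads from its own pending writes, and sync() plays the role of the fence.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class StronglyConsistentFs {                  // the "real" file system
    private final Map<String, String> files = new ConcurrentHashMap<>();
    synchronized void write(String path, String data) { files.put(path, data); }
    synchronized String read(String path) { return files.get(path); }
}

class BufferedClient {                         // the convenience library
    private record Pending(String path, String data) {}
    private final StronglyConsistentFs fs;
    private final Deque<Pending> buffer = new ArrayDeque<>();

    BufferedClient(StronglyConsistentFs fs) { this.fs = fs; }

    void write(String path, String data) {
        buffer.addLast(new Pending(path, data));   // not yet visible to others
    }

    String read(String path) {
        // Like store-to-load forwarding: prefer our own most recent buffered write.
        for (var it = buffer.descendingIterator(); it.hasNext(); ) {
            Pending p = it.next();
            if (p.path().equals(path)) return p.data();
        }
        return fs.read(path);
    }

    void sync() {                               // the mfence of the analogy
        while (!buffer.isEmpty()) {
            Pending p = buffer.removeFirst();
            fs.write(p.path(), p.data());
        }
    }

    public static void main(String[] args) {
        StronglyConsistentFs fs = new StronglyConsistentFs();
        BufferedClient a = new BufferedClient(fs), b = new BufferedClient(fs);
        a.write("/x", "from A");          // only A sees this until it syncs
        System.out.println(b.read("/x")); // null: A's write is still buffered
        a.sync();                         // the "fence": drain the buffer
        System.out.println(b.read("/x")); // "from A": now globally visible
    }
}
```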

The cache coherence protocol (MESI) is strongly consistent (linearizable), but requires coordination, which costs performance. We know that most memory (95% or more) is never shared with another process, so why make the CPU wait unnecessarily on every access? Hence store buffering, with no fences required by default.

However, for those cases where strong consistency is absolutely imperative (updates to shared memory addresses), the program must issue a fence instruction. In higher-level languages like C++ and Java, there are constructs and qualifiers (synchronized, volatile, final) that introduce these fences at the right moments. But if you mistakenly omit one of these facilities, prepare to spend a long time figuring out what happened. This is what makes shared-memory concurrent programming so error-prone. And this is why Java programmers should know about store buffers.
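For example, a minimal Java sketch of the volatile case (class and field names invented for illustration): the volatile store/load pair orders the writer's earlier plain store before the reader's load, with the JVM inserting whatever barriers the hardware needs.

```java
public class VolatilePublication {
    static int data = 0;
    static volatile boolean ready = false;   // volatile introduces the needed fences

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            data = 42;        // ordinary store
            ready = true;     // volatile store: release point for the data above
        });
        Thread reader = new Thread(() -> {
            while (!ready) { /* spin until the volatile store becomes visible */ }
            System.out.println("data = " + data);  // guaranteed to print 42
        });
        writer.start(); reader.start();
        writer.join();  reader.join();
    }
}
```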