[date: 2013-10-04]
I can't seem to keep the details of memory ordering and memory barriers fresh in my mind, so this time I am taking notes.
Memory Ordering in Modern Microprocessors by Paul E. McKenney
http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt
Processors used to be slow, in-order, and single-core. Now they are fast, out-of-order, and multi-core.
It's the out-of-order memory accesses, combined with multi-core, that really cause the problems. Where do these out-of-order memory accesses come from?
They come from two places: the compiler may reorder memory accesses at compile time, and the processor may reorder them at run time. Hence the distinction between a compiler memory barrier (nothing emitted into the code; it only constrains the compiler) and a hardware memory barrier (an actual fence instruction).
Of course in practice, macros which emit hardware memory barriers also will emit a compiler memory barrier.
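For example, a minimal GCC-flavored sketch (the macro names here are mine, not from any standard header):

```
/* Compiler barrier: emits no instruction, but forbids the compiler
 * from caching values in registers or reordering accesses across it. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hardware barrier: emits a real fence instruction (MFENCE on x86,
 * DMB on ARM) and, being a GCC builtin, acts as a compiler barrier too. */
#define hardware_barrier() __sync_synchronize()

int data, flag;

void producer(void)
{
	data = 42;
	hardware_barrier();  /* order the data store before the flag store */
	flag = 1;
}
```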
Barriers come in several flavors: read (load) barriers, write (store) barriers, full barriers, and the acquire/release pair below.
release barrier: All previous memory stores are globally visible, and all previous memory loads have been satisfied, but following memory reads are not prevented from being speculated to before the barrier.
acquire barrier: References after the builtin cannot move to (or be speculated to) before the builtin, but previous memory stores may not be globally visible yet, and previous memory loads may not yet be satisfied.
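The classic use is message passing: pair a release store of a flag with an acquire load of the same flag. A sketch with C11 atomics (the function and variable names are illustrative):

```
#include <stdatomic.h>

int payload;            /* plain, non-atomic data */
atomic_int ready = 0;

void writer(void)
{
	payload = 42;
	/* release: the payload store cannot be reordered after this */
	atomic_store_explicit(&ready, 1, memory_order_release);
}

void reader(void)
{
	/* acquire: later loads cannot be reordered before this */
	while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
		;	/* spin */
	/* guaranteed to observe payload == 42 here */
}
```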
Locking primitives typically act as some sort of memory barrier.
The Linux kernel clearly documents that lock/unlock critical sections are not fully fenced. Loosely speaking, memory accesses outside can "leak" into the critical section, but accesses inside cannot leak out. See [3].
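Sketched in kernel terms (spin_lock/spin_unlock are the real kernel APIs; A, B, and C are illustrative globals):

```
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(lock);
static int A, B, C;

void example(void)
{
	A = 1;               /* may sink INTO the critical section */
	spin_lock(&lock);    /* acquire semantics */
	B = 1;               /* may not move out, in either direction */
	spin_unlock(&lock);  /* release semantics */
	C = 1;               /* may hoist INTO the critical section */
}
```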
I used to think that using pthreads in userspace was simple and safe. My understanding (assumption?) was that each pthread mutex call was, effectively, a full memory barrier. That is, memory accesses before the lock were completed strictly before; memory accesses between the lock and unlock were completed between, and memory accesses after the unlock remained after.
It's not clear that's a safe assumption. Quoting [2]:
It is clearly not generally acceptable to move memory operations out of critical sections, i.e. regions in which a lock is held; doing so would introduce a data race. But it is far less clear whether it is acceptable to move memory operations into critical sections, which is the question we pursue here.
...
...a memory operation immediately following a pthread_mutex_unlock operation may always be moved to just before the pthread_mutex_unlock. But the corresponding reordering of a pthread_mutex_lock with a preceding memory operation is not generally safe. More general movement into critical sections is possible in the absence of the pthread_mutex_trylock call.
As we point out in section 3, many current implementations either do not follow these rules, or add extra fences. Nor is it completely apparent that they should, since these semantics appear to have been largely accidental.
To me, this sounds like the only safe, portable expectation for pthreads mutexes is to treat them like Linux kernel locks: the enclosed memory references cannot leak out, but other memory references can leak in.
In fact, Table 1 of [2] shows this through code inspection.
To state it clearly: *pthreads locks are not full memory barriers*.
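Concretely, code like the following can be reordered more than you might expect (a sketch; the variable names are mine):

```
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
int before, inside, after;

void f(void)
{
	before = 1;             /* may sink into the critical section */
	pthread_mutex_lock(&m);
	inside = 1;             /* stays between lock and unlock */
	pthread_mutex_unlock(&m);
	after = 1;              /* may hoist into the critical section:
	                         * unlock is NOT a full fence */
}
```

If two accesses genuinely need a full fence between them, issue one explicitly (e.g. __sync_synchronize(), or C11 atomic_thread_fence(memory_order_seq_cst)) rather than relying on the mutex.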
```
#include <linux/atomic.h>
#include <linux/slab.h>

struct T {
	atomic_t refs;
};

struct T *alloc_t(void)
{
	struct T *t = kmalloc(sizeof(struct T), GFP_KERNEL);

	atomic_set(&t->refs, 1);	/* new object starts with one reference */
	return t;
}

static void free_t(struct T *t)
{
	kfree(t);
}

void get_t(struct T *t)
{
	atomic_inc(&t->refs);		/* take a reference */
}

void put_t(struct T *t)
{
	if (atomic_dec_and_test(&t->refs))	/* drop a reference */
		free_t(t);			/* last one out frees */
}
```
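This example ties back to barriers: per memory-barriers.txt, kernel atomic operations that return a value, like atomic_dec_and_test(), imply full memory barriers on either side, so every CPU's accesses to the object are ordered before the final free. Plain atomic_inc() implies no barrier at all.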