











**Multiprocessor System** 

















Snooping caches assume

- · write-through caches
- · cheap "broadcast" to all CPUs
- Many alternative cache coherency protocols
- They improve performance by tackling above assumptions
- · We'll examine MESI (four state)
- Optimisations exist (MOESI, MESIF)
- · 'Memory bus' becomes message passing system between caches

# Example Coherence Protocol MESI Each cache line is in one of four states Invalid (I) • This state indicates that the addressed line is not resident in the cache and/or any data contained is considered not useful. Exclusive (E) • The addressed line is in this cache only. • The addressed line is onsistent with system memory. Shared (S) • The addressed line is valid in the cache and in at least one other cache. • A shared line is valid in the cache and in at least one other cache. • A shared line is valid in the cache and in at least one other cache. • Modified (M) • The line is modified with respect to system memory—that is, the modified data in the line has not been written back to memory.

UNSW

14

13





16





































33



### **MP Hardware Take Away**

Each core/cpu sees sequential execution of own code

- Other cores see execution affected by
- Store order and write buffers
- Cache coherence model
- Out-of-order execution

Systems software needs to understand:

- Specific system (cache, coherence, etc..)
- Synch mechanisms (barriers, test\_n\_set, load\_linked store\_cond).
- ...to build cooperative, correct, and scalable parallel code

34



36

# MP Hardware Take Away

Existing sync primitives (e.g. locks) will have appropriate fences/barriers in place

- In practice, correctly synchronised code can ignore memory model. However, racey code, i.e. code that updates shared memory
- outside a lock (e.g. lock free algorithms) must use fences/barriers.
- You need a detailed understanding of the memory coherence model.
- Not easy, especially for partial store order (ARM).



UNSW



### **Kernel Locking**

Several CPUs can be executing kernel code concurrently.

Need mutual exclusion on shared kernel data. Issues:

UNSW

Lock implementation

38

· Granularity of locking (i.e. parallelism)





## Spin locks

Disabling interrupts (CLI - STI). · Insufficient for multiprocessor systems.

Lock objects (locks, semaphores).

· Manipulating lock requires mutual exclusion.

· Busy-waiting wastes cycles.

Spin locks.

```
void lock (volatile lock_t *1) {
  while (test_and_set(l)) ;
}
void unlock (volatile lock_t *1) {
  *1 = 0;
}
Busy waits. Good idea?
```



🗿 UNSW

### Spinning versus Switching

- · Blocking and switching
  - to another process takes time
    - Save context and restore another » Cache contains current process not new
    - Adjusting the cache working set also takes time
  - » TLB is similar to cache
  - Switching back when the lock is free encounters the same again
- · Spinning wastes CPU time directly
- Trade off
- · If lock is held for less time than the overhead of switching to and back

Assume no local contention by design, is disabling interrupt

Hint: What happens if a lock holder is preempted (e.g., at

All other processors spin until the lock holder is re-scheduled

 $\Rightarrow$  It's more efficient to spin

Interrupt Disabling

end of its timeslice)?

important?

### UNSW

44

43

### Alternative to spinning: Conditional Lock (TryLock) bool cond\_lock (volatile lock t \*1) { if (test\_and\_set(1)) return FALSE; //couldn't lock else return TRUE; //acquired lock } Can do useful work if fail to acquire lock. But may not have much else to do. Livelock: May never get lock!

45





Spinning versus Switching

The general approaches taken are

· Spin for some period of time, if the lock is not acquired,

Based on previous observations of the lock

UNSW

» Fixed (related to the switch overhead)

· Spin forever

block and switch

» Dynamic

- The spin time can be

acquisition time

46

