











**Multiprocessor System** 



















Snooping caches assume

- · write-through caches
- · cheap "broadcast" to all CPUs
- Many alternative cache coherency protocols
- They improve performance by tackling above assumptions
- · We'll examine MESI (four state)
- Optimisations exist (MOESI, MESIF)
- · 'Memory bus' becomes message passing system between caches

# Example Coherence Protocol MESI Each cache line is in one of four states Invalid (I) • This state indicates that the addressed line is not resident in the cache and/or any data contained is considered not useful. Exclusive (E) • The data in this line is consistent with system memory. The data in this line is consistent with system memory. Shared (S) • The addressed line is valid in the cache and in at least one other cache. • A shared line is always consistent with system memory. That is, the shared state is shared-uncodified; there is no shared-modified state. Modified (M) • The line is valid in the cache and in only this cache. • The line is valid in the cache and in only this cache. • The line is walid in the cache and in only this cache. • The line is envilled make to memory.—that is, the modified data in the line has not been written back to memory.

14

13





16











store 1, X

store 1, Y load r2, X

load r2, Y

store 1, Y

store 1, X load r2, Y

load r2, X

UNSW

X=1,Y=1

X=1, Y=1

20



21



**Potential interleavings** 

























33



MP Hardware Take Away

Each core/cpu sees sequential execution of own code

- Other cores see execution affected by
- Store order and write buffers
- Cache coherence model
- Out-of-order execution

Systems software needs to understand:

- Specific system (cache, coherence, etc..)
- Synch mechanisms (barriers, test\_n\_set, load\_linked store\_cond).
- ...to build cooperative, correct, and scalable parallel code

34



36

MP Hardware Take Away

Existing sync primitives (e.g. locks) will have appropriate fences/barriers in place

- In practice, correctly synchronised code can ignore memory model. However, racey code, i.e. code that updates shared memory
- outside a lock (e.g. lock free algorithms) must use fences/barriers.
- · You need a detailed understanding of the memory coherence model.
- Not easy, especially for partial store order (ARM).



UNSW



### **Kernel Locking**

Several CPUs can be executing kernel code concurrently.

Need mutual exclusion on shared kernel data. Issues:

UNSW

- Lock implementation
- · Granularity of locking (i.e. parallelism)



38



39

Spin locks.

| <pre>void lock (volatile lock_t *1) {</pre>   |
|-----------------------------------------------|
| <pre>while (test_and_set(l)) ;</pre>          |
| }                                             |
| <pre>void unlock (volatile lock_t *1) {</pre> |
| *1 = 0;                                       |
| }                                             |
| Busy waits. Good idea?                        |
|                                               |
|                                               |
|                                               |
|                                               |
|                                               |



🗿 UNSW

### Spinning versus Switching

- · Blocking and switching
  - to another process takes time
    - » Save context and restore another
       » Cache contains current process not new
    - Adjusting the cache working set also takes time
  - » TLB is similar to cache
  - Switching back when the lock is free encounters the same again
- · Spinning wastes CPU time directly
- Trade off
- If lock is held for less time than the overhead of switching to and back

Assume no local contention by design, is disabling interrupt

Hint: What happens if a lock holder is preempted (e.g., at

All other processors spin until the lock holder is re-scheduled

 $\Rightarrow$  It's more efficient to spin

Interrupt Disabling

end of its timeslice)?

important?

### 

44

43

45

## Alternative to spinning: Conditional Lock (TryLock) bool cond\_lock (volatile lock t \*1) { if (test\_and\_set(1)) return FALSE; //couldn't lock else return TRUE; //acquired lock } Can do useful work if fail to acquire lock. But may not have much else to do. Livelock: May never get lock!

Spinning versus Switching

The general approaches taken are

· Spin for some period of time, if the lock is not acquired,

Based on previous observations of the lock

» Fixed (related to the switch overhead)

Spin forever

block and switch

» Dynamic

- The spin time can be

acquisition time

46

UNSW







UNSW







70 60 50 Elapsed time (sec.) 40 spin test&set 30 spin on read 20 10 0 17 9 13 number of processors 🗿 UNSW

# Benchmark

for i = 1 .. 1,000,000 {
 lock(1)

- crit\_section()
- unlock() compute()
- compute

Compute chosen from uniform random distribution of mean 5 times critical section

Measure elapsed time on Sequent Symmetry (20 CPU 30386, coherent write-back invalidate caches)

52

### Results

Test and set performs poorly once there is enough CPUs to cause contention for lock

- Expected
- Test and Test and Set performs better
- · Performance less than expected
- Still significant contention on lock when CPUs notice release and all attempt acquisition

Critical section performance degenerates

- Critical section requires bus traffic to modify shared structure
  Lock holder competes with CPU that missed as they test and set

  lock holder is slower
- Slower lock holder results in more contention



# **Examining Inserting Delays** TABLE III LAY AFTER SPINNER NOTICES RELEASED LOCK BUSY of Tes Set (Lock) = BUSY) while (lock = BUSY or 1 begin while (lock = BUSY) ; Delay (); end; TABLE IV DELAY BETWEEN EACH REFERENCE while (lock = BUSY or TestAndSet (lock) = BUSY) Delay (): Lock

56

### **Queue Based Locking**

Each processor inserts itself into a waiting queue

- · It waits for the lock to free by spinning on its own separate cache line
- · Lock holder frees the lock by "freeing" the next processors cache line.



57

### Results

- Static backoff has higher overhead when backoff is inappropriate
- Dynamic backoff has higher overheads when static delay is appropriate
- · as collisions are still required to tune the backoff time
- Queue is better when contention occurs, but has higher overhead when it does not.
- · Issue: Preemption of queued CPU blocks rest of queue (worse than simple spin locks)











63







Compared

- · test and test and set
- · Anderson's array based queue
- · test and set with exponential back-off
- MCS













Can we dynamically switch locking methods to suit the current contention level???



### **Protocol Selection**

### Keep a "hint"

- Ensure both TTS and MCS lock a never free at the same time
- Only correct selection will get the lock
- Choosing the wrong lock with result in retry which can get it right next time

Mode

Queue Lock

Test&Test&Set Loc

- Assumption: Lock mode changes infrequently - hint cached read-only
- infrequent protocol mismatch retries

74



75

