Multiprocessor & SIMD
The final week covers parallel architectures: Flynn's taxonomy (SISD, SIMD, MISD, MIMD), SIMD/vector processing, shared-memory multiprocessors, and the cache coherence problem (MSI, MESI protocols). We revisit Amdahl's Law for parallel computing.
Learning Objectives
Key Concepts
Flynn's Taxonomy & SIMD
Flynn's taxonomy classifies architectures by instruction and data streams:
- SISD: single instruction, single data (traditional uniprocessor)
- SIMD: single instruction, multiple data (vector processors, GPU cores)
- MISD: multiple instruction, single data (rare, fault-tolerant systems)
- MIMD: multiple instruction, multiple data (multiprocessors, multi-core)
SIMD applies the same operation to multiple data elements simultaneously — ideal for array operations, graphics, and scientific computing.
- SIMD examples: SSE, AVX (x86), NEON (ARM), GPU shader cores
- SIMD works best when the same operation applies to many data elements
- Vector length determines how many elements are processed per instruction
- Data must be aligned and contiguous for efficient SIMD access (see the intrinsics sketch below)
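As a concrete sketch (not part of the original notes), the loop below adds two float arrays with x86 AVX intrinsics, processing 8 single-precision elements per instruction. The function name, the assumption that n is a multiple of 8, and the 32-byte alignment are illustrative choices for the example.

```c
#include <immintrin.h>  /* AVX intrinsics (x86) */

/* Add two float arrays with AVX: 8 single-precision adds per instruction.
   Assumes n is a multiple of 8 and all pointers are 32-byte aligned. */
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);   /* load 8 floats */
        __m256 vb = _mm256_load_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);   /* one instruction, 8 adds */
        _mm256_store_ps(&c[i], vc);
    }
}
```

Compilers can often auto-vectorize a loop like this, but writing the intrinsics makes the SIMD width and alignment requirements explicit.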
Cache Coherence
In a shared-memory multiprocessor, each core has its own cache. When one core modifies a shared variable, other cores' caches may hold stale copies. Cache coherence protocols ensure all caches see a consistent view of memory.
MSI protocol states: Modified (dirty, exclusive), Shared (clean, may be in other caches), Invalid. MESI protocol adds Exclusive (clean, only copy) to reduce bus traffic.
- Coherence problem: multiple caches holding copies of the same memory location
- MSI: 3 states (Modified, Shared, Invalid)
- MESI: 4 states (Modified, Exclusive, Shared, Invalid)
- Snooping: each cache monitors (snoops) the bus for accesses to its cached addresses
- Write invalidation: on write, invalidate all other copies before modifying (see the transition sketch below)
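The following is a minimal sketch of how one cache line's MESI state changes under a snooping, write-invalidate protocol. The enum and event names are illustrative, and bus transactions and writebacks are only noted in comments.

```c
/* MESI states for a single cache line (illustrative sketch). */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

/* Events: requests from the local core, and traffic snooped from other cores. */
typedef enum {
    LOCAL_READ, LOCAL_WRITE,   /* this core's accesses */
    BUS_READ,   BUS_WRITE      /* another core's accesses, seen on the bus */
} mesi_event;

/* Next-state function for one line; other_copies says whether a local read
   miss found the block in another cache (Shared) or not (Exclusive). */
mesi_state mesi_next(mesi_state s, mesi_event e, int other_copies)
{
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                   /* read miss */
            return other_copies ? SHARED : EXCLUSIVE;
        return s;                           /* M, E, S: read hit, no change */
    case LOCAL_WRITE:
        return MODIFIED;                    /* from S or I: invalidate other copies first */
    case BUS_READ:
        if (s == MODIFIED || s == EXCLUSIVE)
            return SHARED;                  /* Modified also writes data back */
        return s;
    case BUS_WRITE:
        return INVALID;                     /* another core is claiming ownership */
    }
    return s;
}
```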
Parallel Performance
Amdahl's Law for P processors: Speedup = 1 / ((1-f) + f/P), where f is the parallelizable fraction. As P→∞, speedup is limited to 1/(1-f).
Gustafson's Law: assumes the problem size scales with the number of processors. Speedup = P − α(P−1), where α is the serial fraction.
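As a quick numeric sketch (the 95% parallel fraction and P = 8 are made-up illustrative values, and α is taken as 1−f for simplicity), the program below evaluates both formulas:

```c
#include <stdio.h>

/* Amdahl:    speedup = 1 / ((1-f) + f/P),  f = parallelizable fraction */
/* Gustafson: speedup = P - alpha*(P-1),    alpha = serial fraction     */
int main(void)
{
    double f = 0.95;         /* illustrative: 95% of the work parallelizes */
    double alpha = 1.0 - f;  /* serial fraction */
    int P = 8;

    double amdahl    = 1.0 / ((1.0 - f) + f / P);
    double gustafson = P - alpha * (P - 1);

    printf("Amdahl (P=%d):    %.2f\n", P, amdahl);              /* ~5.93 */
    printf("Gustafson (P=%d): %.2f\n", P, gustafson);           /* 7.65  */
    printf("Amdahl limit as P->inf: %.1f\n", 1.0 / (1.0 - f));  /* 20.0  */
    return 0;
}
```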
- Amdahl: the sequential fraction limits parallel speedup
- Gustafson: more processors can handle larger problems
- Synchronization overhead (locks, barriers) reduces effective parallelism
- False sharing: different variables in the same cache line cause unnecessary coherence traffic (see the padding sketch below)
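A minimal sketch of false sharing (assuming a typical 64-byte cache line; the struct and field names are illustrative): two per-thread counters that land in the same line force coherence traffic on every update, while padding each counter to its own line avoids it.

```c
#include <stdalign.h>
#include <stdatomic.h>

/* Unpadded: both counters share one 64-byte cache line, so every increment
   by one core invalidates the other core's copy of the line, even though
   the threads never touch each other's variable. */
struct counters_bad {
    atomic_long a;   /* updated by thread 0 */
    atomic_long b;   /* updated by thread 1 -- same cache line as a */
};

/* Padded/aligned: each counter occupies its own 64-byte line
   (64 bytes is a common line size; check the actual value on your target). */
struct counters_good {
    alignas(64) atomic_long a;
    alignas(64) atomic_long b;
};
```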