Multiprocessor & SIMD
The final week covers parallel architectures: Flynn's taxonomy (SISD, SIMD, MISD, MIMD), SIMD/vector processing, shared-memory multiprocessors, and the cache coherence problem (MSI, MESI protocols). We revisit Amdahl's Law for parallel computing.
Learning Objectives
Key Concepts
Flynn's Taxonomy & SIMD
Flynn's taxonomy classifies architectures by instruction and data streams:
- SISD: single instruction, single data (traditional uniprocessor)
- SIMD: single instruction, multiple data (vector processors, GPU cores)
- MISD: multiple instruction, single data (rare, fault-tolerant systems)
- MIMD: multiple instruction, multiple data (multiprocessors, multi-core)
SIMD applies the same operation to multiple data elements simultaneously — ideal for array operations, graphics, and scientific computing.
- SIMD examples: SSE, AVX (x86), NEON (ARM), GPU shader cores
- SIMD works best when the same operation applies to many data elements
- Vector length determines how many elements are processed per instruction
- Data must be aligned and contiguous for efficient SIMD access (see the intrinsics sketch below)
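As a concrete sketch (not part of the original notes), the loop below adds two float arrays with x86 AVX intrinsics, processing 8 single-precision elements per instruction. The function name, the assumption that n is a multiple of 8, and the 32-byte alignment are illustrative choices for the example.

```c
#include <immintrin.h>  /* AVX intrinsics (x86) */

/* Add two float arrays with AVX: 8 single-precision adds per instruction.
   Assumes n is a multiple of 8 and all pointers are 32-byte aligned. */
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);   /* load 8 floats */
        __m256 vb = _mm256_load_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);   /* one instruction, 8 adds */
        _mm256_store_ps(&c[i], vc);
    }
}
```

Compilers can often auto-vectorize a loop like this, but writing the intrinsics makes the SIMD width and alignment requirements explicit.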
Cache Coherence
In a shared-memory multiprocessor, each core has its own cache. When one core modifies a shared variable, other cores' caches may hold stale copies. Cache coherence protocols ensure all caches see a consistent view of memory.
MSI protocol states: Modified (dirty, exclusive), Shared (clean, may be in other caches), Invalid. MESI protocol adds Exclusive (clean, only copy) to reduce bus traffic.
- Coherence problem: multiple caches holding copies of the same memory location
- MSI: 3 states (Modified, Shared, Invalid)
- MESI: 4 states (Modified, Exclusive, Shared, Invalid)
- Snooping: each cache monitors (snoops) the bus for accesses to its cached addresses
- Write invalidation: on write, invalidate all other copies before modifying (see the transition sketch below)
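The following is a minimal sketch of how one cache line's MESI state changes under a snooping, write-invalidate protocol. The enum and event names are illustrative, and bus transactions and writebacks are only noted in comments.

```c
/* MESI states for a single cache line (illustrative sketch). */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

/* Events: requests from the local core, and traffic snooped from other cores. */
typedef enum {
    LOCAL_READ, LOCAL_WRITE,   /* this core's accesses */
    BUS_READ,   BUS_WRITE      /* another core's accesses, seen on the bus */
} mesi_event;

/* Next-state function for one line; other_copies says whether a local read
   miss found the block in another cache (Shared) or not (Exclusive). */
mesi_state mesi_next(mesi_state s, mesi_event e, int other_copies)
{
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                   /* read miss */
            return other_copies ? SHARED : EXCLUSIVE;
        return s;                           /* M, E, S: read hit, no change */
    case LOCAL_WRITE:
        return MODIFIED;                    /* from S or I: invalidate other copies first */
    case BUS_READ:
        if (s == MODIFIED || s == EXCLUSIVE)
            return SHARED;                  /* Modified also writes data back */
        return s;
    case BUS_WRITE:
        return INVALID;                     /* another core is claiming ownership */
    }
    return s;
}
```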
Parallel Performance
Amdahl's Law for P processors: Speedup = 1 / ((1-f) + f/P), where f is the parallelizable fraction. As P→∞, speedup is limited to 1/(1-f).
Gustafson's Law: assumes the problem size scales with the number of processors. Speedup = P − α(P−1), where α is the serial fraction.
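As a quick numeric sketch (the 95% parallel fraction and P = 8 are made-up illustrative values, and α is taken as 1−f for simplicity), the program below evaluates both formulas:

```c
#include <stdio.h>

/* Amdahl:    speedup = 1 / ((1-f) + f/P),  f = parallelizable fraction */
/* Gustafson: speedup = P - alpha*(P-1),    alpha = serial fraction     */
int main(void)
{
    double f = 0.95;         /* illustrative: 95% of the work parallelizes */
    double alpha = 1.0 - f;  /* serial fraction */
    int P = 8;

    double amdahl    = 1.0 / ((1.0 - f) + f / P);
    double gustafson = P - alpha * (P - 1);

    printf("Amdahl (P=%d):    %.2f\n", P, amdahl);              /* ~5.93 */
    printf("Gustafson (P=%d): %.2f\n", P, gustafson);           /* 7.65  */
    printf("Amdahl limit as P->inf: %.1f\n", 1.0 / (1.0 - f));  /* 20.0  */
    return 0;
}
```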
- Amdahl: the sequential fraction limits parallel speedup
- Gustafson: more processors can handle larger problems
- Synchronization overhead (locks, barriers) reduces effective parallelism
- False sharing: different variables in the same cache line cause unnecessary coherence traffic (see the padding sketch below)
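A minimal sketch of false sharing (assuming a typical 64-byte cache line; the struct and field names are illustrative): two per-thread counters that land in the same line force coherence traffic on every update, while padding each counter to its own line avoids it.

```c
#include <stdalign.h>
#include <stdatomic.h>

/* Unpadded: both counters share one 64-byte cache line, so every increment
   by one core invalidates the other core's copy of the line, even though
   the threads never touch each other's variable. */
struct counters_bad {
    atomic_long a;   /* updated by thread 0 */
    atomic_long b;   /* updated by thread 1 -- same cache line as a */
};

/* Padded/aligned: each counter occupies its own 64-byte line
   (64 bytes is a common line size; check the actual value on your target). */
struct counters_good {
    alignas(64) atomic_long a;
    alignas(64) atomic_long b;
};
```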