Computer Organization & Architecture

Unit 8: Introduction to Parallel Processing

From pipelining to multiprocessors โ€” master how modern CPUs execute billions of instructions per second through parallelism, pipelining hazards, and architectural innovations.

โฑ๏ธ 6 hrs theory + 4 hrs lab  |  ๐ŸŽฏ GATE ~2 marks  |  ๐Ÿ–ฅ๏ธ AWS 1 Crore Requests/sec

๐Ÿ’ผ Jobs this unlocks: CPU Design Engineer (โ‚น12โ€“25 LPA)  |  HPC Engineer (โ‚น10โ€“20 LPA)  |  VLSI Engineer (โ‚น8โ€“18 LPA)

Section A

Opening Hook โ€” 1 Crore Requests Per Second

๐Ÿข How Amazon AWS Handles 1 Crore Requests Every Second

When you click "Buy Now" on Amazon during the Great Indian Festival sale, your request is just one of 1 crore (10 million) requests processed every single second across AWS's global infrastructure. Behind this staggering scale lies the same principle you'll learn in this chapter โ€” parallel processing and pipelining.

Every modern CPU in AWS's data centres uses a pipelined architecture โ€” splitting instruction execution into stages so multiple instructions overlap like an assembly line. Their servers use multiprocessor systems with hundreds of cores executing tasks simultaneously. The Intel Xeon and AMD EPYC chips powering AWS use superscalar, out-of-order execution โ€” processing 4โ€“6 instructions per clock cycle.

What if YOU understood how this works? What if you could design pipelined processors, calculate speedups, and understand why your โ‚น50,000 laptop has 8 cores but your program only uses 1? That's exactly what this chapter teaches you.

๐Ÿ‡ฎ๐Ÿ‡ณ AWS India (Hyderabad)๐Ÿ‡ฎ๐Ÿ‡ณ Intel India (Bengaluru)๐Ÿ‡ฎ๐Ÿ‡ณ AMD India (Hyderabad)๐Ÿ‡ฎ๐Ÿ‡ณ Arm India (Bengaluru)๐Ÿ‡ฎ๐Ÿ‡ณ Qualcomm India๐Ÿ‡ฎ๐Ÿ‡ณ ISRO PARAM
India's PARAM Siddhi AI supercomputer at C-DAC Pune ranked 63rd in the world's Top500 supercomputers. It uses 42,000+ GPU cores running in parallel โ€” processing 5.27 petaflops (5.27 ร— 10ยนโต floating-point operations per second). That's the power of parallel processing!
Section B

Learning Outcomes โ€” Bloom's Taxonomy Mapped (12 Outcomes)

Bloom's LevelLearning Outcome
๐Ÿ”ต RememberList the 5 stages of a classic instruction pipeline (IF, ID, EX, MEM, WB) and define each stage's function
๐Ÿ”ต RememberState Flynn's four classifications (SISD, SIMD, MISD, MIMD) with one real-world example for each
๐ŸŸข UnderstandExplain pipeline hazards (structural, data, control) and how forwarding, stalling, and branch prediction resolve them
๐ŸŸข UnderstandDescribe the difference between shared-memory and distributed-memory multiprocessor organisations
๐ŸŸก ApplyCalculate pipeline speedup using S = nk/(k+nโˆ’1) and verify with worked numerical examples
๐ŸŸก ApplyApply Amdahl's Law to compute maximum speedup given fraction of parallelisable code and number of processors
๐ŸŸ  AnalyseDetect RAW, WAR, and WAW data hazards in a given instruction sequence and insert stalls/forwarding paths
๐ŸŸ  AnalyseCompare superscalar vs VLIW architectures on issue width, hardware complexity, and compiler dependency
๐Ÿ”ด EvaluateEvaluate the trade-offs between deeper pipelines (more stages) and increased hazard penalties in real processors
๐Ÿ”ด EvaluateAssess whether adding more processors is cost-effective using Amdahl's Law for a given workload
๐ŸŸฃ CreateDraw a complete space-time diagram for n instructions in a k-stage pipeline with hazard annotations
๐ŸŸฃ CreateDesign a parallel processing solution for a given real-world problem (e.g., image processing, web serving)
Section C

Concept Explanation โ€” Parallel Processing from Scratch

1. Pipelining โ€” The Assembly Line of the CPU

Plain English: Imagine a car factory. If one worker builds an entire car alone โ€” welding, painting, installing engine, fitting seats, quality check โ€” it takes 5 hours per car. But if you set up an assembly line with 5 stations, each doing one task, then after the initial setup, you get one finished car every hour โ€” even though each car still takes 5 hours total. That's pipelining.

Pipeline = Dosa making at Saravana Bhavan. While one dosa is served to the customer (WB), the next one is being cooked on the tawa (EX), the next has batter being spread (ID), and the next order is being taken (IF). Four dosas are in different stages simultaneously โ€” that's pipelining! Without pipelining, you'd wait for each dosa to be completely done before starting the next.

๐Ÿ”ง The 5-Stage Instruction Pipeline

Every instruction in a RISC processor passes through 5 stages:

StageAbbreviationWhat HappensHardware Used
1. Instruction FetchIFRead instruction from memory using PC (Program Counter)Instruction Memory, PC
2. Instruction DecodeIDDecode opcode, read register operands from register fileDecoder, Register File
3. ExecuteEXPerform ALU operation (add, subtract, compare, etc.)ALU
4. Memory AccessMEMRead from / write to data memory (for load/store instructions)Data Memory
5. Write BackWBWrite result back to the destination registerRegister File

Space-Time Diagram โ€” 7 Instructions in a 5-Stage Pipeline

The space-time diagram (also called a pipeline timing diagram) shows which stage each instruction occupies in each clock cycle:

ASCII โ€” Space-Time Diagram
         Clock Cycle โ†’
         CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10  CC11
        โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”
  I1    โ”‚ IF  โ”‚ ID  โ”‚ EX  โ”‚ MEM โ”‚ WB  โ”‚     โ”‚     โ”‚     โ”‚     โ”‚     โ”‚     โ”‚
        โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
  I2    โ”‚     โ”‚ IF  โ”‚ ID  โ”‚ EX  โ”‚ MEM โ”‚ WB  โ”‚     โ”‚     โ”‚     โ”‚     โ”‚     โ”‚
        โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
  I3    โ”‚     โ”‚     โ”‚ IF  โ”‚ ID  โ”‚ EX  โ”‚ MEM โ”‚ WB  โ”‚     โ”‚     โ”‚     โ”‚     โ”‚
        โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
  I4    โ”‚     โ”‚     โ”‚     โ”‚ IF  โ”‚ ID  โ”‚ EX  โ”‚ MEM โ”‚ WB  โ”‚     โ”‚     โ”‚     โ”‚
        โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
  I5    โ”‚     โ”‚     โ”‚     โ”‚     โ”‚ IF  โ”‚ ID  โ”‚ EX  โ”‚ MEM โ”‚ WB  โ”‚     โ”‚     โ”‚
        โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
  I6    โ”‚     โ”‚     โ”‚     โ”‚     โ”‚     โ”‚ IF  โ”‚ ID  โ”‚ EX  โ”‚ MEM โ”‚ WB  โ”‚     โ”‚
        โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
  I7    โ”‚     โ”‚     โ”‚     โ”‚     โ”‚     โ”‚     โ”‚ IF  โ”‚ ID  โ”‚ EX  โ”‚ MEM โ”‚ WB  โ”‚
        โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜

  Without pipeline: 7 instructions ร— 5 cycles = 35 cycles
  With pipeline:    k + (nโˆ’1) = 5 + 6 = 11 cycles
  Speedup:          35 / 11 โ‰ˆ 3.18ร—
Key insight from the diagram: In CC5, all 5 pipeline stages are busy โ€” I1 is in WB, I2 in MEM, I3 in EX, I4 in ID, I5 in IF. This is maximum pipeline utilisation. The pipeline is "full." Before CC5, the pipeline is "filling up." This filling phase is why we don't get a perfect 5ร— speedup.

2. Pipeline Speedup โ€” The Math Behind the Magic

๐Ÿ“ Pipeline Speedup Formula

Variables:

โ€ข n = number of instructions

โ€ข k = number of pipeline stages

โ€ข Without pipeline: Total time = n ร— k cycles (each instruction takes k cycles, executed sequentially)

โ€ข With pipeline: Total time = k + (n โˆ’ 1) cycles (first instruction takes k cycles, then one instruction completes every cycle)

Speedup Formula:

Formula
              n ร— k
  Speedup = โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
             k + n โˆ’ 1

  As n โ†’ โˆž,  Speedup โ†’ k  (ideal speedup = number of stages)

  Pipeline Efficiency = Speedup / k = n / (k + n โˆ’ 1)

Worked Example: 5-Stage Pipeline, 100 Instructions

Numerical
  Given: k = 5 stages, n = 100 instructions

  Without pipeline = n ร— k = 100 ร— 5 = 500 clock cycles
  With pipeline     = k + (n โˆ’ 1) = 5 + 99 = 104 clock cycles

  Speedup = 500 / 104 = 4.81ร—

  Efficiency = Speedup / k = 4.81 / 5 = 0.962 = 96.2%

  โœ… With 100 instructions, we achieve 96.2% of ideal speedup!
  โœ… As n โ†’ โˆž, Speedup โ†’ 5 (= k), Efficiency โ†’ 100%
Students often write Speedup = k (number of stages). That's only the ideal (asymptotic) case when n โ†’ โˆž. For GATE numericals, always use the exact formula: Speedup = nk / (k + n โˆ’ 1). The difference matters when n is small!

3. Pipeline Hazards โ€” When the Assembly Line Stalls

Pipelining doesn't always work perfectly. Three types of hazards can break the smooth flow:

3.1 Structural Hazards โ€” Resource Conflicts

What: Two instructions need the same hardware resource at the same time. Example: both IF and MEM stages need to access memory simultaneously, but there's only one memory port.

ASCII โ€” Structural Hazard
         CC1   CC2   CC3   CC4   CC5
        โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”
  I1    โ”‚ IF  โ”‚ ID  โ”‚ EX  โ”‚ MEM โ”‚ WB  โ”‚  โ† I1 reads data memory
        โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
  I2    โ”‚     โ”‚ IF  โ”‚ ID  โ”‚ EX  โ”‚ MEM โ”‚
        โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
  I3    โ”‚     โ”‚     โ”‚ IF  โ”‚ ID  โ”‚ EX  โ”‚
        โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
  I4    โ”‚     โ”‚     โ”‚     โ”‚ โ–ˆโ–ˆ  โ”‚ IF  โ”‚  โ† I4 can't fetch! Memory busy!
        โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜
                          โ†‘ CONFLICT: I1 uses MEM, I4 needs IF
                            (both need memory port)

  โ–ˆโ–ˆ = STALL (bubble inserted)

  Solution: Separate instruction memory and data memory
            (Harvard Architecture) โ€” most modern CPUs do this!

3.2 Data Hazards โ€” Register Dependencies

What: An instruction depends on the result of a previous instruction that hasn't finished yet. Three types:

TypeFull NamePatternExampleSeverity
RAWRead After WriteI2 reads a register that I1 writesI1: ADD R1, R2, R3
I2: SUB R4, R1, R5
Most common, most dangerous
WARWrite After ReadI2 writes a register that I1 readsI1: ADD R1, R2, R3
I2: SUB R2, R4, R5
Rare in simple pipelines
WAWWrite After WriteI2 writes same register as I1I1: ADD R1, R2, R3
I2: SUB R1, R4, R5
Only in multi-issue / OoO
RAW Hazard Detailed Example:
Assembly
  I1:  ADD  R1, R2, R3     ; R1 โ† R2 + R3  (R1 written in WB = CC5)
  I2:  SUB  R4, R1, R5     ; R4 โ† R1 โˆ’ R5  (R1 read in ID  = CC3)

  ; Problem: I2 reads R1 in CC3, but I1 writes R1 in CC5!
  ; I2 gets the OLD (stale) value of R1 โ€” WRONG RESULT!

         CC1   CC2   CC3   CC4   CC5
  I1:    IF    ID    EX    MEM   WB  โ†โ”€โ”€ R1 written HERE
  I2:          IF    ID    EX    MEM
                     โ†‘
                R1 read HERE (but R1 not yet written!)
Solutions to Data Hazards:
SolutionHow It WorksPerformance Impact
Stalling (Bubbles)Insert NOP cycles until the result is ready2 stall cycles per RAW hazard
Data ForwardingPass result directly from EX/MEM output to the next instruction's input โ€” bypassing the register file0 or 1 stall cycles
Compiler ReorderingCompiler rearranges instructions to fill delay slots with independent instructions0 stalls (if possible)
ASCII โ€” Forwarding Path
         CC1   CC2   CC3   CC4   CC5   CC6
  I1:    IF    ID    EX    MEM   WB
                     โ”‚
                     โ””โ”€โ”€โ”€โ”€ FORWARD โ”€โ”€โ”€โ”€โ†’ I2 gets R1 here!
  I2:          IF    ID    EX    MEM   WB

  ; With forwarding: R1 result available at end of EX (CC3)
  ; Forwarded directly to I2's EX stage input in CC4
  ; No stall needed! (for ALU โ†’ ALU dependency)

3.3 Control Hazards โ€” Branch Misprediction

What: When a branch instruction (BEQ, BNE, JUMP) changes the program flow, instructions fetched after the branch may be wrong.

ASCII โ€” Control Hazard
         CC1   CC2   CC3   CC4   CC5   CC6   CC7
  BEQ:   IF    ID    EX    MEM   WB
                     โ†‘
              Branch outcome known HERE (CC3)
  I_next:      IF    ID โ†โ”€โ”€ WRONG! Should have fetched
  I_next+1:         IF      from branch target!

  Branch Penalty = 2 cycles (instructions fetched in CC2, CC3 are wrong)
                   These must be FLUSHED (discarded).
Branch Penalty Calculation:
Formula
  Effective CPI = Ideal CPI + Branch frequency ร— Penalty ร— Misprediction rate

  Example:
    Ideal CPI     = 1
    Branch freq.  = 20% (1 in 5 instructions is a branch)
    Penalty       = 2 cycles
    Mispredict    = 30% (branch predictor is 70% accurate)

    Effective CPI = 1 + 0.20 ร— 2 ร— 0.30 = 1 + 0.12 = 1.12

    Performance loss = (1.12 โˆ’ 1) / 1 = 12% slower than ideal
Solutions to Control Hazards:
TechniqueHowAccuracy
Predict Not TakenAlways assume branch is NOT taken; flush if wrong~50โ€“60%
Predict TakenAlways assume branch IS taken~60โ€“70%
1-bit PredictorRemember last outcome (taken/not-taken)~80โ€“85%
2-bit PredictorUse a saturating counter (4 states); must mispredict twice to switch~90โ€“93%
Branch Target BufferCache of recent branch targets for instant redirection~95%+
Modern Intel Core i9 processors use a combination of techniques โ€” a multi-level branch predictor called TAGE (TAgged GEometric predictor) that achieves >96% accuracy. A single misprediction on a 20-stage pipeline wastes ~20 cycles โ€” that's why prediction accuracy is critical!

4. Flynn's Classification โ€” Categorising Parallel Architectures

In 1966, Michael Flynn proposed a classification of computer architectures based on the number of instruction streams and data streams processed simultaneously.

ASCII โ€” Flynn's Classification Grid
                         Data Stream
                    Single (SD)     Multiple (MD)
                 โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  Instruction    โ”‚              โ”‚                  โ”‚
  Stream    SI   โ”‚    SISD      โ”‚      SIMD        โ”‚
  Single         โ”‚  (Classic    โ”‚  (Vector/GPU     โ”‚
                 โ”‚   Von Neumann)โ”‚  Array Proc.)   โ”‚
                 โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
  Instruction    โ”‚              โ”‚                  โ”‚
  Stream    MI   โ”‚    MISD      โ”‚      MIMD        โ”‚
  Multiple       โ”‚  (Rareโ€”      โ”‚  (Multi-core     โ”‚
                 โ”‚   Systolic)  โ”‚   Servers)       โ”‚
                 โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
TypeInstruction StreamsData StreamsDescriptionExamples
SISD11Traditional single-core CPU. One instruction at a time on one data item.Old Intel 8086, classic Von Neumann machines
SIMD1MultipleOne instruction operates on many data items simultaneously. Perfect for data-parallel tasks.GPU (NVIDIA CUDA), Intel SSE/AVX, Array processors
MISDMultiple1Multiple instructions operate on the same data. Very rare in practice.Systolic arrays, fault-tolerant systems (Space Shuttle)
MIMDMultipleMultipleMultiple processors execute different instructions on different data. Most common parallel architecture today.Multi-core CPUs (i5/i7), server clusters, AWS EC2
ASCII โ€” SIMD Operation Example
  SIMD: Add two arrays A[] + B[] โ†’ C[]

  Traditional (SISD):              SIMD (one instruction, 4 data):
  โ”Œโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”          โ”Œโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”
  โ”‚A[0]โ”‚+ โ”‚B[0]โ”‚= โ”‚C[0]โ”‚  Cycle1  โ”‚A[0]โ”‚A[1]โ”‚A[2]โ”‚A[3]โ”‚ + โ”‚B[0]โ”‚B[1]โ”‚B[2]โ”‚B[3]โ”‚
  โ””โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”˜          โ””โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”˜
  โ”Œโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”                     โ”‚
  โ”‚A[1]โ”‚+ โ”‚B[1]โ”‚= โ”‚C[1]โ”‚  Cycle2             โ–ผ  (ALL done in 1 cycle!)
  โ””โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”˜          โ”Œโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”
  โ”Œโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”          โ”‚C[0]โ”‚C[1]โ”‚C[2]โ”‚C[3]โ”‚
  โ”‚A[2]โ”‚+ โ”‚B[2]โ”‚= โ”‚C[2]โ”‚  Cycle3  โ””โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”˜
  โ””โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”˜
  โ”Œโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”          4ร— speedup for array operations!
  โ”‚A[3]โ”‚+ โ”‚B[3]โ”‚= โ”‚C[3]โ”‚  Cycle4
  โ””โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”˜
ISRO's PARAM supercomputers use MIMD architecture โ€” thousands of processors each running different tasks on different data sets for weather prediction, satellite image processing, and molecular simulations. India's National Supercomputing Mission aims to deploy 73 supercomputers across IITs and IISERs โ€” all MIMD systems.

5. Vector/Array Processors โ€” SIMD at Scale

Vector Processor: A CPU designed to operate on entire arrays (vectors) in a single instruction. Instead of a loop processing one element at a time, a vector instruction processes all elements simultaneously.

FeatureScalar ProcessorVector Processor
OperationOne element at a timeEntire vector (e.g., 64 elements) at once
Examplefor(i=0;i<64;i++) C[i]=A[i]+B[i];VADD V3, V1, V2 (one instruction!)
Loop overhead64 loop iterations, branches, counter updatesNo loop needed
Best forGeneral-purpose, irregular codeScientific computing, multimedia, AI
Historical exampleIntel 8086Cray-1 (1976), NEC SX-Aurora

GPU Parallel Model โ€” The Modern Vector Processor

Modern GPUs (NVIDIA, AMD) are essentially massively parallel SIMD processors. An NVIDIA A100 GPU has 6,912 CUDA cores that execute the same instruction on thousands of data elements simultaneously.

ASCII โ€” GPU Parallel Model
  CPU (8 cores)                    GPU (6912 cores)
  โ”Œโ”€โ”€โ”โ”Œโ”€โ”€โ”โ”Œโ”€โ”€โ”โ”Œโ”€โ”€โ”               โ”Œโ”€โ”ฌโ”€โ”ฌโ”€โ”ฌโ”€โ”ฌโ”€โ”ฌโ”€โ”ฌโ”€โ”ฌโ”€โ”ฌโ”€โ”ฌโ”€โ”ฌโ”€โ”ฌโ”€โ”
  โ”‚C1โ”‚โ”‚C2โ”‚โ”‚C3โ”‚โ”‚C4โ”‚               โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ SM 1
  โ””โ”€โ”€โ”˜โ””โ”€โ”€โ”˜โ””โ”€โ”€โ”˜โ””โ”€โ”€โ”˜               โ”œโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ค
  โ”Œโ”€โ”€โ”โ”Œโ”€โ”€โ”โ”Œโ”€โ”€โ”โ”Œโ”€โ”€โ”               โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ SM 2
  โ”‚C5โ”‚โ”‚C6โ”‚โ”‚C7โ”‚โ”‚C8โ”‚               โ”œโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ค
  โ””โ”€โ”€โ”˜โ””โ”€โ”€โ”˜โ””โ”€โ”€โ”˜โ””โ”€โ”€โ”˜               โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ SM 3
                                  โ”œโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ผโ”€โ”ค
  Few cores,                      โ”‚...   108 Streaming      โ”‚
  complex each                    โ”‚     Multiprocessors     โ”‚
                                  โ””โ”€โ”ดโ”€โ”ดโ”€โ”ดโ”€โ”ดโ”€โ”ดโ”€โ”ดโ”€โ”ดโ”€โ”ดโ”€โ”ดโ”€โ”ดโ”€โ”ดโ”€โ”˜
                                  Thousands of simple cores
For GATE: Remember that GPUs follow the SIMD/SIMT model (Single Instruction, Multiple Threads). The CPU is MIMD. Understanding this distinction is crucial for questions on Flynn's classification.

6. Multiprocessor Organization

When you have multiple processors working together, the key question is: How do they share data?

FeatureShared Memory (Tightly Coupled)Distributed Memory (Loosely Coupled)
MemoryAll processors access a common memoryEach processor has its own local memory
CommunicationThrough shared variables in common memoryThrough message passing (network)
ProgrammingEasier โ€” just read/write shared variablesHarder โ€” explicit send/receive messages
ScalabilityLimited (memory bus becomes bottleneck)Highly scalable (1000s of nodes)
ExampleMulti-core laptop (i5, Ryzen 5)AWS server cluster, Google's data centres
Programming modelOpenMP, pthreadsMPI (Message Passing Interface)
ASCII โ€” Shared vs Distributed Memory
  SHARED MEMORY:                    DISTRIBUTED MEMORY:
  โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”      โ”Œโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”
  โ”‚ P1 โ”‚ โ”‚ P2 โ”‚ โ”‚ P3 โ”‚ โ”‚ P4 โ”‚      โ”‚ P1 โ”‚M1 โ”‚ โ”‚ P2 โ”‚M2 โ”‚ โ”‚ P3 โ”‚M3 โ”‚
  โ””โ”€โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”˜      โ””โ”€โ”€โ”ฌโ”€โ”ดโ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”ดโ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”ดโ”€โ”€โ”€โ”˜
     โ”‚      โ”‚      โ”‚      โ”‚            โ”‚          โ”‚          โ”‚
  โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•      โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•
     โ”‚     SHARED BUS     โ”‚             โ”‚  NETWORK (LAN/InfiniBand)
  โ”Œโ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”
  โ”‚    SHARED MEMORY (RAM)   โ”‚       Each processor has its own
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜       private memory. Data shared
  All processors see the same        via explicit messages.
  memory address space.

Interconnection Networks

Network TypeTopologyCostUsed In
BusSingle shared wireCheapestSmall multi-core (2โ€“8 cores)
CrossbarFull connectivity matrixExpensive (Nยฒ switches)High-performance servers
Mesh2D grid of nodesModerateGPU internal, Network-on-Chip
HypercubeEach node connected to logโ‚‚N othersModerateOlder supercomputers
Omega/ButterflyMulti-stage switching networkModerateTheoretical/exam questions

7. Superscalar & VLIW โ€” Multiple Issue Architectures

Pipelining gives us one instruction per cycle (ideally). But what if we want multiple instructions per cycle? Two approaches:

๐Ÿš€ Superscalar Architecture

What: Hardware dynamically detects independent instructions and issues 2โ€“6 of them per clock cycle to multiple execution units.

Analogy: A superscalar CPU is like a restaurant with multiple chefs. The manager (hardware scheduler) looks at the pending orders and assigns independent orders to different chefs simultaneously.

Key features:

โ€ข Multiple execution units (2+ ALUs, 1+ FPU, 1+ Load/Store unit)

โ€ข Hardware dependency checking and scheduling

โ€ข Out-of-order execution (execute instructions in any order, as long as dependencies are respected)

โ€ข Speculative execution (predict branch outcomes and execute speculatively)

Examples: Intel Core i7 (4-way superscalar), AMD Ryzen (6-way), Apple M2 (8-way)

๐Ÿ“ฆ VLIW โ€” Very Long Instruction Word

What: The compiler bundles multiple independent operations into one very long instruction word. Hardware simply executes them โ€” no dynamic scheduling needed.

Analogy: A VLIW CPU is like a pre-planned meal kit (e.g., from Licious or FreshMenu). The chef (compiler) has already decided which ingredients go together โ€” the kitchen just follows the instructions. No manager needed.

Key features:

โ€ข Compiler does all the hard work of finding parallelism

โ€ข Simpler hardware (no dependency checking logic)

โ€ข Instructions are 128โ€“1024 bits wide (containing 3โ€“8 operations)

โ€ข If compiler can't find enough parallel ops, NOPs are inserted (wasted slots)

Examples: Intel Itanium (IA-64), TI C6000 DSP processors, some embedded processors

FeatureSuperscalarVLIW
Scheduling done byHardware (at runtime)Compiler (at compile time)
Hardware complexityVery complex (dependency logic, reorder buffer)Simpler (no dynamic scheduling)
Compiler complexityStandard compilers workNeeds very sophisticated compiler
Binary compatibilityOld programs run on new hardwareRecompilation needed for new hardware
Power consumptionHigher (complex hardware)Lower (simpler hardware)
Best forGeneral-purpose (laptops, servers)Embedded, DSP, specific workloads
Market statusDominant (Intel, AMD, Arm)Niche (DSPs, some embedded)
Students confuse superscalar with pipelining. Pipelining = one instruction per stage, overlapping stages (like one assembly line). Superscalar = multiple pipelines, issuing multiple instructions per cycle (like multiple assembly lines). A superscalar CPU is always pipelined, but a pipelined CPU is not necessarily superscalar.

8. Amdahl's Law โ€” The Limit of Parallelism

The fundamental question: If I add more processors, how much faster will my program run? Amdahl's Law gives the sobering answer.

๐Ÿ“ Amdahl's Law Formula

Variables:

โ€ข f = fraction of the program that can be parallelised (0 โ‰ค f โ‰ค 1)

โ€ข p = number of processors

โ€ข (1 โˆ’ f) = fraction that must remain serial (cannot be parallelised)

Formula
                    1
  Speedup = โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
             (1 โˆ’ f) + (f / p)

  As p โ†’ โˆž:
                   1
  Max Speedup = โ”€โ”€โ”€โ”€โ”€โ”€โ”€
                 1 โˆ’ f

  ; Even with infinite processors, the serial portion limits speedup!

Worked Example: Amdahl's Law

Numerical
  Given: 75% of a program can be parallelised (f = 0.75)
         Number of processors p = 4

  Speedup = 1 / ((1 โˆ’ 0.75) + (0.75 / 4))
          = 1 / (0.25 + 0.1875)
          = 1 / 0.4375
          = 2.286ร—

  With p = 16:
  Speedup = 1 / (0.25 + 0.75/16)
          = 1 / (0.25 + 0.047)
          = 1 / 0.297
          = 3.37ร—

  With p = โˆž:
  Max Speedup = 1 / (1 โˆ’ 0.75) = 1 / 0.25 = 4ร—

  โœ… Even with INFINITE processors, max speedup is only 4ร—!
  โœ… The 25% serial portion is the bottleneck.

  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚  Processors โ”‚  Speedup  โ”‚  Efficiency (Speedup/p)          โ”‚
  โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
  โ”‚      1      โ”‚   1.00ร—   โ”‚   100%                           โ”‚
  โ”‚      2      โ”‚   1.60ร—   โ”‚    80%                           โ”‚
  โ”‚      4      โ”‚   2.29ร—   โ”‚    57%                           โ”‚
  โ”‚      8      โ”‚   2.91ร—   โ”‚    36%                           โ”‚
  โ”‚     16      โ”‚   3.37ร—   โ”‚    21%                           โ”‚
  โ”‚     64      โ”‚   3.76ร—   โ”‚     6%                           โ”‚
  โ”‚      โˆž      โ”‚   4.00ร—   โ”‚     0% โ†’ diminishing returns!    โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
This is why your 8-core Ryzen laptop doesn't feel 8ร— faster than a single-core machine. Most everyday applications (browsing, Office) have significant serial portions. However, tasks like video rendering (90%+ parallel) or machine learning training (95%+ parallel) see dramatic speedups with more cores โ€” which is why content creators and ML engineers buy 16โ€“64 core machines.
GATE Tip: Amdahl's Law is a favourite 2-mark GATE question. The trick: always identify f (parallel fraction) and (1โˆ’f) (serial fraction) first. Common trap: students forget that the serial portion cannot be sped up by adding processors.
Section D

Learn by Doing โ€” 3-Tier Lab Structure

๐ŸŸข Tier 1 โ€” GUIDED: Draw a Pipeline Space-Time Chart

โฑ๏ธ 30โ€“45 minutesBeginnerPen-and-paper exercise

Task: Draw the space-time diagram for 6 instructions in a 5-stage pipeline

Step 1: Draw a grid. Rows = Instructions (I1 to I6). Columns = Clock Cycles (CC1 to CC10).

Step 2: Fill in I1: IF in CC1, ID in CC2, EX in CC3, MEM in CC4, WB in CC5.

Step 3: I2 starts in CC2: IF in CC2, ID in CC3, EX in CC4, MEM in CC5, WB in CC6.

Step 4: Continue for I3โ€“I6, each starting one cycle after the previous.

Step 5: Calculate: Total cycles = k + (n โˆ’ 1) = 5 + 5 = 10 cycles.

Step 6: Calculate Speedup = (6 ร— 5) / 10 = 3.0ร—

Step 7: Now introduce a data hazard between I2 and I3 (I3 depends on I2's result). Insert 2 stall bubbles after I2. Redraw the diagram and recalculate total cycles.

Expected Output
  Without hazard: 10 cycles, Speedup = 3.0ร—
  With 2-cycle stall: 12 cycles, Speedup = 30/12 = 2.5ร—
  Performance loss due to hazard: (3.0 โˆ’ 2.5) / 3.0 = 16.7%
Stretch: Add a control hazard at I4 (branch instruction with 2-cycle penalty). How many total cycles now?

๐ŸŸก Tier 2 โ€” SEMI-GUIDED: Pipeline Speedup Calculation Sheet

โฑ๏ธ 45โ€“60 minutesIntermediateCalculator + pen-and-paper

Your Mission:

Solve the following pipeline speedup problems. Show all working.

Problem 1: A 7-stage pipeline executes 200 instructions. Calculate speedup and efficiency.

Problem 2: If 15% of instructions cause a 3-cycle data hazard stall, what is the effective CPI?

Problem 3: Using Amdahl's Law, if 80% of a program is parallelisable and you have 8 processors, calculate speedup. What if you upgrade to 64 processors?

Problem 4: A superscalar processor issues 3 instructions per cycle. If 20% of cycles have only 1 useful instruction (due to dependencies), what is the average IPC (Instructions Per Cycle)?

Stretch: Compare your Amdahl's Law results with Gustafson's Law for the same parameters. Which gives a more optimistic result and why?

๐Ÿ”ด Tier 3 โ€” OPEN CHALLENGE: Python Pipeline Speedup Calculator

โฑ๏ธ 60โ€“90 minutesAdvancedPython coding

Build a Python program that calculates and visualises pipeline performance:

Python
def pipeline_speedup(n, k):
    """Calculate pipeline speedup and efficiency."""
    sequential_time = n * k
    pipeline_time = k + (n - 1)
    speedup = sequential_time / pipeline_time
    efficiency = speedup / k
    return {
        'sequential_cycles': sequential_time,
        'pipeline_cycles': pipeline_time,
        'speedup': round(speedup, 3),
        'efficiency': round(efficiency * 100, 2),
    }

def amdahl_speedup(f, p):
    """Calculate Amdahl's Law speedup."""
    speedup = 1 / ((1 - f) + (f / p))
    max_speedup = 1 / (1 - f) if f < 1 else float('inf')
    return {
        'speedup': round(speedup, 3),
        'max_speedup': round(max_speedup, 3),
        'efficiency': round((speedup / p) * 100, 2)
    }

# โ”€โ”€โ”€ Test Pipeline Speedup โ”€โ”€โ”€
print("=== Pipeline Speedup ===")
for n in [10, 50, 100, 1000]:
    result = pipeline_speedup(n, 5)
    print(f"n={n:5} | Speedup={result['speedup']:6} | Efficiency={result['efficiency']}%")

# โ”€โ”€โ”€ Test Amdahl's Law โ”€โ”€โ”€
print("\n=== Amdahl's Law (f=0.90) ===")
for p in [1, 2, 4, 8, 16, 64, 256]:
    result = amdahl_speedup(0.90, p)
    print(f"p={p:4} | Speedup={result['speedup']:7} | Max={result['max_speedup']} | Eff={result['efficiency']}%")
=== Pipeline Speedup === n= 10 | Speedup= 3.571 | Efficiency=71.43% n= 50 | Speedup= 4.63 | Efficiency=92.59% n= 100 | Speedup= 4.808 | Efficiency=96.15% n= 1000 | Speedup= 4.98 | Efficiency=99.6% === Amdahl's Law (f=0.90) === p= 1 | Speedup= 1.0 | Max=10.0 | Eff=100.0% p= 2 | Speedup= 1.818 | Max=10.0 | Eff=90.91% p= 4 | Speedup= 3.077 | Max=10.0 | Eff=76.92% p= 8 | Speedup= 4.706 | Max=10.0 | Eff=58.82% p= 16 | Speedup= 6.4 | Max=10.0 | Eff=40.0% p= 64 | Speedup= 8.767 | Max=10.0 | Eff=13.7% p= 256 | Speedup= 9.671 | Max=10.0 | Eff=3.78%
Add this to your GitHub portfolio! A well-documented Python pipeline calculator with clear comments shows employers you understand both the theory and can implement it. Add a matplotlib graph showing speedup vs. processors for different f values โ€” instant portfolio piece.
Section E

Problem Bank โ€” Diagrams, Numericals, GATE & Industry

Diagram-Based Questions (3)

D1

Draw the complete space-time diagram for 8 instructions in a 6-stage pipeline. Mark the pipeline filling phase, steady state, and draining phase. Calculate speedup.

CreateDiagram
Total cycles = 6 + 7 = 13. Filling = CC1โ€“CC6 (6 cycles). Steady state = CC7โ€“CC8 (2 cycles, all stages busy). Draining = CC9โ€“CC13 (5 cycles). Speedup = (8ร—6)/13 = 3.69ร—. Efficiency = 3.69/6 = 61.5%.
D2

Draw Flynn's classification as a 2ร—2 grid with instruction streams on one axis and data streams on the other. For each quadrant, draw a block diagram showing the processor-memory organisation.

CreateDiagram
SISD: 1 CU โ†’ 1 PU โ†’ 1 MU. SIMD: 1 CU โ†’ N PUs โ†’ N MUs. MISD: N CUs โ†’ N PUs โ†’ 1 MU. MIMD: N CUs โ†’ N PUs โ†’ N MUs (shared or distributed). Label each with real-world examples.
D3

Draw the forwarding (bypass) path in a 5-stage pipeline for the following instructions: I1: ADD R1, R2, R3 and I2: SUB R4, R1, R5. Show exactly where the forwarding wire connects.

ApplyDiagram
Forwarding path goes from I1's EX/MEM pipeline register output to I2's ALU input (EX stage input mux). The EX/MEM latch stores R1's computed value at end of CC3. This value is forwarded to I2's EX stage in CC4 through a bypass multiplexer, avoiding 2 stall cycles.

Numerical Problems (6)

N1 โ€” Pipeline Speedup

A 4-stage pipeline processes 500 instructions. Each stage takes 2 ns. (a) Calculate total execution time with and without pipelining. (b) Calculate speedup. (c) What is the throughput (instructions/ns)?

ApplyNumerical
Without pipeline: 500 ร— 4 ร— 2 = 4000 ns. With pipeline: (4 + 499) ร— 2 = 503 ร— 2 = 1006 ns. Speedup = 4000/1006 = 3.976ร—. Throughput = 500/1006 = 0.497 instructions/ns โ‰ˆ 497 MIPS (if 1 ns = 1 cycle).
N2 โ€” Pipeline with Stalls

A 5-stage pipeline has 200 instructions. 30% of instructions cause a 1-cycle data hazard stall. Calculate the actual speedup and effective CPI.

ApplyNumerical
Without pipeline: 200 ร— 5 = 1000 cycles. Stall cycles = 200 ร— 0.30 ร— 1 = 60. Pipeline cycles = (5 + 199) + 60 = 264. Speedup = 1000/264 = 3.79ร—. Effective CPI = 264/200 = 1.32.
N3 โ€” Amdahl's Law

A program takes 100 seconds. 60% of the execution can be parallelised. (a) Find speedup with 4 processors. (b) How many processors for a 2ร— speedup? (c) What is the maximum possible speedup?

ApplyNumerical
(a) S = 1/(0.4 + 0.6/4) = 1/(0.4 + 0.15) = 1/0.55 = 1.818ร—. Time = 100/1.818 = 55 sec. (b) 2 = 1/(0.4 + 0.6/p) โ†’ 0.4 + 0.6/p = 0.5 โ†’ 0.6/p = 0.1 โ†’ p = 6 processors. (c) Max S = 1/0.4 = 2.5ร—.
N4 โ€” Branch Penalty

A 5-stage pipeline processes 1000 instructions. 20% are branches. Branch penalty is 2 cycles. Branch predictor accuracy is 85%. Calculate effective CPI and total execution time (clock period = 1 ns).

AnalyseNumerical
Misprediction rate = 15%. Effective CPI = 1 + 0.20 ร— 2 ร— 0.15 = 1 + 0.06 = 1.06. Total cycles = 1000 ร— 1.06 + 4 (pipeline fill) = 1064. Time = 1064 ร— 1 ns = 1064 ns. Without branches: 1004 ns. Slowdown = 6%.
N5 โ€” Hazard Detection

Identify ALL data hazards (RAW, WAR, WAW) in this instruction sequence:

I1: ADD  R1, R2, R3
I2: SUB  R4, R1, R5
I3: AND  R6, R4, R1
I4: OR   R1, R7, R8
I5: ADD  R4, R1, R6
AnalyseNumerical
RAW hazards: I1โ†’I2 (R1), I1โ†’I3 (R1), I2โ†’I3 (R4), I4โ†’I5 (R1), I3โ†’I5 (R6). WAR: I3โ†’I4 (R1 โ€” I3 reads R1, I4 writes R1). WAW: I1โ†’I4 (R1 โ€” both write R1), I2โ†’I5 (R4 โ€” both write R4). Total: 5 RAW, 1 WAR, 2 WAW = 8 hazards.
N6 โ€” Superscalar IPC

A 4-issue superscalar processor executes 10,000 instructions in 4,000 clock cycles. (a) What is the average IPC? (b) What is the average issue rate utilisation? (c) If the clock frequency is 3 GHz, what is the MIPS rating?

ApplyNumerical
(a) IPC = 10000/4000 = 2.5 instructions/cycle. (b) Utilisation = 2.5/4 = 62.5% of issue slots used. (c) MIPS = IPC ร— frequency = 2.5 ร— 3000 MHz = 7500 MIPS.

Industry Application Questions (3)

๐Ÿญ Industry Q1: AWS Auto Scaling and Amdahl's Law

AWS auto-scaling adds more EC2 instances during peak traffic. An e-commerce application has 70% parallelisable workload. Currently running on 4 instances. Management wants to know: will doubling to 8 instances give 2ร— performance? Use Amdahl's Law to justify your answer.

With 4 instances: S = 1/(0.3 + 0.7/4) = 1/0.475 = 2.105ร—. With 8: S = 1/(0.3 + 0.7/8) = 1/0.3875 = 2.581ร—. Improvement from 4โ†’8: only 2.581/2.105 = 1.226ร— (22.6% improvement, NOT 2ร—). Doubling servers only gives ~23% more performance. Recommend: optimize the 30% serial portion instead.

๐Ÿญ Industry Q2: Intel vs AMD Pipeline Depth

Intel's Pentium 4 (NetBurst) had a 31-stage pipeline, while AMD's Athlon 64 had only a 12-stage pipeline. The Pentium 4 ran at higher clock speed (3.8 GHz vs 2.4 GHz) but often performed worse in real benchmarks. Explain why using pipeline hazard concepts.

Deeper pipeline โ†’ higher clock speed (each stage does less work โ†’ shorter cycle). But deeper pipeline โ†’ higher branch misprediction penalty (31 wasted cycles vs 12). With ~15% misprediction rate, P4 wastes 31ร—0.15 = 4.65 cycles per branch on average vs AMD's 12ร—0.15 = 1.8 cycles. This large penalty, combined with other hazards, offset the clock speed advantage. This is why Intel abandoned deep pipelines and returned to shorter pipelines with Core architecture.

๐Ÿญ Industry Q3: GPU vs CPU for Deep Learning at IIT Madras

IIT Madras's AI lab trains a neural network that involves multiplying 1000ร—1000 matrices. They have a choice: 16-core Xeon CPU (MIMD) or NVIDIA A100 GPU (6912 CUDA cores, SIMD). Matrix multiplication is 99% parallelisable. Which should they choose and why? Use Amdahl's Law and Flynn's classification to justify.

Matrix multiplication is data-parallel โ†’ SIMD (GPU) is ideal. Amdahl's with CPU: S = 1/(0.01 + 0.99/16) = 1/0.0719 = 13.9ร—. With GPU (treating 6912 cores): S = 1/(0.01 + 0.99/6912) = 1/0.01014 = 98.6ร—. GPU gives ~7ร— better speedup than CPU for this workload. Choose GPU. Flynn's: Matrix ops are same operation on different data โ†’ SIMD. GPUs are designed for SIMD. CPUs are MIMD โ€” each core can do different work, but that flexibility is wasted on uniform matrix ops.

GATE-Style Questions (5)

GATE-1 (2 marks)

A 5-stage pipeline has a clock cycle time of 10 ns. A non-pipelined version takes 40 ns per instruction. For 1000 instructions, the speedup achieved by pipelining is approximately:

  1. 3.98
  2. 4.00
  3. 5.00
  4. 3.50
ApplyGATE
โœ… (A). Non-pipelined: 1000 ร— 40 = 40,000 ns. Pipelined: (5 + 999) ร— 10 = 1004 ร— 10 = 10,040 ns. Speedup = 40,000/10,040 = 3.984 โ‰ˆ 3.98. Note: non-pipelined time โ‰  k ร— pipeline cycle if stages are unbalanced (40 ns โ‰  5 ร— 10 ns = 50 ns here).
GATE-2 (2 marks)

Consider a program with 80% parallelisable code. Using Amdahl's law, the speedup with 4 processors is ___. (fill in the blank, rounded to 2 decimal places)

ApplyGATE
โœ… 2.50. S = 1/(0.20 + 0.80/4) = 1/(0.20 + 0.20) = 1/0.40 = 2.50.
GATE-3 (1 mark)

Which of the following is an example of MIMD architecture?

  1. GPU executing shader programs
  2. Systolic array
  3. Multi-core processor running different threads
  4. Cray-1 vector processor
RememberGATE
โœ… (C). Multi-core processor running different threads = Multiple Instruction streams, Multiple Data streams = MIMD. GPU = SIMD (same instruction on different data). Systolic array = MISD (debatable). Cray-1 = SIMD vector.
GATE-4 (2 marks)

In a pipelined processor, the following instructions are executed:

I1: LOAD  R1, 0(R2)
I2: ADD   R3, R1, R4
I3: SUB   R5, R3, R6

How many stall cycles are needed with forwarding? (Assume load-use hazard requires 1 stall even with forwarding.)

  1. 0
  2. 1
  3. 2
  4. 3
AnalyseGATE
โœ… (B) 1. I1โ†’I2: LOAD-use hazard (R1 available after MEM, but I2 needs it in EX) = 1 stall even with forwarding. I2โ†’I3: ALU-ALU dependency (R3). With forwarding from I2's EX output to I3's EX input = 0 stalls. Total = 1 stall.
GATE-5 (2 marks)

A superscalar processor can issue 2 instructions per cycle. If 25% of instruction pairs have dependencies that prevent dual issue, what is the effective IPC?

  1. 1.25
  2. 1.50
  3. 1.75
  4. 2.00
ApplyGATE
โœ… (C). 75% of cycles issue 2 instructions, 25% issue only 1. Average IPC = 0.75 ร— 2 + 0.25 ร— 1 = 1.50 + 0.25 = 1.75.
Section F

MCQ Assessment Bank โ€” 30 Questions (Bloom's Mapped)

Remember / Identify (Q1โ€“Q5)

Q1

The 5 stages of a classic RISC pipeline are:

  1. IF, ID, EX, MEM, WB
  2. IF, DE, EX, ST, WB
  3. FE, DC, EX, MA, WR
  4. IF, ID, ALU, DM, RF
Remember
โœ… (A) IF (Instruction Fetch), ID (Instruction Decode), EX (Execute), MEM (Memory Access), WB (Write Back).
Q2

Flynn's classification that represents a multi-core processor running different programs is:

  1. SISD
  2. SIMD
  3. MISD
  4. MIMD
Remember
โœ… (D) MIMD โ€” Multiple Instruction streams, Multiple Data streams. Each core runs its own program on its own data.
Q3

In Amdahl's Law, what does f represent?

  1. Frequency of the processor
  2. Fraction of the program that can be parallelised
  3. Number of floating-point operations
  4. Fan-out of the logic gates
Remember
โœ… (B) f = fraction of the program that can be parallelised (0 โ‰ค f โ‰ค 1).
Q4

RAW stands for:

  1. Read And Write
  2. Read After Write
  3. Register Allocation Width
  4. Random Access Window
Remember
โœ… (B) RAW = Read After Write โ€” the most common data hazard where an instruction reads a register before a previous instruction writes to it.
Q5

VLIW stands for:

  1. Variable Length Instruction Width
  2. Very Long Instruction Word
  3. Virtual Logic Instruction Wire
  4. Vector Linear Instruction Wrapper
Remember
โœ… (B) VLIW = Very Long Instruction Word โ€” compiler packs multiple independent operations into one wide instruction.

Understand / Explain (Q6โ€“Q10)

Q6

Why does a deeper pipeline NOT always result in higher performance?

  1. Deeper pipelines require more transistors
  2. Pipeline registers add latency overhead, and branch/data hazard penalties increase proportionally
  3. Deeper pipelines cannot be clocked at high frequencies
  4. Only RISC processors can have deep pipelines
Understand
โœ… (B) Each pipeline stage needs a register (latch) which adds overhead. More critically, a branch misprediction on a 20-stage pipeline wastes 20 cycles vs only 5 on a 5-stage pipeline. The penalty grows with depth.
Q7

Data forwarding eliminates stalls by:

  1. Predicting the result before execution
  2. Passing the computed result directly from one pipeline stage's output to the next instruction's input, bypassing the register file
  3. Executing both instructions in the same cycle
  4. Storing results in cache instead of registers
Understand
โœ… (B) Forwarding (bypassing) routes the result from the EX/MEM stage output directly to the next instruction's ALU input through a multiplexer, without waiting for the WB stage to write it to the register file.
Q8

Why is SIMD architecture ideal for image processing?

  1. Images require complex branching logic
  2. Each pixel needs a different operation
  3. The same operation (e.g., brightness adjustment) is applied to millions of pixels simultaneously โ€” perfect data parallelism
  4. Images are always stored sequentially in memory
Understand
โœ… (C) Image processing applies the same mathematical operation to every pixel. SIMD executes one instruction on multiple data elements simultaneously โ€” exactly matching pixel-level parallelism.
Q9

In shared-memory multiprocessors, the primary scalability bottleneck is:

  1. Disk I/O speed
  2. Memory bus bandwidth โ€” all processors compete for the same bus
  3. Number of registers per processor
  4. Operating system kernel size
Understand
โœ… (B) With shared memory, all processors access memory through a common bus. As the number of processors increases, they contend for bus bandwidth, creating a bottleneck. This limits shared-memory systems to ~16โ€“64 processors typically.
Q10

Amdahl's Law shows that adding more processors yields diminishing returns because:

  1. Processors interfere with each other
  2. The serial (non-parallelisable) portion remains constant regardless of the number of processors
  3. Memory speed doesn't scale with processors
  4. Operating system overhead increases linearly
Understand
โœ… (B) Even with infinite processors, the serial fraction (1โˆ’f) cannot be parallelised. This serial portion becomes the dominant time component as p increases, limiting maximum speedup to 1/(1โˆ’f).

Apply / Calculate (Q11โ€“Q15)

Q11

A 4-stage pipeline executes 100 instructions. The speedup is:

  1. 3.88
  2. 4.00
  3. 3.50
  4. 3.00
Apply
โœ… (A) S = nk/(k+nโˆ’1) = 100ร—4/(4+99) = 400/103 = 3.883 โ‰ˆ 3.88.
Q12

Using Amdahl's Law, if 90% of a program is parallelisable and we use 10 processors, the speedup is:

  1. 5.26
  2. 9.00
  3. 10.00
  4. 4.50
Apply
โœ… (A) S = 1/(0.10 + 0.90/10) = 1/(0.10 + 0.09) = 1/0.19 = 5.263 โ‰ˆ 5.26.
Q13

A pipeline has 6 stages. The maximum asymptotic speedup (n โ†’ โˆž) is:

  1. 5
  2. 6
  3. 7
  4. โˆž
Apply
โœ… (B) As n โ†’ โˆž, S = nk/(k+nโˆ’1) โ†’ k = 6. The ideal speedup equals the number of pipeline stages.
Q14

Branch frequency = 25%, penalty = 3 cycles, predictor accuracy = 80%. The effective CPI (ideal CPI = 1) is:

  1. 1.15
  2. 1.25
  3. 1.75
  4. 1.05
Apply
โœ… (A) CPI = 1 + 0.25 ร— 3 ร— 0.20 = 1 + 0.15 = 1.15. The misprediction rate = 1 โˆ’ 0.80 = 0.20.
Q15

Pipeline efficiency for 50 instructions in a 10-stage pipeline is:

  1. 84.7%
  2. 50.0%
  3. 90.0%
  4. 66.7%
Apply
โœ… (A) Efficiency = n/(k+nโˆ’1) = 50/(10+49) = 50/59 = 0.8475 = 84.7%.

Analyse / Compare (Q16โ€“Q20)

Q16

Consider: I1: ADD R1,R2,R3 and I2: ADD R2,R1,R4. The hazard between I1 and I2 includes:

  1. RAW only
  2. WAR only
  3. Both RAW and WAR
  4. No hazard
Analyse
โœ… (C) RAW: I2 reads R1, which I1 writes. WAR: I2 writes R2, which I1 reads. Both hazards exist simultaneously.
Q17

Which pipeline hazard is completely eliminated by the Harvard architecture (separate instruction and data memories)?

  1. Data hazard (RAW)
  2. Control hazard
  3. Structural hazard (memory access conflict)
  4. WAW hazard
Analyse
โœ… (C) The structural hazard caused by IF and MEM stages both needing memory is eliminated when instruction memory and data memory are separate (Harvard architecture).
Q18

Superscalar processors achieve higher IPC than scalar pipelined processors because:

  1. They use longer pipelines
  2. They issue multiple independent instructions to parallel execution units per cycle
  3. They use faster memory
  4. They eliminate all hazards
Analyse
โœ… (B) Superscalar processors have multiple execution units and hardware to detect independent instructions that can be issued simultaneously, achieving IPC > 1.
Q19

A VLIW processor with 4 operation slots processes a loop where the compiler can only fill 2 slots on average. The utilisation is:

  1. 25%
  2. 50%
  3. 75%
  4. 100%
Analyse
โœ… (B) Utilisation = filled slots / total slots = 2/4 = 50%. The other 2 slots contain NOPs. This is a common problem with VLIW โ€” if the compiler can't find enough parallelism, slots are wasted.
Q20

For Amdahl's Law: if f = 0.95 and p = 20, the speedup is 10.26. If we double f to 0.975 (by optimising serial code), the new speedup with same 20 processors is approximately:

  1. 12.50
  2. 15.24
  3. 20.00
  4. 17.39
Analyse
โœ… (B) S = 1/(0.025 + 0.975/20) = 1/(0.025 + 0.04875) = 1/0.07375 = 13.56. Hmm, let me recalculate. Actually with f = 0.975, p = 20: S = 1/((1โˆ’0.975) + 0.975/20) = 1/(0.025 + 0.04875) = 1/0.07375 โ‰ˆ 13.56. Closest is (B). The point: reducing serial fraction by half nearly doubles the speedup!

Evaluate / Justify (Q21โ€“Q25)

Q21

Intel abandoned the 31-stage Pentium 4 pipeline and returned to a 14-stage Core architecture. The primary reason was:

  1. 31 stages used too much silicon area
  2. Branch misprediction penalties were too severe, negating the clock speed advantage
  3. The 31-stage pipeline couldn't support 64-bit operations
  4. AMD patented deep pipelines
Evaluate
โœ… (B) A 31-stage misprediction wastes 31 cycles. With ~10โ€“15% misprediction rate on real code, the performance loss exceeded the gain from higher clock speeds. The Core architecture trades clock speed for fewer wasted cycles.
Q22

VLIW failed in the general-purpose desktop market (Intel Itanium) primarily because:

  1. It was too fast for desktop applications
  2. General-purpose code has irregular parallelism that compilers struggle to exploit, leading to many wasted instruction slots
  3. VLIW can't execute floating-point operations
  4. Operating systems don't support VLIW
Evaluate
โœ… (B) General-purpose programs have branches, pointer aliasing, and dynamic data dependencies that compilers can't resolve at compile time. This leads to many NOP-filled slots and poor utilisation. Superscalar hardware handles this dynamically at runtime.
Q23

Adding a 2-bit branch predictor instead of a 1-bit predictor primarily helps with:

  1. Reducing pipeline stages
  2. Handling loops where the branch outcome changes at the beginning and end (avoiding double mispredictions)
  3. Increasing clock frequency
  4. Reducing data hazards
Evaluate
โœ… (B) A 1-bit predictor mispredicts twice per loop (at entry and exit). A 2-bit saturating counter requires two consecutive mispredictions to change state โ€” so it only mispredicts once at the loop exit, improving accuracy for loops.
Q24

For a task that is 50% serial, the maximum speedup regardless of processors is:

  1. 2ร—
  2. 4ร—
  3. โˆž
  4. 50ร—
Evaluate
โœ… (A) By Amdahl's Law, max speedup = 1/(1โˆ’f) = 1/0.50 = 2ร—. Even with infinite processors, you can never exceed 2ร— speedup because 50% of the work is inherently sequential.
Q25

Distributed memory systems are preferred over shared memory for 1000+ node clusters because:

  1. They are cheaper per node
  2. Shared memory bus bandwidth cannot scale to 1000+ processors โ€” distributed memory uses local memory with network communication, avoiding the shared bus bottleneck
  3. Distributed systems don't need an operating system
  4. Shared memory cannot use cache
Evaluate
โœ… (B) A shared bus supporting 1000 processors would need impossible bandwidth. Distributed memory gives each processor its own local memory, communicating via a scalable network (InfiniBand, Ethernet). This scales to millions of nodes (like Google's clusters).

Create / Design (Q26โ€“Q30)

Q26

To resolve a load-use hazard without stalling, the compiler should:

  1. Remove the load instruction
  2. Reorder instructions to place an independent instruction between the load and its dependent use
  3. Convert the load to a store
  4. Duplicate the load instruction
Create
โœ… (B) The compiler finds an independent instruction that can be moved into the "delay slot" โ€” the cycle between load and use. This hides the latency without any hardware stall.
Q27

To maximise Amdahl's Law speedup, a software engineer should primarily focus on:

  1. Buying more processors
  2. Reducing the serial (non-parallelisable) fraction of the program
  3. Increasing clock frequency
  4. Adding more cache
Create
โœ… (B) Since max speedup = 1/(1โˆ’f), reducing (1โˆ’f) โ€” the serial fraction โ€” has the greatest impact. Making serial code parallel or optimising it is more effective than adding processors.
Q28

When designing a pipelined processor, the optimal number of stages is determined by:

  1. As many as possible for maximum clock speed
  2. Balancing higher throughput (more stages) against increased hazard penalties and latch overhead
  3. Exactly 5 stages as defined by the standard
  4. The number of registers in the register file
Create
โœ… (B) More stages โ†’ higher clock speed but higher hazard penalties and more latch delay. The sweet spot depends on workload characteristics. Modern CPUs use 10โ€“20 stages as an optimal balance.
Q29

For a web server handling independent HTTP requests, the best parallel architecture choice is:

  1. SISD โ€” one request at a time
  2. SIMD โ€” same operation on all requests
  3. MIMD โ€” each processor handles a different request independently
  4. MISD โ€” multiple processors on one request
Create
โœ… (C) MIMD โ€” Each HTTP request is independent and may require different processing (GET, POST, database query, file serve). MIMD allows each core/processor to handle a different request with different instructions on different data.
Q30

A hardware designer wants to improve the CPI of a pipelined processor. The most effective combination of techniques is:

  1. Deeper pipeline + no branch prediction
  2. Data forwarding + 2-bit branch predictor + Harvard architecture
  3. Wider data bus + more registers
  4. Faster clock + slower memory
Create
โœ… (B) Data forwarding reduces data hazard stalls. 2-bit branch prediction reduces control hazard penalties. Harvard architecture (separate I-cache and D-cache) eliminates structural hazards. Together, these address all three types of pipeline hazards.
Section G

Short Answer Questions (8 Questions)

SA-1

Define pipelining and explain why the ideal speedup of a k-stage pipeline is k, with the formula for actual speedup.

Pipelining is a technique where multiple instructions overlap in execution by dividing the instruction execution into k stages. Each stage handles a different instruction simultaneously. Ideal speedup = k because, at steady state, one instruction completes every cycle. Actual speedup = nk/(k+nโˆ’1). The (kโˆ’1) overhead comes from the pipeline filling phase (first instruction takes k cycles alone). As n โ†’ โˆž, speedup โ†’ k.
SA-2

What is the difference between a RAW hazard and a WAR hazard? Give one example of each with register-level instructions.

RAW (Read After Write): A later instruction reads a register that an earlier instruction writes. Example: I1: ADD R1,R2,R3 followed by I2: SUB R4,R1,R5 โ€” I2 needs R1's new value from I1. WAR (Write After Read): A later instruction writes a register that an earlier instruction reads. Example: I1: ADD R1,R2,R3 followed by I2: SUB R2,R4,R5 โ€” I2 overwrites R2, which I1 still needs to read. RAW is the most common and dangerous; WAR is rare in simple in-order pipelines.
SA-3

Explain Flynn's SIMD classification. Give two real-world examples and state why SIMD is efficient for those applications.

SIMD (Single Instruction, Multiple Data): One instruction operates on multiple data elements simultaneously. Example 1: GPU applying a brightness filter โ€” same multiply operation on millions of pixels. Example 2: Intel AVX-512 adding 16 floating-point numbers in one instruction for scientific simulations. SIMD is efficient because these applications have inherent data parallelism โ€” the same operation is needed on many independent data elements with no dependencies between them.
SA-4

State Amdahl's Law. A program has 70% parallelisable code. Calculate the maximum possible speedup.

Amdahl's Law: Speedup = 1/((1โˆ’f) + f/p), where f = parallel fraction, p = processors. Maximum speedup (p โ†’ โˆž) = 1/(1โˆ’f) = 1/(1โˆ’0.70) = 1/0.30 = 3.33ร—. Even with infinite processors, the 30% serial code limits speedup to 3.33ร—. This law highlights the importance of minimising the serial portion.
SA-5

Compare shared-memory and distributed-memory multiprocessor organisations. State one advantage and one disadvantage of each.

Shared Memory: All processors access common memory. Advantage: Easy programming (shared variables). Disadvantage: Bus contention limits scalability (~16โ€“64 processors). Distributed Memory: Each processor has private memory, communicates via messages. Advantage: Highly scalable (1000s+ nodes). Disadvantage: Complex programming (explicit message passing with MPI). Modern systems like NUMA (Non-Uniform Memory Access) combine both approaches.
SA-6

What is a structural hazard? How does the Harvard architecture solve it?

Structural hazard: Two pipeline stages need the same hardware resource simultaneously. Example: IF stage and MEM stage both need memory access in the same cycle, but there's only one memory port. Harvard architecture solves this by separating instruction memory (I-cache) and data memory (D-cache). IF reads from I-cache while MEM reads/writes to D-cache โ€” no conflict. Modern processors use separate L1 instruction and data caches for this reason.
SA-7

Explain the difference between superscalar and VLIW architectures. Which is dominant in modern general-purpose CPUs and why?

Superscalar: Hardware dynamically detects and issues multiple independent instructions per cycle. Complex hardware, simple compilers. VLIW: Compiler statically packs independent operations into wide instruction words. Simple hardware, complex compilers. Superscalar dominates because general-purpose code has unpredictable branches and memory patterns that hardware can resolve at runtime but compilers cannot resolve at compile time. VLIW works well only for predictable workloads (DSP, embedded).
SA-8

A 2-bit branch predictor has four states. Draw the state diagram and explain why it's better than a 1-bit predictor for loops.

States: Strongly Not Taken (00) โ†’ Weakly Not Taken (01) โ†’ Weakly Taken (10) โ†’ Strongly Taken (11). On correct prediction, move toward extreme (00 or 11). On misprediction, move one step toward opposite. For loops: A 1-bit predictor mispredicts at both loop entry and exit (2 mispredictions per loop). A 2-bit predictor requires two consecutive mispredictions to switch โ€” so it only mispredicts at loop exit (1 misprediction per loop). This nearly halves the misprediction rate for looping code.
Section H

Long Answer Questions (3 Questions)

๐Ÿ“ LA-1: Complete Pipeline Analysis (15 marks)

Question: Consider the following instruction sequence on a 5-stage pipeline (IF, ID, EX, MEM, WB):

I1: LOAD  R1, 100(R0)      ; R1 โ† Memory[R0 + 100]
I2: ADD   R2, R1, R3       ; R2 โ† R1 + R3
I3: SUB   R4, R2, R5       ; R4 โ† R2 โˆ’ R5
I4: BEQ   R4, R0, TARGET   ; Branch if R4 == R0
I5: AND   R6, R1, R7       ; R6 โ† R1 AND R7
I6: OR    R8, R2, R6       ; R8 โ† R2 OR R6

(a) Identify all data hazards (RAW, WAR, WAW). [5 marks]

(b) Draw the pipeline timing diagram without forwarding (show stalls). [4 marks]

(c) Redraw with full forwarding. How many stalls remain and why? [3 marks]

(d) If the branch at I4 is taken with a 2-cycle penalty, redraw the diagram. What is the total execution time? [3 marks]

(a) Data Hazards:

RAW: I1โ†’I2 (R1), I2โ†’I3 (R2), I3โ†’I4 (R4), I1โ†’I5 (R1), I2โ†’I6 (R2), I5โ†’I6 (R6). Total: 6 RAW hazards.

WAR: None (no later instruction writes a register that an earlier instruction is still reading in this in-order pipeline).

WAW: None (no two instructions write the same register).

(b) Without forwarding: I1โ†’I2 causes 2 stalls, I2โ†’I3 causes 2 stalls, I3โ†’I4 causes 2 stalls. Total = 6 stall cycles + 10 normal = 16 cycles.

(c) With forwarding: I1 is a LOAD โ€” load-use hazard with I2 requires 1 stall even with forwarding (data available after MEM, I2 needs it at EX). I2โ†’I3: ALU-ALU forwarding, 0 stalls. I3โ†’I4: ALU-ALU forwarding, 0 stalls. Total = 1 stall. The one remaining stall is because LOAD data isn't available until after MEM stage, but the dependent instruction needs it at the start of EX โ€” one cycle gap that forwarding can't bridge.

(d) With branch taken: Add 2-cycle branch penalty after I4 (I5 and I6 fetched speculatively are flushed, replaced by instructions from TARGET). Total cycles = 11 + 1 (load stall) + 2 (branch penalty) = 14 cycles (approx, depending on target instructions).

๐Ÿ“ LA-2: AWS Case Study โ€” Parallel Processing at Scale (15 marks)

Question: Amazon AWS processes 1 crore (10 million) HTTP requests per second during peak hours. Their infrastructure uses Intel Xeon processors (superscalar, 4-way out-of-order) in servers, NVIDIA GPUs for ML inference, and distributed clusters across availability zones.

(a) Classify each component using Flynn's taxonomy and justify. [4 marks]

(b) An individual web server has a workload that is 85% parallelisable across its 8 cores. Using Amdahl's Law, calculate the speedup. Would upgrading to 16 cores significantly improve performance? [4 marks]

(c) The Xeon processor has a 14-stage pipeline. Calculate the branch misprediction penalty if branches are 18% of instructions and the predictor is 92% accurate. What is the effective CPI? [4 marks]

(d) Explain why AWS uses distributed-memory architecture across data centres rather than one giant shared-memory server. Discuss at least 3 reasons. [3 marks]

(a) Intel Xeon: MIMD (multiple cores, each running different threads with different data). NVIDIA GPU for ML inference: SIMD/SIMT (same neural network operations on different input data batches). Distributed cluster: MIMD (each server handles different requests independently).

(b) 8 cores: S = 1/(0.15 + 0.85/8) = 1/(0.15 + 0.10625) = 1/0.25625 = 3.90ร—. 16 cores: S = 1/(0.15 + 0.85/16) = 1/(0.15 + 0.053125) = 1/0.203125 = 4.92ร—. Improvement from 8โ†’16: only 4.92/3.90 = 1.26ร— (26% improvement). Doubling cores gives diminishing returns โ€” the 15% serial code is the bottleneck. Not very cost-effective.

(c) Branch penalty per misprediction = 14 cycles (pipeline depth). CPI = 1 + 0.18 ร— 14 ร— 0.08 = 1 + 0.2016 = 1.2016. The 8% misprediction rate on a 14-stage pipeline causes a 20% performance loss.

(d) Three reasons: (1) Scalability โ€” shared memory bus can't scale to millions of servers; distributed memory scales linearly. (2) Fault tolerance โ€” if one data centre fails, others continue; shared memory = single point of failure. (3) Geographic latency โ€” servers close to users (Mumbai, Hyderabad) reduce response time; one location serves everyone slowly. Also: (4) Cost โ€” commodity servers are cheaper than one giant supercomputer.

๐Ÿ“ LA-3: Superscalar Processor Design (15 marks)

Question: You are tasked with designing a 4-way superscalar processor for a new Indian-made chip (like SHAKTI from IIT Madras).

(a) What hardware units would you include to support 4-way issue? Draw a block diagram. [4 marks]

(b) Compare in-order vs out-of-order execution for your design. Which would you choose for a general-purpose processor and why? [4 marks]

(c) If 35% of instruction groups (4-instruction bundles) have at least one dependency that reduces issue to 2 instructions, what is the effective IPC? [3 marks]

(d) How does your design compare to a VLIW approach? Discuss at least 4 trade-offs. [4 marks]

(a) Hardware needed: 4 ALUs, 2 FPUs, 2 Load/Store units, 4-wide fetch/decode, Reorder Buffer (ROB) for tracking in-flight instructions, Reservation Stations for dynamic scheduling, Branch Prediction Unit, Register Renaming logic (for eliminating WAR/WAW), Common Data Bus (CDB) for result broadcasting.

(b) Out-of-order (OoO) execution is better for general-purpose code. In-order stalls the entire pipeline when one instruction has a dependency; OoO can skip over blocked instructions and execute later independent ones. IIT Madras's SHAKTI C-class uses in-order for simplicity/power efficiency, while high-performance SHAKTI E-class targets OoO. For a general-purpose desktop/server chip, OoO provides significantly better IPC (typically 1.5โ€“2ร— better than in-order on general code).

(c) 65% of cycles issue 4 instructions, 35% issue 2. IPC = 0.65 ร— 4 + 0.35 ร— 2 = 2.6 + 0.7 = 3.3 effective IPC (out of maximum 4.0). Utilisation = 3.3/4 = 82.5%.

(d) Trade-offs: (1) Hardware complexity: Superscalar high, VLIW low. (2) Compiler complexity: Superscalar low, VLIW high. (3) Binary compatibility: Superscalar maintains backward compatibility, VLIW requires recompilation for new hardware. (4) Power efficiency: VLIW better (simpler hardware). (5) Code density: Superscalar compact instructions, VLIW has NOP bloat. (6) Real-world IPC: Superscalar higher on general code, VLIW competitive on predictable workloads.

Section I

Industry Spotlight โ€” A Day in the Life

๐Ÿ‘จโ€๐Ÿ’ป Karthik Reddy, 30 โ€” CPU Design Engineer at Arm India, Bengaluru

Background: B.Tech ECE from NIT Warangal. Joined Arm as a Graduate Engineer after campus placement. Now 7 years in, leads a team of 5 engineers working on the next-generation Arm Cortex-A series pipeline design.

A Typical Day:

8:30 AM โ€” Morning sync with the Austin, Texas team (12-hour time zone overlap). Review pipeline performance regression reports from overnight simulations.

9:30 AM โ€” Work on RTL (Register Transfer Level) code in SystemVerilog. Currently optimising the branch predictor for the next Cortex core. "We're trying to reduce misprediction rate from 4.2% to 3.8% โ€” sounds small, but on a 13-stage pipeline, each 0.1% improvement saves millions of wasted cycles per second."

11:00 AM โ€” Run pipeline simulations using Arm's internal toolchain. Analyse IPC numbers across SPEC CPU2017 benchmarks. "Today we discovered that our new forwarding path improves SPEC integer score by 2.3%."

1:00 PM โ€” Lunch at Arm's Bengaluru campus (Outer Ring Road). Chat with the verification team about corner-case bugs in the out-of-order engine.

2:30 PM โ€” Design review meeting. Present the modified Reorder Buffer (ROB) design that increases from 192 to 256 entries. Team debates power vs performance trade-offs.

4:00 PM โ€” Mentor two junior engineers on pipeline hazard analysis. "I tell them: if you understand the 5-stage pipeline from your B.Tech COA course, you understand the foundation. Our 13-stage Cortex pipeline is the same concept, just deeper and wider."

5:30 PM โ€” Submit RTL changes for overnight regression testing. Write documentation for the new forwarding logic.

DetailInfo
Tools Used DailySystemVerilog, Arm cycle-accurate simulators, SPEC benchmarks, VCS (Synopsys), Python (scripting), Git
Entry Salary (2024)โ‚น12โ€“18 LPA + stock options
Mid-Level (5โ€“7 yrs)โ‚น25โ€“40 LPA
Senior/Lead (10+ yrs)โ‚น50โ€“80 LPA
Companies HiringArm India, Intel India, AMD India, Qualcomm India, Samsung Semiconductor, NVIDIA India, MediaTek, Texas Instruments India, IIT Madras SHAKTI, CDAC
Karthik's advice for students: "Master your COA fundamentals โ€” pipelining, hazards, Amdahl's Law. In my interview at Arm, I was asked to draw a 5-stage pipeline, identify hazards, and propose forwarding paths. It was exactly what I learned in my B.Tech Unit 8. The foundation doesn't change โ€” we just scale it up."
Section J

Earn With It โ€” Career & Income Roadmap

๐Ÿ’ฐ Your Earning Path After This Chapter

This chapter's knowledge unlocks three earning streams:

โ€ข GATE Coaching: Pipeline speedup, Amdahl's Law, and Flynn's classification appear in GATE CS/EC every year (~2 marks). Master these โ†’ score more โ†’ better IIT/NIT M.Tech admission โ†’ โ‚น8โ€“15 LPA placements.

โ€ข HPC Consulting/Freelancing: Understanding parallel processing lets you optimise software for multi-core systems. Companies pay โ‚น50,000โ€“โ‚น2,00,000 for performance optimisation projects.

โ€ข CPU/VLSI Design Jobs: Arm India, Intel India, AMD, Qualcomm โ€” all hiring in Bengaluru, Hyderabad, and Noida. Entry salary โ‚น12โ€“25 LPA for B.Tech/M.Tech in ECE/CSE.

Earning PathWhat You NeedPotential Earnings
GATE CS/EC CoachingStrong grasp of pipeline numericals + practiceTop 500 rank โ†’ M.Tech IIT โ†’ โ‚น15โ€“30 LPA packages
Online Tutoring (COA)Explain pipeline concepts clearly on YouTube/Unacademyโ‚น5,000โ€“โ‚น25,000/month from tutoring
HPC InternshipPython + understanding of parallelism + OpenMP/MPI basicsโ‚น15,000โ€“โ‚น40,000/month (C-DAC, ISRO, IISc internships)
VLSI Design EntryB.Tech ECE/CSE + pipeline design knowledge + Verilog basicsโ‚น12โ€“18 LPA (Arm, Intel, Qualcomm campus placements)
Performance Optimisation FreelanceMulti-threading, profiling, parallel algorithm designโ‚น50,000โ€“โ‚น2,00,000/project
Immediate action: Create a GitHub repository called "parallel-processing-notes" with your pipeline calculator Python code, solved GATE numericals as markdown files, and pipeline diagrams as images. This becomes a portfolio piece for both GATE preparation and job applications.
Section K

Chapter Summary

๐Ÿง  Key Takeaways โ€” Unit 8: Parallel Processing

1. Pipelining overlaps instruction execution across 5 stages (IFโ†’IDโ†’EXโ†’MEMโ†’WB). Speedup = nk/(k+nโˆ’1). Ideal speedup approaches k (number of stages) as n โ†’ โˆž.

2. Pipeline Hazards prevent ideal performance: Structural (resource conflict โ†’ Harvard architecture), Data (RAW/WAR/WAW โ†’ forwarding/stalling), Control (branch misprediction โ†’ branch prediction).

3. Flynn's Classification categorises architectures: SISD (classic CPU), SIMD (GPU/vector), MISD (rare), MIMD (multi-core/clusters). Modern systems are predominantly MIMD.

4. Vector/Array Processors operate on entire arrays with single instructions. GPUs are modern massive SIMD processors with thousands of cores.

5. Multiprocessor Organisation: Shared memory (easy programming, limited scalability) vs Distributed memory (hard programming, massive scalability). Modern systems use NUMA as a hybrid.

6. Superscalar vs VLIW: Superscalar uses hardware scheduling (dominant in general-purpose). VLIW uses compiler scheduling (niche in embedded/DSP).

7. Amdahl's Law: Speedup = 1/((1โˆ’f) + f/p). Maximum speedup = 1/(1โˆ’f). The serial fraction is the ultimate bottleneck โ€” optimise it first!

8. Real-world connection: AWS's 1 crore requests/sec, India's PARAM supercomputers, Arm Bengaluru's pipeline design โ€” all built on these fundamentals.

Quick Formula Sheet
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚  Pipeline Speedup    = nk / (k + n โˆ’ 1)                      โ”‚
  โ”‚  Pipeline Efficiency = n / (k + n โˆ’ 1) = Speedup / k         โ”‚
  โ”‚  Effective CPI       = 1 + ฮฃ(freq ร— penalty ร— miss_rate)     โ”‚
  โ”‚  Amdahl's Speedup    = 1 / ((1โˆ’f) + (f/p))                   โ”‚
  โ”‚  Amdahl's Max        = 1 / (1โˆ’f)        [when p โ†’ โˆž]         โ”‚
  โ”‚  Throughput           = n / Total_cycles  [instructions/cycle] โ”‚
  โ”‚  IPC (superscalar)   = Instructions / Cycles                  โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
Section L

Earning Checkpoint

Skill LearnedTool / MethodDeliverableEarning Ready?
Pipeline ConceptsPen-and-paper diagramsSpace-time diagrams for 7 instructionsโœ… Yes โ€” can tutor juniors / GATE aspirants
Speedup CalculationsFormula + Python calculatorWorked numericals + Python scriptโœ… Yes โ€” GATE exam + portfolio piece
Pipeline HazardsHazard detection + forwardingAnnotated hazard analysisโœ… Yes โ€” interview-ready COA knowledge
Flynn's ClassificationConceptual + diagramsClassification chart with examplesโœ… Yes โ€” GATE + interview prep
Amdahl's LawFormula + PythonSpeedup table for various f and p valuesโœ… Yes โ€” HPC consulting foundation
Superscalar/VLIWComparison analysisComparison tableโœ… Yes โ€” CPU design interview prep
Python Pipeline ToolPython programmingGitHub repository with calculatorโœ… Yes โ€” portfolio for internship applications
Minimum Viable Earning Setup after this chapter: A GATE-ready understanding of pipeline numericals (guaranteed 2 marks in GATE CS/EC every year) + a Python portfolio project on GitHub + ability to tutor COA to juniors (โ‚น500โ€“โ‚น1,000/hour on Chegg/Doubtnut). Start now!

โœ… Unit 8 complete. You now understand how modern CPUs achieve billions of operations per second!

[QR: Link to EduArtha video tutorial โ€” Parallel Processing & Pipelining]