### **Green Computing Platforms**

## Step1. CA0602+0604: Pipeline, Superscalar and VLIW

http://archlab.naist.jp/Lectures/ARCH/ca0602\_0604/ca060204e.pdf

Copyright © 2024 NAIST Y.Nakashima

### **Download the template and submit through UNIPA.**

http://archlab.naist.jp/Lectures/ARCH/ca0602/ca0602e.docx

in <a href="http://archlab.naist.jp/Lectures">http://archlab.naist.jp/Lectures</a>

## **Modern Computing Platforms**

3 SINGLE **5stage pipeline CPU** for education **CC-NUMA** Manycore MIMD/SIMD Coalescing **GPGPU CPU** based High-speed Compiler Long Vector Main memory 4way Ring+Cache CGLA gen2 Coarse/CISC CGLA gen3 I way Ring+Cache Programmable **Cache Memory** PiM Coarse/RISC Main memory **CGRA FPGA** based Mult+Accumulate Systolic Array Low-speed P&R **Fine Grained BRAM+DDR FPGA Special HW** No flexibility Analog Memory Cell CiM

#### (1) Parallelization in CPU core (single program counter)(1960-1970)

| 1966 (TI ASC) Pipelining          | 0    | 0    | 0    | 0      | 0    | 0        | 0    | 0     | 0      | 0         | 0    | 0              | 0    |                                         |
|-----------------------------------|------|------|------|--------|------|----------|------|-------|--------|-----------|------|----------------|------|-----------------------------------------|
| 1964 (CDC6600) Super Scalar       |      |      |      | 0      |      | 0        |      |       | 0      |           |      |                |      | <b>OHM</b><br>大学テキスト                    |
| 1965 (ILLIAC IV) Vector Processor | 0    | 0    |      |        | 0    |          | 0    |       |        | 0         |      |                |      |                                         |
| 1976 (QA1) VLIW                   | 0    | 0    | 0    | 0      | 0    |          |      | 0     |        |           | 0    |                |      | コンピークアーナニクチャ                            |
| 1982 (HEP) Multithreading         | 0    |      |      | 0      | 0    | 0        | 0    | 0     |        |           |      | 0              |      | コンピュータアーキテクチャ<br><sub>中島康彦</sub> — [編者] |
| 1982 (CMU) Systolic Array         | 0    | 0    | 0    |        |      |          |      |       |        |           |      |                | 0    |                                         |
|                                   | 2016 | 2012 | 2008 | 2006   | 1992 | 2002     | 2000 | 1997  | 1964   | 1965      | 1976 | 1982           | 1982 |                                         |
|                                   |      |      | Γ    |        |      |          | _    | -     | -      | <         | -    |                | 0    |                                         |
|                                   | IMAX | EMAX | APP  | OROCHI | VPP  | Hyper    | GPU  | DAISY | Super  | ector     | LW   | ıltitl         | GRA  |                                         |
|                                   |      |      |      | Ĭ      |      |          |      |       | Sca    |           |      | hrea           |      |                                         |
|                                   |      |      |      |        |      | hreading |      |       | Scalar | Processor |      | Multithreading |      | <b>●</b> Ħ <b>M</b>                     |
|                                   |      |      |      |        |      | ling     |      |       |        | SSO       |      | UQ.            |      | Öhmsha                                  |
|                                   |      |      |      |        |      |          |      |       |        | 7         |      |                |      |                                         |

#### 20220401 **5**

# **History of high-speed digital computers**





(3) Parallelization in system (Multiple memory space) Distributed system (message) Large scale system





From instruction fetch to storing result

# Systematic management of ALU, Register and Memory



h

#### ► 1. Fetch

Fetch an instruction from PC addr. of main memory

2. Decode

Decode the instruction and read data from register

Prepare control signals for following stages

3. Execute

Execute arithmetic, logical, shift or other operation

For load/store instruction, calculate effective address (add or sub)

► 4. Memory

For load instruction, read data from main memory to register For store instruction, write data from register to main memory

Other instruction, do nothing

► 5. Write

Store result of execution or load data into register



► 1. Fetch

Supplying addr. to instruction cache to fetch an instruction Read latency of instruction cache

2. Decode

**Extracting register numbers form the instruction** 

Then supplying them to register file to get resister contents Read latency of register file

► 3. Execute

Supplying input value to getting output through LogN step gates Latency of LogN step gates

4. Cache

For load instruction, supplying addr. to getting data

For store instruction, supplying addr. to writing data

**Read/write latency of data cache** 

▶ 5. Write

Supplying register number to storing result of execution or load Write latency of register file



# Just do it sequentially ... like ancient 8bit CPU (no cache)



### Fast execution by stage parallel execution

- Since temporal storage (pipeline registers) is required, Improvement is smaller than the number of stages
- Even slower, if execution time of each stage is not balanced

|       |       |        |         |         |         |         |         |       |       | Т |
|-------|-------|--------|---------|---------|---------|---------|---------|-------|-------|---|
| inst1 | Fetch | Decode | Execute | Cache   | Write   |         |         |       |       |   |
|       | inst2 | Fetch  | Decode  | Execute | Cache   | Write   |         |       |       |   |
|       |       | inst3  | Fetch   | Decode  | Execute | Cache   | Write   |       |       |   |
|       |       |        | inst4   | Fetch   | Decode  | Execute | Cache   | Write |       |   |
|       |       |        |         | inst5   | Fetch   | Decode  | Execute | Cache | Write |   |



**Cache latency is a guideline of "stage latency"** 

### Level 1 cache (small and fast)

Ex. 64byte width x 256 (16K byte)

For simple CPU, cache speed roughly determines total performance

### Register file (smaller and fast)

Ex. 32bit width x 32 (128 byte)

Fast enough but double speed operation is required (discuss later)

### LogN stage logic gates

Ex. 32bit adder (thousands of transistors)

Fast enough but write latency is not negligible due to its long path (discuss 3<sup>rd</sup> day)

Other basic components (Longer latency than pipeline stage latency)

- Level 2 cache (middle capacity and middle speed)
   > 2MB
- Main memory (large but small)
   > 2GB





When CK=0, input value is propagated to 1<sup>st</sup> stage, output is unchanged



When CK=1, input value is propagated to output





#### **Operation of master slave latch**

When CK个, each pipeline stage starts operation

- If result of previous stage is arrived, propagation is blocked by latch
- If result is not arrived when CK个, propagation is failed Therefore, to supply higher frequency clock without latency improvement causes malfunction

Clock skew is also common reason of malfunction





#### Locate pipeline registers between stages

#### **Clock and pipeline register**

By connecting output to input of previous stage, information can be hold Ex. Use +4 logic as combinational logic to make program counter





#### Source code

```
int R1[100];
int R2;
R1[1] = R1[1]+8;
R1[R2]= R1[1]-R1[R2];
```

```
/* reg r1: top address of array R1[] */
/* R2: index for array R1[] */
/* add 8 to second element of R1[] */
/* reg r2: R2 */
```

#### Instructions

| ld  | r1,4,r4     | r4 ← mem[r1+4]      |
|-----|-------------|---------------------|
| add | r4,8,r8     | r8 ← r4 + 8         |
| st  | r1,4,r8     | Mem[r1+4] ← r8      |
| ld  | r1,r2<<2,r5 | r5 ← mem[r1+r2*4]   |
| sub | r8,r5,r9    | r9 ← r8 <b>–</b> r5 |
| st  | r1,r2<<2,r9 | Mem[r1+r2*4] ← r9   |

#### **Computer repeats Decode, Read, Execution and Write-back.**



20220401

Read PC addr. of Main memory and write result to instruction buffer





# Decode instruction then prepare control signals for ALU, input data, output register number





Select one of N registers and Write or Read it

- For write, decode logN bit then set one of N WriteSel&CK to 0
- For read, decode logN bit then set one of N Read sel to 1





#### **Overall structure of register file**

For every cycle execution, following function is required for register file

- Read from either same or different two registers (two read port)
- Write to one register (one write port)

At the end of first half cycle, Wdata is available at Rdata At the end of second half cycle, Rdata is written in pipeline register



### Antiphase operation of pipeline register and register file

If register file is composed of latch, half cycle READ/WRITE are possible

- Write in first half cycle and read in second half cycle
- Register bypassing (explain later) can be omitted

If register file is composed of memory, half cycle operations are not possible

Register bypassing (explain later) is required





#### **Execute**

Do operation that is specified by control signal, result is stored temporally

Do address calculation for load/store instructions







#### **Cache – Read operand**

Access memory with calculated address and store result temporally

- If virtual address cache, TLB is in this stage
- If store buffer is implemented, it is also in this stage





Write result to register file

- Note that there is register file in Decode stage
- If it is composed of latch, half cycle READ/WRITE is possible
- If it is composed of memory, half cycle operation is not possible





#### Simply connect them together



#### **Connect them correctly (Decode and Write are in one cycle)**



#### Load then add (Id-use penalty if dependency is exists)



When load is done

When execution of add is done

### Add then Store (no penalty by bypassing)



When execution of add is done

When add is completed



# **Inefficiency of fake high-frequency**

| 命令1 | Fetch | Decode | Execute | Cache   | Write |       |
|-----|-------|--------|---------|---------|-------|-------|
|     | 命令2   | Fetch  | Decode  | Execute | Cache | Write |

#### **High-speed device for true speed-up (x2 frequency)**

| 命令1 | Fetch | Decode | Execute | Cache   | Write |       |
|-----|-------|--------|---------|---------|-------|-------|
|     | 命令2   | Fetch  | Decode  | Execute | Cache | Write |



#### Super pipelining for fake speed-up (x2 frequency)

| 命令1 | Fetch | Fetch | Decode | Decode | Execute | Execute | Cache   | Cache | Write | Write |       |
|-----|-------|-------|--------|--------|---------|---------|---------|-------|-------|-------|-------|
|     | 命令 2  | Fetch | Fetch  | Decode | Decode  | Execute | Execute | Cache | Cache | Write | Write |

#### Fake speed-up (x4 frequency)

4



#### **Dependent instruction is delayed (same speed)**



#### Branch instruction is also delayed (very slow)



# Importance of gathering instructions

### High-frequency is inefficient. (Power $\propto$ F^3)

- Employing parallelism is most important
- First superscalar executes neighbor two instructions

### Modern superscalars gather from large window.



#### Accurate branch prediction is required





Three types of dependency in programs

- Flow dependency (write ⇒ read) Succeeding insn refers the result of preceding insn add r1,r2 → r8 sub r8,r3 → r4 Fundamental data dependency is hard to resolve
- Anti-dependency (read ⇒ write) Succeeding insn should wait for preceding read
  - sub  $r_8, r_3 \rightarrow r_4$

add r5,r6  $\rightarrow$  r8 Register renaming can remove dependency

# Output dependency (write⇒ write)

Succeeding insn should wait for preceding write

```
add r1,r2 \rightarrow r8
sub r8,r3 \rightarrow r4
add r5,r6 \rightarrow r8
sub r8,r3 \rightarrow r7
Register renaming can remove dependency
```



# **Basic idea to remove dependency**

- The usage of registers are defined by ABI (application binary interface), then compilers don't have enough registers.
- Register number is just representing data dependency. No need to assign physical register as specified.
- If a new destination register is assigned to every insn, anti-dependency and output dependency are eliminated.
  - Specified register number: Architectural register
  - Real register number: Physical register



# **Register renaming**

### Previous example: no parallelism

add  $r1, r2 \rightarrow r8$ sub  $r8, r3 \rightarrow r4$ add  $r5, r6 \rightarrow r8$ sub  $r8, r3 \rightarrow r7$ 

#### **Renamed destination registers**

add  $r1, r2 \rightarrow p0(r8)$ sub  $r8, r3 \rightarrow p1(r4)$ add  $r5, r6 \rightarrow p2(r8)$ sub  $r8, r3 \rightarrow p3(r7)$ 

#### **Remapped source registers**

add r1 , r2  $\rightarrow$  p0 sub p0(r8), r3  $\rightarrow$  p1 add r5 , r6  $\rightarrow$  p2 sub p2(r8), r3  $\rightarrow$  p3

#### 2-instructions/cycle doubles the performance

```
add r1,r2 \rightarrow p0 \succeq add r5,r6 \rightarrow p2
sub p0,r3 \rightarrow p1 \succeq sub p2,r3 \rightarrow p3
```



When each physical register is released ?

Physical register seems permanently blocked, because future consumer is unknown.

After correct branch-direction is determined and preceding insns are finished, architecture register is confirmed.

Then corresponding physical register is released.

Physical register scheme: uses unified register space
 Map for "arch-reg⇒phys-reg" is maintained
 Reorder buffer scheme: uses architectural register space



# **Physical register scheme**

#### Before new branch, snapshot of reg-map is required.



# **Reorder buffer scheme**

#### Destination register is allocated in reorder buffer.







### **Instruction window**

#### It is easy to issue neighbor instructions. But parallelism is limited.



## Keep many instructions in rename-retire window, and select as many as possible.





## Where instructions are enqueued

Reservation station scheme: just before execution stage All operands should be read before enqueued.



Centralized instruction window scheme: just before regread stage



# For back-to-back execution, the result should be forwarded to next execution every cycle.





## **Centralized instruction window scheme**

#### After issued, registers are read. "Select & read" stage is critical.







## **VLIW (very long instruction word)**

Compiler statically schedules instructions in long-format.

- Binary code is dedicated to specific parallel architecture.
- Simple hardware can reduce power consumption.



🖗 NAIST.

### **Example of scheduling**





## **Packing of instructions**

| 21 | :ld     | Z(K+10)    | fmul | Y(K )*i>m  |      |             |
|----|---------|------------|------|------------|------|-------------|
| 2  | ld      | Z(K+11)    | fmul | Y(K+1)*j>n |      |             |
| 2  | ld      | Z(K+12)    | fmul | Y(K+2)*k>0 |      |             |
| 2  | ld      | Z(K+13)    | fmul | Y(K+3)*1>p | -    |             |
| 2  | ld      | Z(K+14)    | fadd | Q+m>X(K)   | fmul | R*Z(K+10)>a |
| 3  | add     | Z()+4>Z()  | fadd | Q+n>X(K+1) | fmul | R*Z(K+11)>b |
| 3  | add     | Y()+4>Y()  | fadd | Q+o>X(K+2) | fmul | R*Z(K+12)>c |
| 2  | ld      | Y(K+3)     | fadd | Q+p>X(K+3) | fmul | R*Z(K+13)>d |
| 2  | ALL NOW |            | st   | X(K )      | fmul | T*Z(K+11)>e |
| 2  |         |            | st   | X(K+1)     | fmul | T*Z(K+12)>f |
| 2  |         |            | st   | X(K+2)     | fmul | T*Z(K+13)>g |
| 2  |         |            | st   | X(K+3)     | fmul | T*Z(K+14)>h |
| 4  | add     | X()+4>X()  | ld   | Y(K )      | fadd | a+e>i       |
| 4  | cmp     | K,400      | ld   | Y(K+1)     | fadd | b+f>j       |
| 4  |         |            | ld   | Y(K+2)     | fadd | c+g>k       |
| E  | brc     | <u>l,L</u> |      |            | fadd | d+h>l       |

VLIW: 380perations, 128bytes Superscalar: 380perations, 152bytes



Wide-issue VLIW requires global scheduling beyond basic blocks.

- Software pipelining
- Loop unrolling
- Trace scheduling
- Percolation scheduling



## **Software pipelining**

Scheduling for eliminating bubbles in pipeline

#### Consider latency of each instruction.

ldr1, 4, r8addr  $(r1+4) \Rightarrow r8$ 1cycle bubbleaddaddr8, 8, r10addstr1, 4, r10 $r10 \Rightarrow addr (r1+4)$ ldr1, r5 << 2, r11addr  $(r1+r5*4) \Rightarrow r11$ 1cycle bubblesub r10, r11, r12subtractstr1, r5 << 2, r12 $r12 \Rightarrow addr (r1+r5*4)$ 

#### Rescheduling can reduce execution time.

ld r1,4,r8
ld r1,r5<<2,r11
add r8,8,r10
sub r10,r11,r12
st r1,4,r10
st r1,r5<<2,r12</pre>



## Loop unrolling

addr (r1+r5\*4)  $\Rightarrow$  r8 loop: ld r1,r5<<2,r8 **1cycle bubble** floating add fadd r8,r9,r10 **2cycles bubble** r1,r5<<2,r10  $r8 \Rightarrow addr (r1+r5*4)$ st add r5,1,r5 update index comp r5,30repeat 30 times ? bl loop

#### N-unrolling needs many registers but can increase speed. (case N=3)

```
loop: ld r1,r5<<2,r8
                          assume r^2 = r^{1+4}
         r2,r5<<2,r12
      ld
      ld
           r3,r5<<2,r16
                          assume r_3 = r_{1+8}
      fadd r8,r9,r10
      fadd r12, r9, r13
      fadd r16,r9,r17
      st r1,r5<<2,r10
      st r2,r5<<2,r13
         r3,r5<<2,r17
      st
      add
          r5,3,r5
      comp r5,30
     bl
           loop
```



## **Trace scheduling**

- Get frequently executed path by trial-execution.
- Schedule instructions in main-path beyond basic-block.

Code size often explodes with many basic-blocks.





### **Percolation scheduling**

#### Suppress code size explosion to same extent.





- Superscalar has complicated hardware for speculation.
- If compiler does such aggressive scheduling as superscalar, hardware can be simplified.
- However, compiler cannot take risk to crash user programs by illegal memory access.
- Compiler needs special instructions that can suppress illegal memory access.



### **Speculation by compiler**

## "Non-faulting load/execution" and "checking store" instructions for VPP5000 supercomputer

| Source program | Normal insn    | With speculation support       |  |  |  |
|----------------|----------------|--------------------------------|--|--|--|
|                |                | dld a -> r1 (speculation)      |  |  |  |
|                |                | dld b -> r2 (speculation)      |  |  |  |
|                |                | dadd r1+r2 -> r3 (speculation) |  |  |  |
| if (cond) {    | branch on cond | branch on cond                 |  |  |  |
| *x = *a + *b;  | ld a           | cst r3 -> x (check)            |  |  |  |
|                | ld b           |                                |  |  |  |
|                | add a+b        |                                |  |  |  |
| }              | st x           |                                |  |  |  |

#### "Non-faulting load/check" instructions for IA64 (intel)

```
ld.s a -> r1 (speculation)
branch on cond
chk.s r1,retry
ret: ...
retry: ld a -> r1 (retry) )
branch ret
```



# Compiler is conservative if no architectural support for speculation.

Architects should consider special hardware to support compilers to make hardware more simple.



