# CS 433 Mini-Project: ARM Cortex-A78

Nandeeka Nayak, Jovan Stojkovic, Antonis Psistakis

### Microarchitecture Overview



### Pipeline Overview



### Instruction Fetch: Branch Prediction

Hardware includes:

- A branch direction predictor using previous branch history
- A static branch predictor
- Branch Target Buffer (BTB)
- The return stack, a stack of nested subroutine return addresses
- An indirect branch predictor
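
As a rough illustration of the return stack listed above, the following toy model pushes a return address on each call and pops it to predict the target of the matching return. The class name `ReturnStack` and the `depth` parameter are hypothetical; the slides do not state the actual depth or overflow policy of the A78 return stack.

```python
from collections import deque


class ReturnStack:
    """Toy return-address stack. `depth` is a hypothetical parameter."""

    def __init__(self, depth=16):
        # A deque with maxlen drops the oldest entry when it overflows.
        self._entries = deque(maxlen=depth)

    def on_call(self, return_address):
        # A call pushes the address of the instruction after the call.
        self._entries.append(return_address)

    def predict_return(self):
        # A return pops the most recent entry; None means "no prediction".
        return self._entries.pop() if self._entries else None


rs = ReturnStack(depth=2)
rs.on_call(0x1004)             # outer subroutine call
rs.on_call(0x2008)             # nested subroutine call
inner = rs.predict_return()    # predicted target of the nested return
outer = rs.predict_return()    # predicted target of the outer return
empty = rs.predict_return()    # stack exhausted
```

Nested calls unwind in last-in, first-out order, which is why a small stack predicts subroutine returns well.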

### Instruction Decode and Dispatch



# Memory Hierarchy (1/3)

### Macro-OP \$:

- 1.5K entries
- 4-way Skewed Associative
- Virt. Indexed Virt. Tagged (VIVT) behaves as Phys. Indexed Phys. Tagged (PIPT)



# Memory Hierarchy, Cont'd (2/3)

### L1 TLB:

- separate for Data & Instr.
- Fully Associative
- 32 entries
- Access: typically 1 cycle

### L1 \$:

- separate for Data & Instr.
- 4-way Set Associative
- Cache line: 64 Bytes
- Size: 32/64 KB (configurable)
- Virt. Ind. Phys. Tag. (VIPT) behaves as PIPT
- Pseudo-LRU
- MESI (cache coherence)
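
The "VIPT behaves as PIPT" point can be checked with a little arithmetic. The sketch below derives the set count and index bits from the L1 parameters above and counts how many index bits fall outside the page offset. The 4 KB page size and the function name `l1_geometry` are assumptions, not figures from the slides.

```python
def l1_geometry(size_bytes, ways=4, line_bytes=64, page_bytes=4096):
    """Derive set/index geometry for a set-associative cache.

    page_bytes=4096 is an assumption; other page sizes change alias_bits.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = (line_bytes - 1).bit_length()       # bits selecting a byte in the line
    index_bits = (sets - 1).bit_length()              # bits selecting the set
    page_offset_bits = (page_bytes - 1).bit_length()  # untranslated address bits
    # Index bits above the page offset come from the virtual page number.
    alias_bits = max(0, offset_bits + index_bits - page_offset_bits)
    return {"sets": sets, "index_bits": index_bits, "alias_bits": alias_bits}


geo_32k = l1_geometry(32 * 1024)
geo_64k = l1_geometry(64 * 1024)
```

With these assumptions, one index bit (32 KB) or two index bits (64 KB) are virtual, so two virtual aliases of one physical line could land in different sets. The hardware must resolve such aliases for the cache to behave as PIPT from software's point of view.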



# Memory Hierarchy, Cont'd (3/3)

### L2 TLB:

- same for Data & Instr.
- 4-way Set Associative
- 1024 entries
- Access: typically 3 cycles
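
To put the TLB sizes in perspective, the sketch below computes TLB reach, meaning the amount of memory covered without a page-table walk. It assumes 4 KB pages, which the slides do not state; `tlb_reach_bytes` is a hypothetical helper name.

```python
def tlb_reach_bytes(entries, page_bytes=4096):
    """Memory covered by a TLB, assuming every entry maps one page."""
    return entries * page_bytes


l1_reach = tlb_reach_bytes(32)     # 32-entry L1 TLB
l2_reach = tlb_reach_bytes(1024)   # 1024-entry L2 TLB
```

Under this assumption the L1 TLB covers 128 KiB and the L2 TLB covers 4 MiB. Larger page sizes would raise both figures proportionally.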

### L2 \$:

- same for Data & Instr.
- 8-way Set Associative
- Cache line: 64 Bytes
- Size: 256/512 KB (configurable)
- Inclusive (strictly: L1D, weakly: L1I)
- MESI (cache coherence)



# big.LITTLE

Heterogeneous processing architecture that uses two types of processors:

- "LITTLE" processors -> maximum power efficiency
  - texting, email, audio
  - Cortex-A53, Cortex-A55
- "big" processors -> maximum compute performance
  - mobile gaming and browsing
  - Cortex-A73, Cortex-A75, Cortex-A76, Cortex-A77, Cortex-A78





### DynamIQ cluster

#### Cluster microarchitecture

- One or more cores
- DSU

#### DynamIQ Shared Unit (DSU)

- L3 memory system
- Control logic
- External Interfaces

#### Two configurations

- A set of cores having the same microarchitecture
- Two sets of cores, where each set has a different microarchitecture



### DSU Components

#### L3 cache

- 2MB to 4MB
- 16-way set associative
- 64-byte cache line

#### SCU (Snoop Control Unit)

- Maintains coherency

#### Power state control

- External power controller

#### DSU system control registers

- Set of registers accessible from any core in the cluster

#### ACP slave

- Accelerator Coherency Port



### Cortex-A78 with DynamIQ

- Up to eight cores
- Up to four Cortex-A78s may be clustered together
- The cluster may also include up to four additional little cores such as the Cortex-A55 in a big.LITTLE configuration
- One or more of the A78 cores may be swapped out for a Cortex-X1 core in order to achieve even higher performance
- Compared to a quad-core A77 cluster on 7 nm, a quad-core A78 cluster on 5 nm provides +20% sustained performance improvement while reducing the silicon area by about 15%.

### Use cases

- AR, XR
- ML, AI
- Edge Computing
- AAA Gaming as an exciting use-case
- A78 CPU + Mali-G78 GPU
  - high-fidelity gaming experiences
  - longer battery life on smartphones for extended and enhanced 'all-day-play'
  - Samsung Exynos 1080



### Power Management

- **Dynamic**: clock gating, DVFS
- **Static**: dynamic retention, powerdown
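
As a rough illustration of why DVFS is effective, the standard first-order CMOS model for dynamic power can be used. This is a textbook approximation, not a figure from the A78 documentation, and the symbols below are introduced only for this sketch:

$$P_{\mathrm{dyn}} = \alpha \, C \, V_{dd}^{2} \, f$$

Here $\alpha$ is the switching activity factor, $C$ the switched capacitance, $V_{dd}$ the supply voltage, and $f$ the clock frequency. Under this model, scaling both voltage and frequency to 80% of their nominal values gives

$$\frac{P'_{\mathrm{dyn}}}{P_{\mathrm{dyn}}} = 0.8^{2} \times 0.8 = 0.512,$$

so dynamic power roughly halves for a 20% frequency reduction. Clock gating, by contrast, lowers $\alpha$ for idle logic without changing $V_{dd}$ or $f$.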

#### **Voltage Domains**

#### **Power Domains**





# Energy Efficiency

Various components optimized:

- Branch prediction
  - Supports 2 taken branches per cycle
  - Increased accuracy for conditional branches
- Mid-core and pipeline
  - More instruction fusions
  - ROB: more instructions per unit area
- Memory sub-system
  - Extra Address Generation Unit (AGU): +50% load bandwidth <sup>(1)</sup>
  - Improved data prefetching: memory area coverage, accuracy and timeliness

Source:

1) Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence, anandtech.com

### References

1. <u>Arm Cortex-A78 Core Technical Reference Manual</u>, ARM.
2. <u>Cortex-A78 - Microarchitectures - ARM</u>, WikiChip.
3. <u>Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence</u>, anandtech.com
4. <u>Arm Unveils the Cortex-A78: When Less Is More</u>, WikiChip.
5. <u>Sustained performance through Arm Cortex-A78 CPU</u>, Processors blog, ARM.
6. <u>Arm Cortex-A78 Core Software Optimization Guide</u>, ARM.
7. <u>ARM DynamIQ Shared Unit Technical Reference Manual</u>, ARM.
8. Seznec A., "<u>A Case for Two-Way Skewed-Associative Caches</u>", ISCA 1993.
9. Mutlu O., Comp. Arch., "<u>High Performance Caches</u>", CMU, Spring 2015.
10. Al-Zoubi H. et al, "<u>Performance evaluation of cache replacement policies for the SPEC CPU2000 benchmark suite</u>", ACM-SE 2004.
11. <u>Arm's New Cortex-A77 CPU Micro-architecture: Evolving Performance</u>, anandtech.com
12. <u>Cortex-A15 Technical Reference Manual</u>, ARM.

# Thank you!

### **Backup Slides**

### Cortex-A78 core operations

| Instruction groups             | Instructions                                                                                                                                     |
|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| Branch 0/1                     | Branch µOPs                                                                                                                                      |
| Integer Single-Cycle 0/1       | Integer ALU µOPs                                                                                                                                 |
| Integer Single/Multi-cycle 0/1 | Integer shift-ALU, multiply, divide, CRC and sum-of-absolute-differences µOPs                                                                    |
| Load/Store 0/1                 | Load, Store address generation and special memory µOPs                                                                                           |
| Load 2                         | Load µOPs                                                                                                                                        |
| Store data 0/1                 | Integer store data µOPs                                                                                                                          |
| FP/ASIMD-0                     | ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide, FP sqrt, crypto µOPs, Vector store data µOPs |
| FP/ASIMD-1                     | ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, ASIMD shift µOPs, Vector store data µOPs, crypto µOPs                                       |

### Dispatch Constraints

| Execution Unit                   | # of µOps |
|----------------------------------|-----------|
| Store data (all) / Branch (all)  | 4         |
| Integer Single/Multi-cycle (all) | 4         |
| Integer Single/Multi-cycle 0     | 2         |
| FP/ASIMD-0                       | 2         |
| FP/ASIMD-1                       | 2         |
| Load (all)                       | 6         |
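
The table above can be read as a set of per-cycle budgets. The sketch below encodes it as a dictionary and checks whether a proposed group of µOps fits in one dispatch cycle. The category keys and the function `fits_dispatch` are hypothetical, and treating the first row as one shared budget for store-data and branch µOps is an assumption about how to read the slide.

```python
# Per-cycle dispatch limits copied from the table above (keys are hypothetical).
DISPATCH_LIMITS = {
    "store_data_or_branch": 4,  # assumption: one shared budget for both kinds
    "int_multi_all": 4,
    "int_multi_0": 2,
    "fp_asimd_0": 2,
    "fp_asimd_1": 2,
    "load_all": 6,
}


def fits_dispatch(counts):
    """Return True if the requested µOp counts respect every limit.

    `counts` maps a category key to the number of µOps requested this cycle.
    A µOp bound for Integer Single/Multi-cycle 0 must be counted under both
    "int_multi_0" and "int_multi_all", since the latter covers all such pipes.
    """
    unknown = set(counts) - set(DISPATCH_LIMITS)
    if unknown:
        raise KeyError(f"unknown categories: {sorted(unknown)}")
    return all(counts.get(key, 0) <= limit
               for key, limit in DISPATCH_LIMITS.items())


ok = fits_dispatch({"load_all": 6, "fp_asimd_0": 2})
too_many = fits_dispatch({"int_multi_0": 3, "int_multi_all": 3})
```

The second group is rejected because three µOps target the Integer Single/Multi-cycle 0 pipe, even though the combined multi-cycle budget of four is not exceeded.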

### SIMD Instructions With NEON



# NEON register set

| S0-S31<br>VFP only | D0-D15<br>VFPv2 or<br>VFPv3-D16 | D0-D31<br>VFPv3-D32 or<br>Advanced SIMD | Q0-Q15<br>Advanced<br>SIMD only |
|--------------------|---------------------------------|-----------------------------------------|---------------------------------|
| S0, S1             | D0                              | D0                                      | Q0                              |
| S2, S3             | D1                              | D1                                      | Q0                              |
| S4, S5             | D2                              | D2                                      | Q1                              |
| S6, S7             | D3                              | D3                                      | Q1                              |
| ...                | ...                             | ...                                     | ...                             |
| S28, S29           | D14                             | D14                                     | Q7                              |
| S30, S31           | D15                             | D15                                     | Q7                              |
|                    |                                 | D16                                     | Q8                              |
|                    |                                 | D17                                     | Q8                              |
|                    |                                 | ...                                     | ...                             |
|                    |                                 | D30                                     | Q15                             |
|                    |                                 | D31                                     | Q15                             |

### **Skewed Associativity**



- **Key idea**: Reduce cache conflicts by randomizing the set index using a different hash function for each way.
- Adv.: reduces conflict misses
- Disadv.: hash functions induce extra latency

Sources:

1) Seznec A., "A Case for 2-Way Skewed-Assoc. Caches", ISCA'93
2) Mutlu O., Comp. Arch., High Performance Caches, CMU, Spr'15
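
The key idea can be sketched with a toy index function. Below, a conventional cache uses the same low address bits as the set index in every way, while the skewed version XORs them with a per-way rotation of the next address bits. The hash family, the 64-set geometry and all names here are hypothetical choices for illustration; they are not the functions from Seznec's paper or from the A78.

```python
INDEX_BITS = 6                    # hypothetical: 64 sets per way
MASK = (1 << INDEX_BITS) - 1


def rotl(value, amount):
    """Rotate an INDEX_BITS-wide value left by `amount` bits."""
    amount %= INDEX_BITS
    return ((value << amount) | (value >> (INDEX_BITS - amount))) & MASK


def conventional_index(line_addr):
    # Same index in every way: just the low bits of the line address.
    return line_addr & MASK


def skewed_index(line_addr, way):
    # Different hash per way: XOR the low bits with a way-specific rotation
    # of the next INDEX_BITS address bits.
    low = line_addr & MASK
    high = (line_addr >> INDEX_BITS) & MASK
    return low ^ rotl(high, way)


# Three line addresses that collide in a conventional cache.
lines = [5, 5 + 64, 5 + 128]
conventional = [conventional_index(a) for a in lines]
way0 = [skewed_index(a, 0) for a in lines]
way1 = [skewed_index(a, 1) for a in lines]
```

All three lines compete for set 5 in the conventional scheme, but each way of the skewed scheme spreads them over three different sets, which is how skewing reduces conflict misses.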

# Write Streaming Mode

Cache line allocation (linefill) normally happens on a read or write miss.

However, not all writes need allocation: writes of large blocks of data, e.g. memset(), can unnecessarily **pollute the cache**, wasting energy and performance.

#### Write Streaming Mode:

- Stores look up the cache, but on a miss they write out to L2 (or L3, or DRAM) instead of allocating a line
- Loads behave as normal

The A78 L1 memory system includes logic to detect when the core has stores pending to a full cache line, or when it executes a DC ZVA (full cache line write to zero).
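
A toy model of the detection logic might look like the sketch below: it counts consecutive full-line stores and stops allocating once a threshold is reached. The class name, the threshold value and the exit condition (any partial-line store) are all hypothetical; the slides do not give the actual A78 heuristics.

```python
class WriteStreamDetector:
    """Toy write-streaming detector. All parameters are hypothetical."""

    def __init__(self, threshold=4):
        self.threshold = threshold   # full-line stores needed to start streaming
        self.full_line_run = 0       # consecutive full-line stores seen so far

    def on_store(self, wrote_full_line):
        """Return True if a miss by this store should allocate a cache line."""
        if wrote_full_line:
            self.full_line_run += 1
        else:
            self.full_line_run = 0   # assumption: a partial store ends the stream
        streaming = self.full_line_run >= self.threshold
        return not streaming


detector = WriteStreamDetector(threshold=2)
decisions = [detector.on_store(full) for full in (True, True, True, False)]
```

In this run the first store still allocates, the next two are treated as streaming and bypass allocation, and the final partial-line store allocates again.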

### Pseudo-LRU (PLRU)

Key Idea: Replace cache lines based on approximate age rather than the actual age.

Compared to true LRU, PLRU needs less state and simpler update logic, at the cost of a slightly less accurate replacement choice.

One simple implementation is Bit-PLRU, also called Most Recently Used (MRU)-based PLRU:

- An MRU bit is assigned to each cache line
- The MRU bit is set to 1 upon a hit
- On a replacement, the cache controller searches for the first cache block with MRU = 0. That block is replaced and its MRU bit is set to 1.
- When setting an MRU bit would leave all blocks with MRU = 1, the MRU bits of the other blocks are cleared

Source:

1) Al-Zoubi H. et al, "Performance evaluation of cache replacement policies for the SPEC CPU2000 benchmark suite", ACM-SE 2004
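
The Bit-PLRU rules on this slide can be sketched for a single cache set as follows. This is a minimal model of the policy as described here, not the A78's actual replacement logic; the class name `BitPLRUSet` and its methods are hypothetical.

```python
class BitPLRUSet:
    """Bit-PLRU state for one cache set (one MRU bit per way)."""

    def __init__(self, ways=4):
        if ways < 2:
            raise ValueError("Bit-PLRU needs at least two ways")
        self.mru = [0] * ways

    def touch(self, way):
        """Record a hit (or a fill) on `way`."""
        self.mru[way] = 1
        if all(self.mru):
            # Every bit is now set: clear the others, keep the newest one.
            self.mru = [0] * len(self.mru)
            self.mru[way] = 1

    def victim(self):
        """Pick the first way with MRU = 0, then mark it most recently used."""
        way = self.mru.index(0)   # always exists: touch() never leaves all ones
        self.touch(way)
        return way


s = BitPLRUSet(ways=4)
s.touch(0)                         # hit on way 0
s.touch(1)                         # hit on way 1
victims = [s.victim(), s.victim(), s.victim()]
```

Ways 0 and 1 were recently used, so ways 2 and 3 are evicted first. Filling way 3 would set every bit, so the other bits are cleared and way 0 becomes the next victim.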
