
## VIRTUALIZING IO THROUGH THE IO MEMORY MANAGEMENT UNIT (IOMMU)

ANDY KEGEL, PAUL BLINZER, ARKA BASU, MAGGIE CHAN | ASPLOS 2016

## WHAT THIS TUTORIAL WILL AND WILL NOT COVER

### Definition of "IO" or "Device" or "IO Device":

- Traditional IO includes GPU for graphics, NIC, storage controller, USB controller, etc.
- New IO (accelerators) includes general-purpose computation on a GPU (GPGPU), encryption accelerators, digital signal processors, etc.

### Two Parts in Virtualizing an IO Device

- Device specific: Virtual instances of device
  - Virtual functions and Physical function in devices (PCIe<sup>®</sup> SR-IOV, MR-IOV)
- System defined: IO Memory Management Unit or IOMMU
  - Virtualizing DMA accesses (Address Translation and Protection)
  - Virtualizing Interrupts (Interrupt Remapping and Virtualizing)




AGENDA





















## Tremendous growth in server virtualization



Efficient access to IO under virtualization is important

Source: IDC Server Virtualization, MCS 2012



Hypervisor (a.k.a. VMM)

Hardware – CPU, Memory, IO








Isolation across guest OSes => a guest OS has no access to (system) physical addresses









23 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016








## INTRODUCTION OF IOMMU: THE LOGICAL VIEW






































#### INTRODUCTION OF IOMMU: THE LOGICAL VIEW ADDING INTERRUPT HANDLING CAPABILITY











Shared virtual addressing is key to ease of programming






#### INTRODUCTION OF IOMMU: THE LOGICAL VIEW ADDING ABILITY TO SHARE ADDRESS SPACE IN HETEROGENEOUS SYSTEM









# INTRODUCTION OF IOMMU: (TYPICAL) PHYSICAL VIEW



### IOMMU FROM THE PERSPECTIVE OF DEVICE (PCIE® SPEC)

The IOMMU acts as the Translation Agent and uses the Address Translation and Protection Table.



|                           | CPU MMU                     | IOMMU                                |
|---------------------------|-----------------------------|--------------------------------------|
| Address Translation       | VA → PA and GVA → GPA → SPA | VA → PA and GVA → GPA → SPA          |
| Memory Protection         | Read/Write etc.             | Read/Write etc.                      |
| Interrupt Handling        | No                          | Remapping and Virtualization Support |
| Parallelism               | Mostly Single Threaded      | Highly Multithreaded                 |
| Page Faults, Events, etc. | Synchronous Handling        | Asynchronous Handling                |
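The GVA → GPA → SPA chain in the table can be illustrated with a toy two-stage lookup. This is only a sketch: the 4 KiB page size, the flat dictionary page tables, and the names `translate` and `nested_translate` are assumptions made for illustration, not the table format of any real MMU or IOMMU.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def translate(page_table, addr):
    """Map an address through a flat {page number: frame number} table."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in page_table:
        raise LookupError(f"no mapping for page {page:#x}")
    return page_table[page] * PAGE_SIZE + offset


def nested_translate(guest_table, host_table, gva):
    """Stage 1: GVA -> GPA (guest OS table). Stage 2: GPA -> SPA (hypervisor table)."""
    gpa = translate(guest_table, gva)
    spa = translate(host_table, gpa)
    return gpa, spa


# Hypothetical mappings: the guest OS owns the first table, the hypervisor the second.
guest_table = {0x10: 0x2}   # GVA page 0x10 -> GPA page 0x2
host_table = {0x2: 0x80}    # GPA page 0x2  -> SPA page 0x80

gpa, spa = nested_translate(guest_table, host_table, 0x10ABC)
print(hex(gpa), hex(spa))   # 0x2abc 0x80abc
```

The page offset (0xABC here) passes through both stages unchanged; only the page number is replaced at each step.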



### IOMMU TECHNOLOGY FAMILIES REFERENCES








### FIVE USE CASES OF IOMMU



# SUPPORTING LEGACY DEVICES

HOW CAN AN IOMMU HELP?

[Figure: physical memory from 0 to 2<sup>64</sup>-1 with the 2<sup>32</sup>-1 boundary marked; the IOMMU translates device address 0x01020304 to 0x208090A0B0C.]

- Many 32-bit DMA devices operate in a 64-bit system
  - Older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), ...
- SW solution: bounce buffers
  - The device does DMA to a region in 32-bit physical address space; the CPU copies data from the buffer to the final destination
  - Slow, needs SW synchronization, ties up a CPU core
- Better solution: the IOMMU remaps 32-bit device physical addresses to system physical addresses beyond 32 bits
  - DMA goes directly into 64-bit memory
  - No CPU transfer
  - More efficient
- Linux: DMA redirect feature
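The difference between the two approaches can be sketched with a toy memory model. Everything here is an illustrative assumption: the `Memory` class, the helper names, the 4 KiB page size, and the target address 0x208090304 (chosen so the page offset is preserved). It is not how any particular OS or IOMMU implements this.

```python
PAGE_SIZE = 4096          # assumed 4 KiB pages
LIMIT_32BIT = 1 << 32     # a 32-bit device cannot emit addresses at or above 4 GiB


class Memory:
    """Sparse, byte-addressable toy model of system physical memory."""

    def __init__(self):
        self.cells = {}

    def write(self, addr, data):
        for i, byte in enumerate(data):
            self.cells[addr + i] = byte

    def read(self, addr, length):
        return bytes(self.cells.get(addr + i, 0) for i in range(length))


def device_dma_write(mem, bus_addr, data, iommu_table=None):
    """DMA write from a 32-bit device; data is assumed to stay within one page."""
    if bus_addr + len(data) > LIMIT_32BIT:
        raise ValueError("address not reachable by a 32-bit device")
    if iommu_table is not None:
        page, offset = divmod(bus_addr, PAGE_SIZE)
        bus_addr = iommu_table[page] * PAGE_SIZE + offset  # remap above 4 GiB
    mem.write(bus_addr, data)


def receive_with_bounce_buffer(mem, data, bounce_addr, final_addr):
    device_dma_write(mem, bounce_addr, data)                  # DMA into low memory
    mem.write(final_addr, mem.read(bounce_addr, len(data)))   # extra CPU copy
    return 1                                                  # CPU copies performed


def receive_with_iommu(mem, data, device_addr, iommu_table):
    device_dma_write(mem, device_addr, data, iommu_table)     # lands at final address
    return 0                                                  # no CPU copy needed


payload = b"packet"
final_addr = 0x208090304                  # hypothetical destination above 4 GiB

mem_a = Memory()
copies_bounce = receive_with_bounce_buffer(mem_a, payload, 0x01020304, final_addr)

mem_b = Memory()
iommu_table = {0x01020: 0x208090}         # device page -> system physical page
copies_iommu = receive_with_iommu(mem_b, payload, 0x01020304, iommu_table)
```

In both cases the payload ends up at `final_addr`; the bounce-buffer path needs one CPU copy per transfer, while the remapped path needs none.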



# IOMMU USECASE: SECURITY AND PROTECTION SECURE BOOT



DMA devices assert physical addresses on the system bus to read and write memory based on SW driver or OS settings

[Figure: physical memory holding passwords and critical data alongside an I/O buffer; with the IOMMU range check, device access to the I/O buffer is OK and access to the critical data is blocked (X).]

- SW bugs or attacks by malicious applications could access and modify important OS data (OS security policy, passwords, ...)
  - The OS cannot detect or prevent the access as it can for the CPU
  - The problem stays latent until it shows up unexpectedly, possibly much later
- This affects system stability, if just the right data is hit
  - "Heisenbugs" are sometimes caused by bugs in system drivers
- Or it allows malicious driver attacks to take over the system
- The IOMMU allows the OS to enforce a DMA access policy for any DMA-capable device accessing physical memory
  - Protects memory state important to stability/security
  - If a disallowed access occurs, the OS gets notified, can shut the device and driver down, and can notify the user or administrator
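The access-policy idea can be sketched as a per-device permission check that records a fault whenever a device touches memory it was not granted. The class `ToyIommu`, the exception `DmaFault`, the device name `nic0`, and the addresses are all hypothetical; real IOMMUs use hardware tables and event logs with different formats.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class DmaFault(Exception):
    """Raised when a device attempts a DMA access the OS did not allow."""


class ToyIommu:
    def __init__(self):
        self.allowed = {}     # (device_id, page) -> set of permitted accesses
        self.event_log = []   # faults the OS can inspect later

    def allow(self, device_id, addr, length, perms):
        """Grant perms ('r', 'w' or 'rw') on every page the range touches."""
        first = addr // PAGE_SIZE
        last = (addr + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self.allowed[(device_id, page)] = set(perms)

    def check(self, device_id, addr, access):
        perms = self.allowed.get((device_id, addr // PAGE_SIZE), set())
        if access not in perms:
            self.event_log.append((device_id, addr, access))  # notify the OS
            raise DmaFault(f"{device_id}: {access} at {addr:#x} denied")
        return addr


iommu = ToyIommu()
iommu.allow("nic0", 0x10000, 2 * PAGE_SIZE, "rw")   # the driver's I/O buffer

ok_addr = iommu.check("nic0", 0x10010, "w")         # inside the I/O buffer
try:
    iommu.check("nic0", 0x5000, "r")                # e.g. a page holding passwords
    blocked = False
except DmaFault:
    blocked = True
```

The logged event is what lets the OS decide to shut down the offending device and driver rather than discover the corruption later.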

- Ensuring that a system is not doing more than it's supposed to
  - e.g., being part of a botnet, or providing banking data or other personal info to impersonators or other attackers
  - The earliest time for attack and defense is at firmware startup
  - From there, critical memory regions are protected from invalid access
- The Secure Boot architecture ensures that no non-vetted OS kernel code runs on the system and changes critical settings
- Some I/O devices can issue DMA requests to system memory directly, without OS or firmware intervention
  - e.g., 1394/FireWire, network cards, as part of network boot
  - That allows attacks to modify memory before the OS even has a chance to protect against them
- As outlined earlier, using the IOMMU prevents DMA access to important memory regions





# IOMMU USECASE: EFFICIENT IO IN VIRTUALIZED ENVIRONMENT

# BACKGROUND: TRADITIONAL DMA BY IO (NO SYSTEM VIRTUALIZATION)

[Figure: traditional DMA without system virtualization. Labels: Device Driver (setup), Core, MMU, IO Device, Virtual Addresses, Protection Check, Physical Addresses, Memory.]





- Each OS assumes full access to the platform hardware
  - Memory, Interrupts, Devices, CPU cores, etc.
- A Virtual Machine Manager (VMM) or Hypervisor (HV) is tasked to manage the physical hardware and define a "virtual machine" (VM) that represents the resources an OS expects to find in the system
- Use cases:
  - System consolidation
  - OS/application compatibility



Most CPUs today have support for system virtualization

- Nested page tables (HV & OS levels) allow VMM/HV to assign and manage system memory and interrupts to Virtual Machines
- I/O devices are typically managed by HV/VMM software, either by...

| Para-Virtualization | Direct-Mapped Device & SR-IOV |
|---------------------|-------------------------------|
| Guest device driver uses HV "hypercalls"<br>Hypervisor manages HW operation (DMA) | Device function is mapped to guest OS<br>Guest OS uses native HW drivers |
| Hypervisor SW validates and redirects I/O requests from Guest OS (overhead, slow) | Physical device DMA must be limited and redirected by Hypervisor (via IOMMU) |
| Hypervisor arbitrates and schedules requests from multiple guest OSes, allows VM migration | One device function per guest OS, physical memory must be committed |
| Most common operation for today's virtualization software<br>Works well for CPU-heavy workloads<br>I/O, graphics or compute-heavy workloads | <ul><li>I/O device must be resettable by HV when guest error puts it in undefined state</li><li>SR-IOV is a variant of direct mapped</li><li>I/O device provides 1 - n "virtual" devices in HW (PCI-SIG standard)</li></ul> |

## EFFICIENT I/O VIRTUALIZATION HARDWARE IMPLEMENTED TECHNIQUE THROUGH IOMMU

IOMMU validates DMA accesses and validates device interrupts
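As a rough sketch of the interrupt side, a remapping table can be modeled as a lookup keyed by the requesting device: interrupts with a programmed entry are redirected to the target chosen by the hypervisor, and anything else is blocked and recorded. The class `InterruptRemapper`, its method names, and the device identifiers are assumptions for illustration only; real remapping table formats differ by vendor.

```python
class InterruptRemapper:
    """Toy interrupt remapping table keyed by (device id, interrupt index)."""

    def __init__(self):
        self.table = {}     # (device_id, index) -> (target_cpu, vector)
        self.blocked = []   # interrupts that had no valid entry

    def program(self, device_id, index, target_cpu, vector):
        """Hypervisor/OS installs the destination for one device interrupt."""
        self.table[(device_id, index)] = (target_cpu, vector)

    def deliver(self, device_id, index):
        """Return the remapped (cpu, vector), or None if the interrupt is blocked."""
        entry = self.table.get((device_id, index))
        if entry is None:
            self.blocked.append((device_id, index))
            return None
        return entry


remapper = InterruptRemapper()
remapper.program("nic0", 0, target_cpu=2, vector=0x41)

delivered = remapper.deliver("nic0", 0)    # (2, 0x41): redirected as programmed
dropped = remapper.deliver("rogue", 0)     # None: no entry, so it is blocked
```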



- Using the IOMMU allows a Hypervisor to assign a physical device exclusively to a Guest VM without danger of memory corruption to other VMs
  - Beneficial if one VM requires near-native performance
  - Or if an OS needs to be "sandboxed" (because of suspected malware)
- A native driver can operate in the Guest OS
- IOMMU enforces Hypervisor policy on memory and system resource isolation for each of the Guest Virtual Machines
- IOMMU redirects device physical addresses set up by the Guest OS driver (= Guest Physical Addresses) to the actual Host System Physical Addresses (SPA)
  - Useful for platform resources that have "well-known" addresses like legacy devices or system resources like APIC (Advanced Programmable Interrupt Controller)
- Allows near-native device performance for high-performance devices with low system impact
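Device assignment as described above can be sketched as two lookups: device to guest (domain), then guest physical page to system physical page. The names `AssignmentIommu`, `assign`, `map_page`, the device and VM identifiers, and the 4 KiB page size are hypothetical choices for this illustration.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class AssignmentIommu:
    """Toy model: each device belongs to one guest; each guest has a GPA -> SPA table."""

    def __init__(self):
        self.device_domain = {}   # device_id -> guest (domain) name
        self.domain_tables = {}   # guest name -> {GPA page: SPA page}

    def assign(self, device_id, domain):
        self.device_domain[device_id] = domain
        self.domain_tables.setdefault(domain, {})

    def map_page(self, domain, gpa_page, spa_page):
        self.domain_tables.setdefault(domain, {})[gpa_page] = spa_page

    def translate(self, device_id, gpa):
        domain = self.device_domain.get(device_id)
        if domain is None:
            raise PermissionError(f"{device_id} is not assigned to any guest")
        page, offset = divmod(gpa, PAGE_SIZE)
        table = self.domain_tables[domain]
        if page not in table:
            raise PermissionError(f"{device_id}: GPA {gpa:#x} not mapped for {domain}")
        return table[page] * PAGE_SIZE + offset


iommu = AssignmentIommu()
iommu.assign("nic-vf0", "vm-a")
iommu.assign("nic-vf1", "vm-b")
iommu.map_page("vm-a", 0x1, 0x40)   # hypervisor policy for guest A
iommu.map_page("vm-b", 0x1, 0x90)   # same GPA page, different system memory

spa_a = iommu.translate("nic-vf0", 0x1234)   # 0x40234
spa_b = iommu.translate("nic-vf1", 0x1234)   # 0x90234
```

Both guest drivers program the same guest physical address, yet each device can only reach the system memory the hypervisor mapped for its own guest.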



## IOMMU USECASE: ENABLING HETEROGENEOUS COMPUTING

The limiters that need to be fixed to unleash programmers:

- Multiple memory pools, multiple address spaces
- High overhead dispatch, data copies across PCIe
- New languages and APIs for GPU programming necessary (OpenCL, etc.)
  - And sometimes proprietary environments
- ➔ Dual source development
- Expert programmers only



- Some memory copies are gone, because the same memory is accessed
  - But the memory is not accessible concurrently, because of cache policies
- Two memory pools remain (cache coherent + non-coherent memory regions)
- Jobs are still queued through the OS driver chain and suffer from overhead
- Still requires expert programmers to get performance
- This is only an intermediate step in the journey





Unified Coherent Memory enables data sharing across all processors

- Processors architected to operate cooperatively
  - Can exchange data "on the fly", similar to what CPU threads do
  - The lower job dispatch overhead allows tasks to be handled by the GPU that previously were "too costly" to transfer over
- Designed to let the application run on different processors without substantially changing the programming logic
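The effect on data movement can be sketched by counting copies in the two models. This is a deliberately simplified illustration: the function names are hypothetical, a Python list stands in for a memory buffer, and an ordinary function stands in for the accelerator kernel.

```python
def run_with_separate_pools(host_data, kernel):
    """Split memory pools: copy input to the device pool, compute, copy results back."""
    copies = 0
    device_buf = list(host_data)   # host -> device copy
    copies += 1
    kernel(device_buf)
    host_data[:] = device_buf      # device -> host copy
    copies += 1
    return copies


def run_with_unified_memory(host_data, kernel):
    """Unified memory: the accelerator operates on the same buffer the CPU uses."""
    kernel(host_data)
    return 0


def double_in_place(buf):
    """Stand-in for an accelerator kernel."""
    for i, value in enumerate(buf):
        buf[i] = 2 * value


data_split = [1, 2, 3]
copies_split = run_with_separate_pools(data_split, double_in_place)

data_unified = [1, 2, 3]
copies_unified = run_with_unified_memory(data_unified, double_in_place)
```

Both runs produce the same result, but the unified path performs no copies, which is what makes small tasks worth dispatching.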



### IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS




The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps:

- Use of accelerators as a first-class, peer processor within the system
  - Unified process address space access across all processors
    - Shared Virtual Memory (SVM), "GPU ptr == CPU ptr"
  - Accelerator operates in pageable system memory\*
  - Cache coherency between the CPU and accelerator caches
  - User mode dispatch/scheduling reduces job-dispatch overhead
  - QoS with preemption/context switch of GPU Compute Units

#### ▲ The IOMMU enforces control of GPU access to memory

 OS can efficiently and safely share process page tables with accelerators (requires ATS/PRI protocol support)



The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps:

- Use of accelerators as a first-class, peer processor within the system
  - Unified process address space access across all processors
    - Shared Virtual Memory (SVM), "GPU ptr == CPU ptr"
  - Accelerator operates in pageable system memory\*
  - Cache coherency between the CPU and accelerator caches
  - User mode dispatch/scheduling reduces job-dispatch overhead
  - QoS with preemption/context switch of GPU Compute Units

#### ▲ The IOMMU enforces control of GPU access to memory

- OS can efficiently and safely share process page tables with accelerators (requires ATS/PRI protocol support)
- Accelerators can't step outside of the OS-set boundaries
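
The boundary enforcement described above can be illustrated with a toy model. This is a minimal sketch, not the AMD IOMMU's actual data layout: `PageTable`, `AccessFault`, the 4 KiB page size, and the frame number are assumptions made for illustration. The CPU and the accelerator resolve the same virtual address through one shared table, and any access outside the mappings the OS installed is rejected.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class AccessFault(Exception):
    """Raised when an access falls outside what the OS has mapped."""


class PageTable:
    """Toy single-level page table: virtual page number -> (frame, writable)."""

    def __init__(self):
        self.entries = {}

    def map(self, vaddr, frame, writable):
        self.entries[vaddr // PAGE_SIZE] = (frame, writable)

    def translate(self, vaddr, write=False):
        entry = self.entries.get(vaddr // PAGE_SIZE)
        if entry is None:
            raise AccessFault(f"no mapping for {vaddr:#x}")
        frame, writable = entry
        if write and not writable:
            raise AccessFault(f"write to read-only page at {vaddr:#x}")
        return frame * PAGE_SIZE + vaddr % PAGE_SIZE


# The OS builds one table for the process; both agents walk the same table.
process_table = PageTable()
process_table.map(0x12340000, frame=0x80, writable=True)

cpu_pa = process_table.translate(0x12340010)  # CPU MMU path
gpu_pa = process_table.translate(0x12340010)  # accelerator path via the IOMMU
```

Because both paths consult the same table, "GPU ptr == CPU ptr" holds by construction in this model, and an unmapped address raises `AccessFault` regardless of which agent issued it.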




[Figure: the GPU-visible process virtual address space (mapped via the HSA MMU) and the CPU-visible process virtual address space (mapped via the CPU MMU) both span 0x00000000 to 2<sup>47</sup>-1 and map into the same system physical address space; the same virtual address, e.g. 0x12340000, appears in both.]




#### The benefits of the Heterogeneous System Architecture:

- Pageable memory access is validated and handled directly by the OS memory manager via AMD IOMMU
- Application data structures can be directly parsed by the accelerator and pointer links followed without CPU help
- Common high level languages and tools (compilers, runtimes, ...) port easily to accelerators
  - C/C++, Python, Java, ... already have open source implementations
  - Many more languages to follow
- The IOMMU makes it easier for programmers to use GPUs and other accelerators safely and efficiently




▲ Goal of the software stack is to focus on high-level language support

- Allow software to target the GPU directly
- Drivers set up the hardware and policies, then get out of the way
- IOMMU support provides hardware-enforced protection for the operating system



HSA Software Stack

© Copyright 2014 HSA Foundation. All Rights Reserved.

# LINES-OF-CODE AND PERFORMANCE COMPARISONS



AMD A10-5800K APU with Radeon<sup>™</sup> HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL<sup>™</sup> 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta


# ACCELERATORS: THE PORTABILITY CHALLENGE

#### CPU ISAs

- ISA innovations added incrementally (e.g., NEON, AVX)
  - ISA retains backwards-compatibility with previous generation
- Two dominant instruction-set architectures: ARM and x86

### GPU ISAs

- Massive diversity of architectures in the market
  - Each vendor has its own ISA and often several in the market at same time
- No commitment (or attempt!) to provide any backwards compatibility
  - Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction

# WHAT IS HSA INTERMEDIATE LANGUAGE (HSAIL)?

▲ Intermediate language for parallel compute in HSA

- Generated by a "High Level Compiler" (GCC, LLVM, Java VM, etc.)
- Expresses parallel regions of code
- Binary format of HSAIL is called "BRIG"
- Goal: Bring parallel acceleration to mainstream programming languages
- IOMMU based pointer translation is key to enabling an efficient IL Implementation




# MEMBERS DRIVING HSA FOUNDATION

http://www.hsafoundation.com/



# GEN1: FIR & AES

- FIR is a memory-intensive streaming workload
- AES is a compute-intensive streaming workload
- OpenCL 1.2 cl\_mem buffer
  - Copy to/from the device
- OpenCL 2.0 SVM buffer, coarse-grained sync
  - Copy to/from SVM
  - Data copy cannot be avoided, since the space for SVM is limited
- HSA unified memory space, fine-grained sync
  - Regular pointer
  - No explicit copy
- Results
  - HSA compute abstraction
  - NO performance penalty
- Not all algorithms run faster
  - Measured on Kaveri (a pre-HSA 1.0 device)
  - Limited coherent throughput



Saoni Mukherjee, Yifan Sun, Paul Blinzer, Amir Kavyan Ziabari, David Kaeli, *A Comprehensive Performance Analysis of HSA and OpenCL 2.0,* **Proceedings of the 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS),** April 2016, to appear.

# BLACKSCHOLES

### ▲ C++ on HSA

- Matches or outperforms OpenCL

#### Coarse Grained SVM

- Matches OpenCL buffers for bandwidth
- More predictable performance

#### Fine Grained SVM

- Faster kernel dispatch
- Larger allocations
- Shared data structure

### Results

- HSA compute abstraction
- NO performance penalty

#### SOURCE: RALPH POTTER - CODEPLAY. PRESENTATION MADE TO SG14 C++ WORKGROUP



### ENABLING HETEROGENEOUS COMPUTING SUMMARY AND DEMONSTRATION



#### Key Takeaways:

- To further scale up compute performance, software must take better advantage of system accelerators like GPUs and DSPs in high level languages
- Accelerators following the HSA Foundation specification requirements allow programmers to write or port programs easily using common high level languages
- The AMD IOMMU is key to accessing process virtual memory efficiently and safely!
  - Translates both process address space accesses (via PASID) and device physical accesses
  - Enforces OS allocation policy, handles virtual memory page faults, and much more

### AGENDA




# **RECAP: IOMMU AND ITS CAPABILITIES**




# AGENDA: WHAT IS COMING UP?



DMA Address Translation

- Address translation and memory protection in un-virtualized system
- Making address translation faster through caching
- Enabling shared address space in heterogeneous system
- Enabling pre-translation through IOMMU
- Enabling demand paging from devices (dynamic page fault)
- Nested address translation in virtualized system
- Invalidating IOMMU mappings

Interrupt Handling

- Interrupt filtering and remapping
- Interrupt virtualization

Summary

- A peek inside a typical IOMMU implementation
- Data structures and their interactions

# IOMMU Internals: Address Translation and Memory Protection


# ADDRESS TRANSLATION AND MEMORY PROTECTION NON-VIRTUALIZED SYSTEM






## MAKING TRANSLATION FAST

#### CACHING TRANSLATION IN IOMMU






# IOMMU Internals: Enabling "Pointer-is-a-Pointer" in Heterogeneous Systems


#### SHARING ADDRESS SPACE WITH CPU: ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS

[Figure: sharing an address space between CPU and GPU. Labels: Process 0, Process 1, Domain, cores with MMUs (virtual addresses in, physical addresses out), IO device/GPU issuing virtual-address DMA requests tagged with a DeviceID, IOMMU, Device Table, x86-64 page table, memory.]

Needs ability to identify more than one address space
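
One way to picture the requirement above is a two-step lookup: the DeviceID selects a Device Table entry, and a per-request process identifier (the PASID mentioned in the key takeaways) selects one of several address spaces for that device. The sketch below is an illustrative assumption, not the hardware format: plain dictionaries stand in for the Device Table, the gCR3 table, and the per-process page tables, and all IDs and frame numbers are made up.

```python
PAGE_SIZE = 4096  # assumed page size

# Per-process page tables: virtual page number -> physical frame number.
page_table_p0 = {0x1000: 0x20}
page_table_p1 = {0x1000: 0x35}

# Stand-in for a gCR3 table: PASID -> page table root of that process.
gcr3_table = {0: page_table_p0, 1: page_table_p1}

# Stand-in for the Device Table: DeviceID -> that device's gCR3 table.
device_table = {0x0100: gcr3_table}


def iommu_translate(device_id, pasid, vaddr):
    """Resolve a device's virtual address in the address space named by PASID."""
    page_table = device_table[device_id][pasid]
    frame = page_table[vaddr // PAGE_SIZE]
    return frame * PAGE_SIZE + vaddr % PAGE_SIZE


va = 0x1000 * PAGE_SIZE + 0x8
pa_p0 = iommu_translate(0x0100, pasid=0, vaddr=va)
pa_p1 = iommu_translate(0x0100, pasid=1, vaddr=va)
```

The same device and the same virtual address resolve to different physical addresses depending on the PASID, which is what lets one GPU serve two processes at once.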


## IOMMU Internals: Enabling Translation Caching in Devices


#### ENABLING MORE CAPABLE DEVICE/ACCELERATORS

- Locally caching address translation in device reduces trips to IOMMU
- IOMMU driver assigns pre-translation capability to devices
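
The effect of a device-side translation cache can be sketched with a small counter experiment. The class names `Iommu` and `Device`, the 4 KiB page size, and the access pattern are assumptions for illustration only; the point is that repeated accesses to one page cost a single translation request.

```python
PAGE_SIZE = 4096  # assumed page size


class Iommu:
    """Toy IOMMU that answers translation requests and counts them."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> frame number
        self.translation_requests = 0

    def translate_page(self, vpn):
        self.translation_requests += 1
        return self.page_table[vpn]


class Device:
    """Toy device with a local translation cache (a device TLB)."""

    def __init__(self, iommu):
        self.iommu = iommu
        self.iotlb = {}  # virtual page number -> frame number

    def access(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.iotlb:                    # miss: ask the IOMMU once
            self.iotlb[vpn] = self.iommu.translate_page(vpn)
        return self.iotlb[vpn] * PAGE_SIZE + offset  # hit: pre-translated address


iommu = Iommu({0x10: 0x99})
device = Device(iommu)
addresses = [device.access(0x10 * PAGE_SIZE + i * 64) for i in range(8)]
```

Eight accesses within the same page trigger one trip to the IOMMU; the remaining seven use the locally cached translation.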






# IOMMU Internals: Enabling Demand Paging from IO → No Need to Pin Memory

































#### SERVICING DEVICE PAGE FAULT

[Figure: servicing a device page fault. Labels recoverable from the slides: Device Table, gCR3 table, and a PPR\* record with the fields PASID, dID, Addr, and Flag.]

\*PPR = Peripheral Page Request
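
The demand-paging flow can be approximated in a few lines. This is a rough sketch under stated assumptions: the record fields follow the slide's "PASID dID Addr Flag" (reading dID as the device ID is an assumption), while `ppr_log`, `device_access`, `os_service_page_requests`, and the frame allocator are hypothetical names, not the driver's API.

```python
from collections import deque, namedtuple

PAGE_SIZE = 4096  # assumed page size

# Field names follow the slide's "PASID dID Addr Flag"; the layout is assumed.
PprRecord = namedtuple("PprRecord", "pasid device_id addr flag")

page_tables = {7: {}}   # PASID -> {virtual page number: frame}; starts unmapped
ppr_log = deque()       # stand-in for the log the OS reads page requests from
next_free_frame = [0x40]


def device_access(device_id, pasid, vaddr):
    """Return the physical address, or None after queuing a page request."""
    frame = page_tables[pasid].get(vaddr // PAGE_SIZE)
    if frame is None:
        ppr_log.append(PprRecord(pasid, device_id, vaddr, "read"))
        return None
    return frame * PAGE_SIZE + vaddr % PAGE_SIZE


def os_service_page_requests():
    """OS side: map each requested page, then report what was completed."""
    completed = []
    while ppr_log:
        req = ppr_log.popleft()
        page_tables[req.pasid][req.addr // PAGE_SIZE] = next_free_frame[0]
        next_free_frame[0] += 1
        completed.append(req)
    return completed


first_try = device_access(0x0100, 7, 0x5008)    # faults: page not resident
serviced = os_service_page_requests()
second_try = device_access(0x0100, 7, 0x5008)   # retry succeeds without pinning
```

The first access fails and leaves a request for the OS; once the OS maps the page, the retried access succeeds, so the memory never had to be pinned in advance.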



# IOMMU Internals: Nested (Two-Level) Address Translation


## RECAP: ADDRESS TRANSLATION IN VIRTUALIZED SYSTEMS



## NESTED ADDRESS TRANSLATION BY IOMMU
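
A minimal sketch of the two-level idea, assuming single-level toy tables and a 4 KiB page size: the guest OS controls the mapping from guest virtual to guest physical addresses, and the hypervisor controls the mapping from guest physical to system physical addresses. The table contents and function names are illustrative. A real nested walk also translates the guest's page-table pointers through the second level, which this sketch omits.

```python
PAGE_SIZE = 4096  # assumed page size


def walk(page_table, addr):
    """Translate addr through a toy single-level table (page number -> frame)."""
    frame = page_table[addr // PAGE_SIZE]
    return frame * PAGE_SIZE + addr % PAGE_SIZE


# Guest OS table: guest virtual -> guest physical.
guest_table = {0x400: 0x12}
# Hypervisor table: guest physical -> system physical.
host_table = {0x12: 0x7A}


def nested_translate(gva):
    gpa = walk(guest_table, gva)   # first level, controlled by the guest OS
    spa = walk(host_table, gpa)    # second level, controlled by the hypervisor
    return gpa, spa


gpa, spa = nested_translate(0x400 * PAGE_SIZE + 0x24)
```

The guest never sees `spa`; it only ever handles guest physical addresses, which is what preserves isolation across guests.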



# IOMMU Internals: Sending Commands to IOMMU


## COMMANDS TO IOMMU

▲ IOMMU Driver (running on CPU) issues commands to IOMMU

- e.g., Invalidate IOMMU TLB Entry, Invalidate IOTLB Entry
- e.g., Invalidate Device Table Entry
- e.g., Complete PPR, Completion Wait, etc.

#### ▲ Issued via Command Buffer

- Memory-resident circular buffer
- MMIO registers: Base, Head, and Tail registers
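
The head/tail discipline of such a circular buffer can be sketched as follows. This is a generic ring-buffer model, not the hardware's register interface: `CommandBuffer`, its method names, and the choice to keep one slot empty to distinguish "full" from "empty" are assumptions for illustration.

```python
class CommandBuffer:
    """Toy circular command buffer with head and tail indices."""

    def __init__(self, num_entries):
        self.slots = [None] * num_entries  # stands in for memory at Base
        self.head = 0                      # next entry the IOMMU will consume
        self.tail = 0                      # next free entry for the driver

    def is_full(self):
        # One slot stays empty so that head == tail always means "empty".
        return (self.tail + 1) % len(self.slots) == self.head

    def driver_submit(self, command):
        if self.is_full():
            return False
        self.slots[self.tail] = command
        self.tail = (self.tail + 1) % len(self.slots)  # models a Tail update
        return True

    def iommu_consume(self):
        executed = []
        while self.head != self.tail:
            executed.append(self.slots[self.head])
            self.head = (self.head + 1) % len(self.slots)
        return executed


ring = CommandBuffer(num_entries=4)
accepted = [ring.driver_submit(c) for c in ("cmd_a", "cmd_b", "cmd_c", "cmd_d")]
executed = ring.iommu_consume()
```

The driver only ever advances the tail and the consumer only ever advances the head, so the two sides need no lock beyond the register reads and writes.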




### ▲ IOMMU TLB Shootdown

- Update page table information
- Flush TLB entries containing stale information

### ▲ Three steps in IOMMU TLB shootdown

- Invalidating IOMMU TLB entry
- Invalidating IO TLB (Device TLB) entry
- Wait for completion
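
The three steps can be sketched as a sequence of commands applied to two caches. The command strings, the dictionaries standing in for the IOMMU TLB and device TLB, and the synchronous execution are all simplifying assumptions; real hardware processes the commands asynchronously, which is why the final wait matters.

```python
page_table = {0x10: 0x50}   # virtual page number -> frame (memory resident)
iommu_tlb = {0x10: 0x50}    # translation cached inside the IOMMU
device_tlb = {0x10: 0x50}   # translation cached inside the device (IOTLB)


def execute(command, vpn=None):
    """Toy IOMMU command execution; the command names are illustrative."""
    if command == "invalidate_iommu_tlb":
        iommu_tlb.pop(vpn, None)
    elif command == "invalidate_iotlb":
        device_tlb.pop(vpn, None)
    elif command == "completion_wait":
        return True   # earlier commands have finished in this sequential model
    return False


def shootdown(vpn, new_frame):
    page_table[vpn] = new_frame              # update page table information
    execute("invalidate_iommu_tlb", vpn)     # step 1: IOMMU TLB entry
    execute("invalidate_iotlb", vpn)         # step 2: device TLB entry
    return execute("completion_wait")        # step 3: wait for completion


done = shootdown(0x10, new_frame=0x51)
```

After the shootdown neither cache holds the stale frame, so the next access must pick up the updated page table entry.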



















[Figure: IOMMU TLB shootdown walkthrough. Labels: IOMMU driver, cores with MMUs, IO device with TLB, Device Table Entry Cache, Translation Lookaside Buffer, page table walker, Command Buffer.]































## IOMMU INTERNALS: INTERRUPT REMAPPING AND VIRTUALIZATION




























[Figure: virtualized interrupt delivery from an IO device to Guest OS 0. Labels: vAPIC, cores with APICs and MMUs, IO devices, VMM, memory.]



# IOMMU INTERNALS: A TYPICAL IOMMU HARDWARE DESIGN

#### EXAMPLE OF IOMMU HARDWARE DESIGN



# CACHE SIZING VS PRODUCT TYPE

- Typical Client Product
  - Non-Virtualized
  - I/O Isolation
  - Small Working Set
- Typical Server Product
  - Virtualized
  - Large Working Set
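
Why the working set drives cache sizing can be seen with a small simulation. This is an illustrative sketch, not measured data: the LRU policy, the 64-entry capacity, the working-set sizes, and the cyclic sweep (a worst case for LRU) are all assumptions chosen to make the contrast visible.

```python
from collections import OrderedDict


def hit_rate(tlb_entries, working_set_pages, accesses=10_000):
    """Hit rate of an LRU translation cache under a cyclic page sweep."""
    tlb = OrderedDict()
    hits = 0
    for i in range(accesses):
        page = i % working_set_pages
        if page in tlb:
            hits += 1
            tlb.move_to_end(page)          # mark as most recently used
        else:
            tlb[page] = True
            if len(tlb) > tlb_entries:
                tlb.popitem(last=False)    # evict the least recently used page
    return hits / accesses


client_like = hit_rate(tlb_entries=64, working_set_pages=32)
server_like = hit_rate(tlb_entries=64, working_set_pages=4096)
```

When the working set fits, only the first touch of each page misses; when it far exceeds the cache, this access pattern misses every time, which is the pressure that motivates larger caches in server products.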





## **IOMMU INTERNALS:** SUMMARY OF KEY DATA STRUCTURES

### IOMMU'S KEY DATA STRUCTURES



## DEVICE TABLE ENTRY

#### Each entry is 32B
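
Given the 32-byte entry size, locating a device's entry is simple arithmetic if the table is treated as a flat array indexed by a 16-bit DeviceID. The sketch below assumes the usual PCI bus/device/function packing for the DeviceID and uses a made-up base address; both are assumptions for illustration rather than details taken from the slides.

```python
DTE_SIZE = 32  # bytes per Device Table entry, as stated on the slide


def device_id(bus, device, function):
    """Pack a PCI bus/device/function triple into a 16-bit DeviceID."""
    assert 0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8
    return (bus << 8) | (device << 3) | function


def dte_address(table_base, dev_id):
    """Address of the entry for dev_id in a flat table starting at table_base."""
    return table_base + dev_id * DTE_SIZE


dev_id = device_id(bus=1, device=0, function=0)
entry_addr = dte_address(0x10000000, dev_id)   # base address is hypothetical
table_bytes = (1 << 16) * DTE_SIZE             # full table for all 16-bit IDs
```

Under these assumptions a table covering every possible DeviceID occupies 2 MiB.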



### INTERRUPT REMAPPING TABLE ENTRY

Each entry is 128 bits. Two modes:

- Interrupt Remapping (guest mode=0)
- Interrupt Virtualization (guest mode=1)

[Figure: entry layout for guest mode=0.]
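
The remapping mode (guest mode=0) can be pictured as a per-device table lookup that either rewrites or blocks an interrupt. The sketch below is only a conceptual model: the field names `valid`, `vector`, and `destination`, the dictionary layout, and the IDs are assumptions, and the real 128-bit entry format is not reproduced here.

```python
# Stand-in for per-device interrupt remapping tables, indexed by the
# interrupt number the device sends. Field names are illustrative only.
remap_table = {
    0x0100: {
        0: {"valid": True, "vector": 0x41, "destination": 2},
        1: {"valid": False, "vector": 0, "destination": 0},
    }
}


def remap_interrupt(device_id, index):
    """Return (destination, vector) for a permitted interrupt, else None."""
    entry = remap_table.get(device_id, {}).get(index)
    if entry is None or not entry["valid"]:
        return None                     # filtered: the interrupt is blocked
    return entry["destination"], entry["vector"]


delivered = remap_interrupt(0x0100, 0)
blocked = remap_interrupt(0x0100, 1)
unknown_device = remap_interrupt(0x0200, 0)
```

Only interrupts with a valid entry reach a core, and they arrive with the destination and vector chosen by system software rather than by the device.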



#### AGENDA



**Research Opportunities and Tools** 



#### **RESEARCH DIRECTIONS**

- Isolation from malicious or buggy third party accelerators
  - Can the IOMMU ensure protection in the presence of untrusted accelerators?
- Specializing IOMMU for performance and power
  - Can IOMMU hardware exploit the predictable access patterns of some accelerators?
- Trading memory protection for performance
  - Can selectively lowering protection enable better performance?
- Extending (limited) virtual memory to embedded accelerators
  - Can we design an IOMMU<sup>LITE</sup> for embedded low-power accelerators?
- Avoiding interference in the IOMMU
  - How to reduce interference among multiple devices accessing the IOMMU?

#### ISOLATION FROM THIRD PARTY ACCELERATORS

[Figure: emergence of 3<sup>rd</sup> party accelerators. Labels: cores with MMUs, accelerator, 3rd party (un-trusted) accelerator, IOMMU, memory.]

**Q:** How to integrate third party accelerators efficiently and securely?

How to determine if a device is trustworthy and remains trustworthy?

It may not be possible to verify that a 3<sup>rd</sup> party accelerator is not buggy.












#### SPECIALIZING IOMMU FOR PERFORMANCE AND POWER

- IOMMU designs resemble CPU MMU designs
  - But device/accelerator access patterns differ from the CPU's
- IOMMU caters to disparate devices
  - A single design point may not be optimal for all
  - e.g., the access pattern of a GPU is likely different from a NIC's

#### Study traffic pattern to IOMMU and specialize for common patterns

- **Related work**: Malka et al.'s "rIOMMU" in ASPLOS'15
  - Idea: Exploit *predictable* IOMMU accesses from devices using circular ring buffers
  - Replace page table with circular, flat table → easy page walk
  - Predictable access → single-entry IOTLB with no TLB miss and less invalidation
- Possible to use device-specific knowledge to optimize performance
  - IOMMU prefetching and TLB caching hints can be useful
  - Replacement policy coordination between IOTLB (Device TLB) and IOMMU TLB
  - Energy/power optimization in IOMMU
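
The intuition behind a flat, circular table with a single cached entry can be sketched as below. This toy is loosely inspired by the bullets above and is not the rIOMMU design itself: `RingTranslator`, its prefetch-next policy, and the frame numbers are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size


class RingTranslator:
    """Flat, circular translation table with a single cached entry."""

    def __init__(self, frames):
        self.table = list(frames)          # slot i holds the frame for ring slot i
        self.next_slot = 0                 # the single cached entry
        self.next_frame = self.table[0]
        self.misses = 0

    def translate(self, slot, offset=0):
        if slot != self.next_slot:         # only an out-of-order access misses
            self.misses += 1
            self.next_slot, self.next_frame = slot, self.table[slot]
        addr = self.next_frame * PAGE_SIZE + offset
        # Prefetch the following slot, wrapping around the ring.
        self.next_slot = (slot + 1) % len(self.table)
        self.next_frame = self.table[self.next_slot]
        return addr


ring = RingTranslator(frames=[0x10, 0x11, 0x12, 0x13])
in_order = [ring.translate(i % 4) for i in range(8)]
misses_in_order = ring.misses
ring.translate(2)       # out of order: the prediction was slot 0
```

As long as the device walks its ring in order, the one cached entry is always the right one; a miss occurs only when the access order breaks the prediction.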


#### TRADING PROTECTION FOR PERFORMANCE

▲ IOMMU hardware allows lowering protection for performance

- For example: pre-translated DMA transactions pass through the IOMMU
- A *trusted* IO device can access any address and can even cause interrupt storms

- ▲ OS policies for trading off protection against performance
  - Should the sysadmin decide how much to trust a device/driver?
  - Exposing software knobs for dialing performance vs. protection
  - Related work: OS policies for *Strict* vs *Deferred* protection strategy [WILLMANN'08, BEN-YEHUDA'07, AMIT'11]
  - ASPLOS'16: Strict, sub-page grain protection through Shadow DMA-buffer [MARKUZE'16]


#### IOMMU<sup>LITE</sup> FOR EMBEDDED LOW-POWER ACCELERATORS

▲ Virtual memory eases programming (e.g., "pointer-is-pointer")

- But comes at performance and energy cost
- Stripped-down IOMMU for ultra low-power accelerators
  - Lower hardware, performance, power cost by stripping non-essential features
  - Example "non-essential" features: IO virtualization support, interrupt remapping, page fault handling, nested page table walker, etc.

#### Related work:

- Vogel et al.'s "Lightweight Virtual Memory" in CODES'15 [VOGEL'15]
  - Idea: Software-managed IOMMU for FPGA → no translation miss handling in hardware
  - Simple design, high performance with effective software management

## AVOIDING (DESTRUCTIVE-) INTERFERENCE IN IOMMU










#### **RESEARCH: TOOLS AND MODELING**

▲ Software research: IOMMU driver/OS policies

- Easy! Open source IOMMU Driver in Linux

▲ Hardware research: Modifying IOMMU hardware behavior

- Option 1: Hardware performance counter + Analytical models
- Option 2: Simulator with IOMMU model
  - Work in progress to add IOMMU model in gem5
  - Write your email on the attendance sheet if interested

#### SUMMARY

### IOMMU (kernel-mode) Driver:

#### **Configuration/Setup IOMMU hardware**



#### REFERENCES

- IOMMU specification: http://support.amd.com/TechDocs/48882\_IOMMU.pdf
- OLSON'15: Lena Olson et al. "Border Control: Sandboxing Accelerators", MICRO 2015.
- AMIT'11: Nadav Amit et al. "vIOMMU: Efficient IOMMU Emulation", USENIX ATC 2011.
- BEN-YEHUDA'07: Muli Ben-Yehuda et al. "The Price of Safety: Evaluating IOMMU Performance", OLS 2007.
- MALKA'15: Moshe Malka et al. "rIOMMU: Efficient IOMMU for I/O Devices That Employ Ring Buffers", ASPLOS 2015.
- WILLMANN'08: Paul Willmann et al. "Protection Strategies for Direct Access to Virtualized I/O Devices", USENIX ATC 2008.
- VOGEL'15: Pirmin Vogel et al. "Lightweight virtual memory support for many-core accelerators in heterogeneous embedded SoCs", CODES 2015.
- MARKUZE'16: Markuze et al. "True IOMMU Protection from DMA Attacks", ASPLOS 2016.

# **QUESTIONS AND FEEDBACK**

#### ▲ Reachable @

- Arka Basu: Arkaprava "dot" Basu "at" amd.com
- Andy Kegel: Andrew "dot" Kegel "at" amd.com
- Paul Blinzer: Paul "dot" Blinzer "at" amd.com
- Maggie Chan: Maggie "dot" Chan "at" amd.com

#### **DISCLAIMER & ATTRIBUTION**

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

#### ATTRIBUTION

© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). OpenCL is a trademark of Apple Inc. used by permission by Khronos. ARM <sup>®</sup> is/are the registered trademark(s) of ARM Limited in the EU and other countries. PCIe<sup>®</sup> is registered trademark of PCI-SIG corporation. Other name are for informational purposes only and may be trademarks of their respective owners.