# COMP 422 Parallel Computing: An Introduction

John Mellor-Crummey

Department of Computer Science Rice University johnmc@rice.edu



#### **Course Information**

- Meeting time: TTh 1:00-2:15
- Meeting place: DH 1075
- Instructor: John Mellor-Crummey
  - —Email: johnmc@cs.rice.edu
  - ---Office: DH 3082, x5179

-Office Hours: Wednesday 8:30am-9:30am or by appointment

#### Teaching Assistants

—Sriraj Paulemail: srp7@rice.eduoffice: DH 3064—Chaoran Yangemail: chaoran@rice.eduoffice: DH 3053—Office Hours: TBA

• WWW site: http://www.clear.rice.edu/comp422

#### **Parallelism**

- Definition: ability to execute parts of a program concurrently
- Goal: shorter running time
- Grain of parallelism: how big are the units? —bits, instructions, statements, loop iterations, procedures, ...
- COMP 422 focus: explicit thread-level parallelism
  —thread = a flow of control executing a sequence of instructions

### **Course Objectives**

- Learn fundamentals of parallel computing
  - ---principles of parallel algorithm design
  - -programming models and methods

  - -parallel algorithms
  - -modeling and analysis of parallel programs and systems
- Develop skill writing parallel programs

-programming assignments

 Develop skill analyzing parallel computing problems —solving problems posed in class

#### **Recommended Text**



#### Introduction to Parallel Computing, 2nd Edition

Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar

Addison-Wesley 2003

# **Topics (Part 1)**

- Introduction
- Principles of parallel algorithm design (Chapter 3)
  - -decomposition techniques
  - —mapping & scheduling computation
  - -templates
- Programming shared-address space systems (Chapter 7)
  - -Cilk/Cilk++
  - -OpenMP
  - -Pthreads
  - -synchronization
- Parallel computer architectures (Chapter 2)
  - ----shared memory systems and cache coherence
  - -distributed-memory systems
  - -interconnection networks and routing

# **Topics (Part 2)**

- Programming scalable systems (Chapter 6)
  - -message passing: MPI
  - -global address space languages
- Collective communication
- Analytical modeling of program performance (Chapter 5) —speedup, efficiency, scalability, cost optimality, isoefficiency
- Parallel algorithms (Chapters 8 & 10)
  - —non-numerical algorithms: sorting, graphs, dynamic prog.—numerical algorithms: dense and sparse matrix algorithms
- Performance measurement and analysis of parallel programs
- GPU Programming with CUDA
- Problem solving on clusters using MapReduce

#### **Prerequisites**

- Programming in C and/or Fortran
- Basics of data structures
- Basics of machine architecture
- Prerequisites

• See me if you have concerns

#### **Motivations for Parallel Computing**

- Technology push
- Application pull

The Rise of Multicore Processors

#### Advance of Semiconductors: "Moore's Law"

#### **Gordon Moore, Founder of Intel**

- 1965: since the integrated circuit was invented, the number of transistors/inch<sup>2</sup> in these circuits roughly doubled every year; this trend would continue for the foreseeable future
- 1975: revised circuit complexity doubles every two years



Image credit: http://download.intel.com/research/silicon/Gordon\_Moore\_ISSCC\_021003.pdf

#### **Evolution of Microprocessors 1971-2009**







Intel 4004, 1971 1 core, no cache 23K transistors Intel 8008, 1978 1 core, no cache **29K transistors** 

Intel Nehalem-EX, 2009 8 cores, 24MB cache 2.3B transistors

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77 10.1145/1941487.1941507.

Graphics Processors NVIDIA Kepler GK110, May 2011 7.1B transistors

### Leveraging Moore's Law Trends

#### From increasing circuit density to performance

- More transistors = ↑ opportunities for exploiting parallelism
- Microprocessors
  - —implicit parallelism: invisible to the programmer
    - pipelined execution of instructions
    - multiple functional units for multiple independent pipelines
  - —explicit parallelism
    - SIMD processor extensions
      - operations on 1, 2, and 4 data items per instruction
      - integer, floating point, complex data

e.g. SSE, AVX

long instruction words (VLIW)

bundles of independent instructions that can be issued together

e.g., Itanium processor

### **Microprocessor Architecture (Mid 90's)**

- Superscalar (SS) designs were the state of the art

  - -dynamic scheduling: HW tracks dependencies between instructions
  - -speculative execution: look past predicted branches
- Apparent path to higher performance?
  - -wider instruction issue
  - -support for more speculation

### The End of the Free Lunch

**Increasing issue width provides diminishing returns** 

Two factors<sup>1</sup>

• Fundamental circuit limitations

—delays ↑ as issue queues ↑ and multi-port register files ↑
 —increasing delays limit performance returns from wider issue

- Limited amount of instruction-level parallelism
  - —inefficient for codes with difficult-to-predict branches

<sup>1</sup><u>The case for a single-chip multiprocessor</u>, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.



#### **Sources of Wasted Issue Slots**

- TLB miss
- I cache miss
- D cache miss
- Load delays (L1 hits)
- Branch misprediction
- Instruction dependences
- Memory conflict

Memory Hierarchy

**Control Flow** 

**Instruction Stream** 

#### **Simulations of 8-issue Superscalar**



#### **Power and Heat Stall Clock Frequencies**

#### **New York Times**

May 17, 2004 ... Intel, the world's largest chip maker, publicly acknowledged that it had hit a "thermal wall" on its microprocessor line. As a result, the company is changing its product strategy and disbanding one of its most advanced design groups. Intel also said that it would abandon two advanced chip development projects ...

Now, Intel is embarked on a course already adopted by some of its major rivals: obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor ... Intel's decision to change course and embrace a "dual core" processor structure shows the challenge of overcoming the effects of heat generated by the constant on-off movement of tiny switches in modern computers ... some analysts and former Intel designers said that *Intel was coming to terms with escalating heat problems so severe they threatened to cause its chips to fracture at extreme temperatures...* 

#### **Technology Trends**



#### **Recent Multicore Processors**

- Sept 13: Intel Ivy Bridge-EP Xeon E5-2695 v2 — 12 cores; 2-way SMT; 30MB cache
- March 13: SPARC T5
  - 16 cores; 8-way fine-grain MT per core
- May 12: AMD Trinity — 4 CPU cores; 384 graphics cores
- Nov 12: Intel Xeon Phi coprocessor — ~60 cores
- Feb 12: Blue Gene/Q
  - 17 cores; 4-way SMT
- Q4 11: Intel Ivy Bridge
   4 cores; 2 way SMT;
- November 11: AMD Interlagos

— 16 cores

• Jan 10: IBM Power 7

- 8 cores; 4-way SMT; 32MB shared cache

• Tilera TilePro64



Figure credit: Ruud Haring, Blue Gene/Q compute chip, Hot Chips 23, August, 2011.

#### SPARC T5 (March 26 2013)



- 16 S3 cores@ 3.6GHz
- 8MB shared L3 Cache
- 8 DDR3 BL8 Schedulers providing 80 GB/s BW
- 8-way 1-hop glueless scalability
- Integrated 2x8 PCIe Gen 3
- Advanced Power Management with DVFS
- 8 threads/core
- Dual-issue pipeline
- On-die accelerators for
  - -encryption, RNG, SHA

The 5th Generation of Sparc CMT: T5 Rick Hetherington

## **Application Pull**

- Complex problems require computation on large-scale data
- Sufficient performance available only through massive parallelism

## **Computing and Science**

- "Computational modeling and simulation are among the most significant developments in the practice of scientific inquiry in the 20th century. Within the last two decades, scientific computing has become an important contributor to all scientific disciplines.
- It is particularly important for the solution of research problems that are insoluble by traditional scientific theoretical and experimental approaches, hazardous to study in the laboratory, or time consuming or expensive to solve by traditional means"

 — "Scientific Discovery through Advanced Computing" DOE Office of Science, 2000

#### **The Need for Speed: Complex Problems**

- Science

  - -storm forecasting and climate prediction

#### • Engineering

- -combustion and engine design
- -computational fluid dynamics and airplane design
- -earthquake and structural modeling
- -pollution modeling and remediation planning
- ----molecular nanotechnology
- Business
  - -computational finance high frequency trading
  - —information retrieval
  - -data mining
- Defense
  - -nuclear weapons stewardship
  - -cryptology

#### The Scientific Case for Exascale Computing

- Predict regional climate changes: sea level rise, drought and flooding, and severe weather patterns
- Reduce carbon footprint of transportation
- Improve efficiency and safety of nuclear energy
- Improve design for cost-effective renewable energy resources such as batteries, catalysts, and biofuels
- Certify the U.S. nuclear stockpile
- Design advanced experimental facilities, such as accelerators, and magnetic and inertial confinement fusion
- Understand properties of fission and fusion reactions
- Reverse engineer the human brain
- Design advanced materials



#### **Earthquake Simulation**



Earthquake Research Institute, University of Tokyo Tonankai-Tokai Earthquake Scenario Photo Credit: The Earth Simulator Art Gallery, CD-ROM, March 2004

#### **Ocean Circulation Simulation**



Photo Credit: The Earth Simulator Art Gallery, CD-ROM, March 2004

## **Community Earth System Model (CESM)**

| Atmosphere Component                                                                    | CAM DATM (WRF)                               |
|-----------------------------------------------------------------------------------------|----------------------------------------------|
| CAM Modes: Multiple Dycores, Mult                                                       | iple Chemistry Options, WACCM, single column |
| Data-ATM: Multiple Forcing/Physics Modes                                                |                                              |
| Land Component                                                                          | CLM DLND (VIC)                               |
| CLM Modes: no BGC, BGC, Dynamic-Vegetation, BGC-DV, Prescribed-Veg, Urban               |                                              |
| Data-LND: Multiple Forcing/Physics Modes                                                |                                              |
| Ice Component                                                                           | CICE DICE                                    |
| CICE Modes: Fully Prognostic, Prescribed                                                |                                              |
| Data-ICE : Multiple Forcing/Physics Modes                                               |                                              |
| Ocean Component                                                                         | POP DOCN(SOM/DOM) (ROMS)                     |
| POP Modes: Ecosystem, Fully-coupled, Ocean-only, Multiple Physics Options               |                                              |
| Data-OCN : Multiple Forcing/Physics Modes (SOM/DOM)                                     |                                              |
| New Land Ice Component                                                                  |                                              |
| Coupler<br>Regridding, Merging, Calculation of ATM/OCN fluxes, Conservation diagnostics |                                              |
| Figure cour                                                                             | tesy of M. Vertenstein (NCAR)                |

#### **CESM Execution Configurations**



## **CESM Simulations on a Cray XT**



## **Simulating Turbulent Reacting Flows: S3D**

- Direct numerical simulation (DNS) of turbulent combustion
  - - PI: Jaqueline H. Chen, SNL
  - -2010: Cray XT (65M) Blue Gene/P(2M) CPU hours
  - —"High-fidelity simulations for clean and efficient combustion of alternative fuels"



#### • Science

- -study micro-physics of turbulent reacting flows
  - physical insight into chemistry turbulence interactions
- -simulate detailed chemistry; multi-physics (sprays, radiation, soot)
- —develop and validate reduced model descriptions used in macroscale simulations of engineering-level systems



Text and figures courtesy of Jacqueline H. Chen, SNL





#### **Fluid-Structure Interactions**

• Simulate ...

---rotational geometries (e.g. engines, pumps), flapping wings

- Traditionally, such simulations have used a fixed mesh —drawback: solution quality is only as good as initial mesh
- Dynamic mesh computational fluid dynamics
  - —integrate automatic mesh generation within parallel flow solver
    - nodes added in response to user-specified refinement criteria
    - nodes deleted when no longer needed
    - element connectivity changes to maintain minimum energy mesh
  - —mesh changes continuously as geometry + solution changes
- Example: 3D simulation of a hummingbird's flight

### Air Velocity (Front)



## Air Velocity (Side)



Andrew A. Johnson Army HPC Research Center

#### **Mesh Adaptation (front)**



36

# Mesh Adaptation (side)



# **Challenges of Explicit Parallelism**

• Algorithm development is harder

-complexity of specifying and coordinating concurrent activities

- Software development is much harder
  - —lack of standardized & effective development tools and programming models
  - -subtle program errors: race conditions
- Rapid pace of change in computer system architecture
  - —a great parallel algorithm for one machine may not be suitable for another
    - example: homogeneous multicore processors vs. GPGPU

# Hummingbird Simulation in UPC

- UPC: PGAS language for scalable parallel systems
- Application overview
  - -distribute mesh among the processors
  - -partition the mesh using recursive bisection
  - -each PE maintains and controls its piece of the mesh
    - has a list of nodes, faces, and elements
  - -communication and synchronization
    - read-from or write-to other PE's "entities" as required
    - processors frequently synchronize using barriers
    - use "broadcast" and "reduction" patterns
  - -constraint
    - only 1 processor may change the mesh at a time

# **Algorithm Sketch**

At each time step...

- Test if re-partitioning is required
- Set up interprocessor communication if mesh changed
- Split elements into vectorizable groups
- Calculate the refinement value at each mesh node
- Move the mesh
- Solve the coupled fluid-flow equation system
- Update the mesh to ensure mesh quality

-swap element faces to obtain a "Delaunay" mesh

- ----add nodes to locations where there are not enough
- -delete nodes from locations where there are too many
- -swap element faces to obtain a "Delaunay" mesh

# Parallel Hardware in the Large

#### **Hierarchical Parallelism in Supercomputers**

System

(64 cabinets, 64x32x32)

- Cores with pipelining and short vectors
- Multicore processors
- Shared-memory multiprocessor nodes



# Blue Gene/Q Packaging Hierarchy



#### **Achieving High Performance on Parallel Systems**

#### **Computation is only part of the picture**

- Memory latency and bandwidth
  - CPU rates have improved 4x as fast as memory over last decade
  - bridge speed gap using memory hierarchy
  - multicore exacerbates demand
- Interprocessor communication
- Input/output
  - I/O bandwidth to disk typically grows linearly with # processors



### **Historical Concurrency in Top 500 Systems**





# Scale of the Largest HPC Systems (Nov 2013)

| Rank | Site                                                                  | System                                                                                                                      | Cores     | Rmax<br>(TFlop/s) | Rpeak<br>(TFlop/s) | Power<br>(kW) | > 1.5M            |
|------|-----------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|-----------|-------------------|--------------------|---------------|-------------------|
| 0    | National Super Computer Center<br>in Guangzhou<br>China               | Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-<br>2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P<br>NUDT | 3,120,000 | 33,862.7          | 54,902.4           | 17,808        | cores             |
| 2    | DOE/SC/Oak Ridge National<br>Laboratory<br>United States              | Titan - Cray XK7 , Opteron 6274 16C 2.200GHz, Cray Gemini<br>interconnect, NVIDIA K20x<br>Cray Inc.                         | 560,640   | 17,590.0          | 27,112.5           | 8,209         |                   |
| 3    | DOE/NNSA/LLNL<br>United States                                        | Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom<br>IBM                                                                 | 1,572,864 | 17,173.2          | 20,132.7           | 7,890         | all               |
| 4    | RIKEN Advanced Institute for<br>Computational Science (AICS)<br>Japan | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect<br>Fujitsu                                                             | 705,024   | 10,510.0          | 11,280.4           | 12,660        | > 100K<br>cores   |
| 5    | DOE/SC/Argonne National<br>Laboratory<br>United States                | Mira - BlueGene/Q, Power BQC 16C 1.60GHz, Custom<br>IBM                                                                     | 786,432   | 8,586.6           | 10,066.3           | 3,945         |                   |
| 6    | Swiss National Supercomputing<br>Centre (CSCS)<br>Switzerland         | Piz Daint - Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries<br>interconnect , NVIDIA K20x<br>Cray Inc.                           | 115,984   | 6,271.0           | 7,788.9            | 2,325         | hybrid<br>CPU+GPU |
| 7    | Texas Advanced Computing<br>Center/Univ. of Texas<br>United States    | Stampede - PowerEdge C8220, Xeon E5-2680 8C 2.700GHz,<br>Infiniband FDR, Intel Xeon Phi SE10P<br>Dell                       | 462,462   | 5,168.1           | 8,520.1            | 4,510         |                   |
| 8    | Forschungszentrum Juelich (FZJ)<br>Germany                            | JUQUEEN - BlueGene/Q, Power BQC 16C 1.600GHz, Custom<br>Interconnect<br>IBM                                                 | 458,752   | 5,008.9           | 5,872.0            | 2,301         |                   |
| 9    | DOE/NNSA/LLNL<br>United States                                        | Vulcan - BlueGene/Q, Power BQC 16C 1.600GHz, Custom<br>Interconnect<br>IBM                                                  | 393,216   | 4,293.3           | 5,033.2            | 1,972         |                   |
| 10   | Leibniz Rechenzentrum<br>Germany                                      | SuperMUC - iDataPlex DX360M4, Xeon E5-2680 8C 2.70GHz,<br>Infiniband FDR                                                    | 147,456   | 2,897.0           | 3,185.1            | 3,423         | 47                |

# **Challenges of Parallelism in the Large**

- Parallel science applications are often very sophisticated — e.g. adaptive algorithms may require dynamic load balancing
- Multilevel parallelism is difficult to manage
- Extreme scale exacerbates inefficiencies
  - algorithmic scalability losses
  - serialization and load imbalance
  - communication or I/O bottlenecks
  - insufficient or inefficient parallelization
- Hard to achieve top performance even on individual nodes
  - contention for shared memory bandwidth
  - memory hierarchy utilization on multicore processors

# **Thursday's Class**

- Introduction to parallel algorithms
  - -tasks and decomposition
  - -processes and mapping
  - -processes versus processors
- Decomposition techniques
  - -recursive decomposition
  - -data decomposition
  - -exploratory decomposition
  - -hybrid decomposition
- Characteristics of tasks and interactions
  - -task generation, granularity, and context
  - -characteristics of task interactions

# **Parallel Machines for the Course**

• STIC

- -170 nodes, each with two 4-core Intel Nehalem processors
- -Infiniband interconnection network
- -no global shared memory

• Biou

- -48 nodes, each with four 8-core IBM Power7 processors
  - 4-way SMT (4 hardware threads per processor core); 256GB/node
- -Infiniband interconnection network
- -no global shared memory
- DAVinCl
  - —Intel Westmere processors

  - -Infiniband interconnection network
  - -no global shared memory