This tutorial is intended to be an introduction to using LC's Linux
clusters. It begins by providing a brief historical background of Linux clusters
at LC, noting their success and adoption as a production, high performance
computing platform. The primary hardware components of LC's Linux clusters
are then presented, including the various types of nodes, processors and switch interconnects. The detailed hardware configuration for each of LC's production Linux clusters completes the hardware related information.
After covering the hardware related topics, software topics are discussed,
including the LC development environment, compilers, and how to run both batch and interactive parallel jobs. Important issues in each of these areas are
noted. Available debuggers and performance related tools/topics are briefly discussed, however detailed usage is beyond the scope of this tutorial.
A lab exercise using one of LC's Linux clusters follows the presentation.
Level/Prerequisites: This tutorial is intended for those who are new to developing parallel programs in LC's Linux cluster environment. A basic understanding of parallel programming in C or Fortran is required. The material covered by the following tutorials would also be helpful:
EC3501: Livermore Computing Resources and Environment
EC4045: Moab and SLURM
Background of Linux Clusters at LLNL
The Linux Project:
LLNL first began experimenting with Linux clusters in 1999-2000 in a
partnership with Compaq and Quadrics to port Quadrics software to
Alpha Linux.
The Linux Project was started for several reasons:
Cost: price-performance analysis demonstrated that near-commodity
hardware in clusters running Linux could be more cost-effective
than proprietary solutions;
Focus: the decreasing importance of high-performance computing
(HPC) relative to commodity purchases was making it more difficult to
convince proprietary systems vendors to implement HPC specific
solutions;
Control: it was believed that by controlling the OS in-house,
Livermore Computing could better support its customers;
Community: the platform created could be leveraged by the general
HPC community.
The objective of this effort was to apply LC's scalable systems strategy
(the "Livermore Model") to commodity hardware running the open source
Linux OS:
Based on SMP compute nodes attached to a high-speed, low-latency
interconnect.
Uses OpenMP to exploit SMP parallelism within a node and MPI
to exploit parallelism between nodes.
Provides a parallel file system with a POSIX interface.
Application toolset: C, C++ and Fortran compilers, scalable
MPI/OpenMP GUI debugger, performance analysis tools.
The first Linux cluster implemented by LC was LX, a Compaq Alpha Linux
system with no high-speed interconnect.
The first production Alpha cluster targeted to implement the full
Livermore Model was Furnace, a 64-node system with two EV68
processors per node and a QsNet interconnect. However...
Compaq announced the eventual discontinuation of the Alpha server line.
The Intel Pentium 4, with favorable SPECfp performance, was released just
as Furnace was delivered.
This prompted Livermore to shift to an Intel IA32-based model for its Linux
systems in July 2001.
Furnace's interconnect was allocated to the IA32-based PCR clusters (below)
instead. Furnace then operated as a loosely coupled cluster until it was
decommissioned in October 2003.
PCR Clusters:
August 2001: The Parallel Capacity Resource (PCR) clusters were purchased
from Silicon Graphics and Linux NetworX. Consisted of:
Adelie: 128-node production cluster
Emperor: 88-node production cluster
Dev: 26-node development cluster
Each PCR compute node had two 1.7-GHz Intel Pentium 4 CPUs; the clusters
used a QsNet Elan3 interconnect.
Parallel file system was not implemented at that time - instead
dedicated BlueArc NFS servers were used.
SCF resource only
In July 2002, the 16-node Pengra cluster was procured for the OCF to
provide a less restrictive development environment for PCR-related work.
The success of the PCR clusters was followed by the purchase of the
Multiprogrammatic Capability Resource (MCR) cluster in July, 2002 from
Linux NetworX.
1,152-node cluster comprised of dual-processor, 2.4 GHz Intel Xeon nodes
MCR's procurement was intended to significantly increase the resources
available to Multiprogrammatic and Institutional Computing (M&IC) users.
MCR's configuration included the first production implementation of the
Lustre parallel file system, an integral part of the "Livermore Model".
Debuted as #5 on the Top500 Supercomputers list in November, 2002, and then
peaked at #3 in June, 2003.
Convinced of the success of this path, LC implemented several
other IA-32 Linux clusters simultaneously with, or after, the MCR Linux
cluster:
System   Network   Nodes   CPUs/Cores   Gflops
ALC      OCF         960        1,920    9,216
LILAC    SCF         768        1,536    9,186
ACE      SCF         160          320    1,792
SPHERE   OCF          96          192    1,075
GVIZ     SCF          64          128      717
ILX      OCF          67          134      678
PVC      OCF          64          128      614
Which Led To Thunder...
In September, 2003 the RFP for LC's first IA-64 cluster was released.
The proposal from California Digital Corporation, a small local company,
was accepted.
1024 node system comprised of 4-CPU Itanium 2 "Madison Tiger4" nodes
Thunder debuted as #2 on the Top500 Supercomputers list in June, 2004.
Peak performance was 22.9 Tflops.
In early 2006, LC launched its Opteron/Infiniband Linux cluster procurement
with the release of the Peloton RFP.
Appro (www.appro.com) was awarded the contract in June 2006.
Peloton clusters were built in 5.5 Tflop "scalable units" (SU) of ~144 nodes
All Peloton clusters used AMD dual-core Socket F Opterons:
8 cpus per node
2.4 GHz clock
Option to upgrade to 4-core Opteron "Deerhound" later (not taken)
The six Peloton systems represented a mix of resources: OCF, SCF,
ASC, M&IC, Capability and Capacity:
System   Network   Nodes   CPUs/Cores   Tflops
Atlas    OCF       1,152        9,216     44.2
Minos    SCF         864        6,912     33.2
Rhea     SCF         576        4,608     22.1
Zeus     OCF         288        2,304     11.1
Yana     OCF          83          640      3.1
Hopi     SCF          76          608      2.9
The last Peloton clusters were retired in June 2012.
[Photo: Atlas Peloton cluster]
And Then, TLCC and TLCC2:
In July, 2007 the Tri-laboratory Linux Capacity Cluster (TLCC) RFP was
released.
The TLCC procurement represented the first time that the Department of
Energy/National Nuclear Security Administration (DOE/NNSA) awarded a
single purchase contract covering all three national defense
laboratories: Los Alamos, Sandia, and Livermore.
The TLCC architecture is very similar to the Peloton architecture: Opteron
multi-core processors with an Infiniband interconnect. The primary
difference is that TLCC clusters are quad-core instead of dual-core.
TLCC clusters were/are:
System   Network   Nodes   CPUs/Cores   Tflops
Juno     SCF       1,152       18,432    162.2
Hera     OCF         864       13,824    127.2
Eos      SCF         288        4,608     40.6
In June 2011, the TLCC2 procurement was announced as a follow-on to the
successful TLCC systems.
The TLCC2 systems are LC's current "capacity" compute platform, consisting of
multiple Intel Xeon E5-2670 (Sandy Bridge EP), QDR Infiniband based clusters:
System   Network   Nodes   CPUs/Cores   Tflops
Zin      SCF       2,916       46,656    961.1
Cab      OCF-CZ    1,296       20,736    426.0
Rzmerl   OCF-RZ      162        2,592     53.9
Pinot    SNSI        162        2,592     53.9
And Now? What Next?
LC has continued to procure Linux clusters for various purposes, which
follow the same basic architecture as the TLCC2 systems:
64-bit, x86 architecture (Intel)
multi-core, multi-socket
QDR Infiniband interconnect
As of June 2015, follow-on TLCC-type procurements are in progress.
They will be called Commodity Technology Systems (CTS), and are
expected to begin arriving in mid-2016.
[Photos: Juno and Eos TLCC clusters; Zin TLCC2 cluster]
Cluster Configurations and Scalable Units
Basic Components:
Currently, LC has several types of production Linux clusters, based on
the AMD Opteron and Intel Xeon processor architectures covered in the
hardware sections of this tutorial.
All of LC's Linux clusters differ in their configuration details; however,
they share the same basic hardware building blocks:
Nodes
Frames / racks
High speed Infiniband interconnect (most clusters)
Other hardware (file systems, management hardware, etc.)
Nodes:
The basic building block of a Linux cluster is the node. A node is
essentially an independent computer. Key features:
Self-contained, diskless, multi-core computer.
Low form factor - cluster nodes are very thin in order to
save space.
Rack mounted - nodes are mounted compactly in a drawer fashion to
facilitate maintenance and reduce footprint.
Remote Management - There is no keyboard, mouse, monitor or other
device typically used to interact with a computer. All node management
occurs over the network from a "management" node.
Nodes are typically configured into 4 types, according to their function:
Compute nodes - Nodes that run user jobs. The majority of nodes
in a cluster. Compute nodes are typically split into one of
two partitions: batch or interactive/debug.
Login nodes - One or more per cluster. This is where you login
for access to the compute nodes. Login nodes are also used
to build your applications and control cluster jobs.
Gateway (I/O) nodes - These nodes are dedicated
fileservers. They connect the compute nodes to essential file systems
which are mounted on disk storage devices, such as Lustre OSTs.
The number of these nodes varies per cluster.
Administrative/management nodes - Used by system administrators to
manage the entire cluster. Not accessible to users.
[Photos: single compute node - TLCC2; single compute node - TLCC]
Frames / Racks:
Frames are the physical cabinets that hold most of a cluster's components:
Nodes of various types
Switch components
Other network components
Parallel file system disk resources (usually in separate racks)
Vary in size/appearance between the different Linux clusters at LC.
Power and console management - frames include hardware and software that
allow system administrators to perform most tasks remotely.
[Photos: frames - TLCC2; frames - TLCC]
Scalable Unit:
The basic building block of LC's production Linux clusters is called a
"Scalable Unit" (SU). An SU consists of:
Nodes (compute, login, management, gateway)
First stage switches that connect to each node directly
Miscellaneous management hardware
Frames sufficient to house all of the hardware
Additionally, a second stage switch is needed
for every 2 SUs in a multi-SU cluster.
The number of nodes in an SU depends upon the type of first and
second stage switches being used.
[Photos: Xeon 5600 node with motherboard; Xeon 5500/5600 motherboard diagram; Sandy Bridge EP node with motherboard]
AMD Opteron Hardware Overview
AMD Opteron Processor
Basics:
The AMD Opteron debuted in April, 2003 as the first 64-bit architecture
compatible with the industry-standard x86 instruction set.
Built on AMD64 technology (formerly code named "Hammer") which extends full
backward compatibility for 32-bit x86 software. Also enables
64-bit computing for non-x86 applications.
Initial offering was a single-core processor designed with multi-core
in mind. Dual-core processor became available in April, 2005 and a
quad-core processor in 2007.
Multi-core Opterons are used to build 2-, 4-, 8- and 16-way SMPs.
Target market: multi-processor servers and workstations.
AMD also manufactures multi-core 64-bit processors (e.g., Athlon) for
other markets.
Most of LC's Opteron based clusters are built from quad-core
processors.
Because Opterons are x86 based, they are "little endian" like the
Intel IA32 processor.
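A quick way to confirm byte order (a generic C sketch, not specific to LC systems) is to examine the first byte of a multi-byte integer; on little-endian x86 hardware it holds the least significant byte:
#include <stdio.h>
int main(void)
{
    unsigned int x = 1;
    /* On a little-endian machine, the least significant byte is stored first. */
    unsigned char *p = (unsigned char *)&x;
    printf("This machine is %s-endian\n", (*p == 1) ? "little" : "big");
    return 0;
}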
64-bit architecture providing full 64-bit memory addressing
Clockspeed: 2.2 - 2.3 GHz
Floating point units: 4 results per clock cycle
Caches:
Cache                      Quad-core
Dedicated L1 Instruction   64 KB, 64-byte line, 2-way associative
Dedicated L1 Data          64 KB, 64-byte line, 2-way associative
Dedicated L2               512 KB
Shared L3                  2 MB
Direct Connect Architecture:
No front side bus - directly connects multiple processor cores,
memory controller and I/O to the Socket F processor.
Helps eliminate the bottlenecks
inherent in a traditional front-side bus.
Integrated (on-die) memory controller. CPU-memory bandwidth is 10.7 GB/s
per Socket F with DDR2-667 DIMMs.
HyperTransport interconnects (3) directly connect sockets to I/O subsystems,
other chipsets, and other sockets (off-chip CPUs).
Provide 8 GB/s bandwidth per
link (4 GB/s each direction). Aggregate bandwidth of 24 GB/s
per processor (shared by all cores).
Full support for Intel's SIMD vectorization instructions (SSE, SSE2, SSE3...)
Reduced power consumption:
Energy efficient DDR2 memory
AMD dynamic power management technology
AMD Virtualization: disparate applications can coexist on same system. Each
application's environment (operating system, middleware, communications,
etc.) is represented as a virtual machine.
Design features common to most modern processors:
Enhanced branch prediction capabilities
Speculative / out-of-order execution
Superscalar
Pipelined architecture
Chipsets/Motherboards:
As with other vendor processors, AMD Opterons are commonly combined with
other components to make a complete system.
Motherboards for Opteron systems are manufactured by a number of vendors.
LC's machines use the SuperMicro H8QM8-2 motherboard for both dual-core
and quad-core machines.
Ports/controllers for SCSI, SATA, IDE, USB, keyboard, mouse, serial,
parallel, etc. (most of these not used by LC machines)
Infiniband Interconnect Overview
Infiniband Interconnect
The high speed interconnect for LC's Linux clusters is Infiniband.
The type of Infiniband hardware varies by cluster. In general:
Most Intel Xeon clusters use 4x QDR QLogic switches and adapters
A few older clusters use 4x DDR Voltaire switches
and Mellanox adapters.
4x Infiniband:
4x = 4 times the base Infiniband link rate of 2.5 Gbits/sec,
which equals 10 Gbits/sec, full duplex.
SDR = Single Data Rate - 10 Gbits/sec (1.25 GB/s)
DDR = Double Data Rate - 20 Gbits/sec (2.5 GB/s)
QDR = Quad Data Rate - 40 Gbits/sec (5 GB/s)
Primary components:
Adapter Card:
Communications processor packaged on network PCI Express adapter card.
Remote Direct Memory Access (RDMA) improves communication bandwidth by
off-loading communications from the CPU.
Provides the interface between a node and a two-stage network.
Copper cabling connects each adapter card to a first stage
switch, in most cases. A few LC clusters (ansel, catalyst)
without a second stage switch use
optic fiber to connect to a single core switch.
QLogic QDR or Mellanox 4x DDR
[Photos: QLogic IB adapter; Mellanox IB adapter]
1st Stage Switch:
QLogic QDR 36-port: 18 ports connect to adapters in nodes and 18
ports connect to a second stage switch.
Voltaire DDR 24-port: 12 ports connect to adapters in nodes and 12
ports connect to a second stage switch.
2nd Stage Switch:
QLogic QDR 18-864 port: all used ports connect to first stage switches
via fiber optic cabling.
Voltaire DDR 288-port: all ports connect to first stage switches
via copper cabling.
[Photos: QLogic 1st and 2nd stage switches (back); Voltaire 1st stage switches (front, back); Voltaire 2nd stage switch (front, back)]
Topology:
Two-stage, federated, bidirectional, fat-tree.
The number of second stage switches depends upon the number
of scalable units (SUs) that comprise the cluster and the type of switch
hardware used.
Examples:
Juno (8 SUs): 1152-way interconnect
Sierra (12 SUs): 1944-way interconnect
Performance:
The inter-node bandwidth measurements below were taken on live, heavily
loaded LC machines, using a simple MPI non-blocking test code with one
task on each of two nodes. Not all systems are represented. Your mileage
may vary.
System Type                                                  Latency   Bandwidth
Older clusters with DDR Voltaire/Mellanox (Juno)             ~3-5 us   ~2.4 GB/sec
Intel Xeon clusters with QDR QLogic (Muir, Sierra, Ansel)    ~1-2 us   ~4.1 GB/sec
Intel Xeon clusters with QDR QLogic (TLCC2)                  ~1 us     ~5.0 GB/sec
Software and Development Environment
This section only provides a summary of the software and development
environment for LC's Linux clusters. Please see the
Introduction to LC Resources tutorial for details.
TOSS Operating System:
All LC Linux clusters use TOSS (Tri-Laboratory Operating System Stack).
The primary components of TOSS include:
Red Hat Enterprise Linux (RHEL) distribution with modifications to
support targeted HPC hardware and cluster computing
RHEL kernel optimized for large scale cluster computing
OpenFabrics Enterprise Distribution InfiniBand software stack including
MVAPICH and OpenMPI libraries
Moab workload manager and SLURM resource manager
Integrated Lustre and Panasas parallel file system software
Scalable cluster administration tools
Cluster monitoring tools
C, C++ and Fortran90 compilers (GNU, Intel, PGI)
Testing software framework for hardware and operating system
validation
Batch Systems:
SLURM
LC's Simple Linux Utility for Resource Management
SLURM is the native job scheduling system on each cluster
The SLURM resource manager on one cluster does not communicate with
the SLURM resource managers on other clusters
File Systems:
Lustre parallel file systems - shared by all users on a cluster or
multiple clusters; usually mounted by multiple clusters. Lustre is
discussed in the Parallel File Systems section of
the Introduction to Livermore Computing Resources tutorial.
/nfs/tmp# - large, globally mounted NFS file systems shared across all
machines by all users; not backed up; purged; quotas in effect.
/var/tmp, /usr/tmp, /tmp - different names for the same file system;
locally mounted (non-NFS); moderate size; not backed up; purged;
shared by all users on a given node.
Archival HPSS storage - accessed by "ftp storage".
Virtually unlimited file space; not backed up or purged. More info:
computing.llnl.gov/jobs/#sbp - see the section on storage.
LC provides the Dotkit system as a convenient, uniform way to select among
multiple versions of software installed on the LC systems.
Dotkit provides a simple user interface that is standard across UNIX shells.
Dotkit is most often used to setup your environment for a particular
compiler or application that isn't in your default environment.
Using Dotkit:
List available packages: use -l
Load a package: use package_name
Unload a package: unuse package_name
List loaded packages: use
Search available packages: use -l keyword
Read package help info: use -h package_name
Display package contents: use -hv package_name
Show default packages: dpkg-defaults (look for asterisks)
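For example, to switch to a specific GNU compiler version (the package name here, gcc-4.9.2p, is one of those shown in the compiler listing later in this tutorial):
% use -l compilers        (list available compiler packages)
% use gcc-4.9.2p          (load the chosen package)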
Development Environment Group supported software includes compilers, libraries, debugging, profiling, trace
generation/visualization, performance analysis tools, correctness tools, and several utilities:
https://computing.llnl.gov/?set=code&page=software_tools
LINMath, the Livermore Interactive Numerical Mathematical Software Access Utility, is a Web-based access utility for math library software. The LINMath Web site also has pointers to packages available from external sources:
http://www-lc.llnl.gov/linmath/
Center for Applied Scientific Computing (CASC) Software
A wide range of software available for download from LLNL's CASC. Includes mathematical software, language tools, PDE software frameworks, visualization, data analysis, program analysis, debugging, and benchmarks:
https://computation.llnl.gov/casc/software.php
Man Pages:
Linux man pages (and SLURM man pages) are located in
/usr/share/man.
If you have problems bringing up the man page for a Linux command, check
your MANPATH variable for /usr/share/man.
Compilers
General Information
Available Compilers and Invocation Commands:
The table below summarizes compiler availability and invocation commands
on LC Linux clusters.
Note that parallel compiler commands are actually LC scripts that
ultimately invoke the corresponding serial compiler.
Compiler   Language   Serial Command   Parallel Command
Intel      C          icc              mpiicc
           C++        icpc             mpiicpc
           Fortran    ifort            mpiifort
PGI        C          pgcc             mpipgcc
           C++        pgCC             mpipgCC
           Fortran    pgf77, pgf90     mpipgf77, mpipgf90
GNU        C          gcc              mpicc, mpigcc
           C++        g++              mpiCC, mpig++
           Fortran    g77, gfortran    mpif77, mpif90, mpigfortran
Versions, Defaults and Paths:
LC usually maintains multiple versions of each compiler on a cluster. To
see all available versions, login and use the command
use -l compiler. Example (truncated) below:
% use -l gnu
compilers/gnu ----------
gcc-4.4.6 - GNU Compiler Collection (gcc/g++/gfortran)
gcc-4.6.1 - GNU Compiler Collection (gcc/g++/gfortran)
gcc-4.7.1p - Pure GNU Compilers (gcc/g++/gfortran) provided AS IS
gcc-4.8.2p - Pure GNU Compilers (gcc/g++/gfortran) provided AS IS
gcc-4.9.0p - Pure GNU Compilers (gcc/g++/gfortran) provided AS IS
gcc-4.9.2p - Pure GNU Compilers (gcc/g++/gfortran) provided AS IS
gcc - GNU Compiler Collection (gcc/g++/gfortran)
...
[ output omitted here ]
...
mpi/openmpi ----------
openmpi-gnu-1.4.3 - Open MPI v1.4.3 for GNU compilers
openmpi-gnu-1.6.5 - Open MPI v1.6.5 for GNU compilers
openmpi-gnu-1.8.3 - Open MPI v1.8.3 for GNU compilers
openmpi-gnu-1.8.4 - Open MPI v1.8.4 for GNU compilers
openmpi-gnu-debug-1.4.3 - Open MPI v1.4.3 for GNU compilers (debug)
openmpi-gnu-debug-1.6.5 - Open MPI v1.6.5 for GNU compilers (debug)
openmpi-gnu-debug-1.8.3 - Open MPI v1.8.3 for GNU compilers (debug)
openmpi-gnu-debug-1.8.4 - Open MPI v1.8.4 for GNU compilers (debug)
openmpi-gnu-debug - Open MPI v1.4.3 for GNU compilers (debug)
To see the default compiler version for LC machines, consult the
Compilers Installed on LC Platforms web page.
Alternately, use the dpkg-defaults command to list
software versions; defaults are denoted by an asterisk.
To determine the actual version you are using, issue the compiler invocation
command with its "version" option. For example:
Compiler   Option           Example
Intel      -V               ifort -V
PGI        -V               pgf90 -V
GNU        -v, --version    g++ --version
To use a specific version of a compiler other than the default do
the following:
Login to the machine where you will be working
Issue the use -l compilers
command to list all available compiler versions.
See what is available and select one, noting its package/dotkit name
Issue the command: use package-name
Floating-point Exceptions:
The IEEE floating point standard defines several exceptions (FPEs)
that occur
when the result of a floating point operation is unclear or undesirable:
overflow: an operation's result is too large to be represented as a
float. Can be trapped, or else returned as a +/- infinity.
underflow: an operation's result is too small to be represented as a
normalized float. Can be trapped, or else represented
as a denormalized float (zero exponent with non-zero fraction) or zero.
divide-by-zero: attempting to divide a float by zero. Can be trapped,
or else returned as a +/- infinity.
inexact: result was rounded off. Can be trapped or returned
as rounded result.
invalid: an operation's result is ill-defined, such as 0/0 or the
sqrt of a negative number. Can be trapped or returned as NaN (not a
number).
By default, the Opteron and Xeon processors used at LC mask/ignore
FPEs. Programs that encounter FPEs will not terminate
abnormally, but instead, will continue execution
with the potential of producing wrong results.
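A small C sketch (generic, not LC-specific; link with -lm if needed) illustrating this default masked behavior - each operation below raises an FPE, yet the program runs to completion and simply prints inf, nan, or a denormalized value:
#include <stdio.h>
#include <math.h>
int main(void)
{
    /* volatile prevents the compiler from folding these at compile time */
    volatile double huge = 1e308, tiny = 1e-308, zero = 0.0, neg = -1.0;
    printf("overflow:       %g\n", huge * 10.0);   /* inf */
    printf("underflow:      %g\n", tiny / 1e10);   /* denormalized value */
    printf("divide by zero: %g\n", 1.0 / zero);    /* inf */
    printf("invalid:        %g\n", sqrt(neg));     /* nan */
    return 0;
}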
Compilers differ in their ability to handle FPEs:
Intel's Fortran compiler offers some means of controlling FPE handling
through the -fpe* options.
Intel's C/C++ compiler provides the -fp-trap*
options.
PGI compilers offer the convenient -Ktrap* options to
unmask FPEs.
GNU Fortran: offers the -ffpe-trap=* options
GNU C/C++ compilers: using the feenableexcept()
routine may help. See the man page for details. Simple example below:
#define _GNU_SOURCE   /* feenableexcept() is a GNU extension */
#include <fenv.h>
#include <stdio.h>
int main(int argc, char **argv) {
double zerodiv = 0.0;
feenableexcept(FE_ALL_EXCEPT); // Enable (unmask) all floating point exceptions
double nanval = 0.0/zerodiv;   // Raises FE_INVALID; program aborts with SIGFPE here
printf("Should not get here: zerodiv=%lf, nanval=%lf\n",zerodiv,nanval);
}
Unrelated to compilers, Linux treats integer divides-by-zero as
floating-point exceptions. For example, the following two codes will
behave differently. The integer divide by zero will abort execution,
while the floating point divide by zero will not.
This applies to both Fortran and C/C++.
#include <stdio.h>
int main()
{
int i = 2;
i /= 0;   /* integer divide by zero: raises SIGFPE and aborts */
printf("i = %d\n",i);
}
#include <stdio.h>
int main()
{
float i = 2;
i /= 0;   /* floating-point divide by zero: masked by default, i becomes inf */
printf("i = %f\n",i);
}
Precision, Performance and IEEE 754 Compliance:
Typically, most compilers do not guarantee IEEE 754 compliance
for floating-point arithmetic unless it is explicitly specified by a
compiler flag. This is because compiler optimizations are performed
at the possible expense of precision and performance.
Unfortunately for most programs, adhering to IEEE floating-point
arithmetic adversely affects performance.
If you are not sure whether your application needs this, try
compiling and running your program both with and without it to
evaluate the effects on both performance and precision.
See the relevant compiler documentation for details.
Mixing C and Fortran:
If you are linking C/C++ and FORTRAN code together, and need to explicitly
specify the FORTRAN or C/C++ libraries on the link line, LC provides a
general recommendation and example in the
/usr/local/docs/linux.basics file. See the
"MIXING C AND FORTRAN" section.
All of the other issues involved with mixed language programming apply, such as:
Column-major vs. row-major array ordering
Routine name differences - appended underscores
Arguments passed by reference versus by value
Common blocks vs. extern structs
Memory alignment differences
File I/O - Fortran unit numbers vs. C/C++ file pointers
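For illustration, a sketch of the "appended underscore, pass by reference" conventions noted above when calling Fortran from C. The routine name fsum and its argument list are hypothetical, and name-mangling details vary by compiler:
/* Fortran side (fsum.f90, hypothetical):
 *     subroutine fsum(a, n, s)
 *     integer n
 *     real*8 a(n), s
 *     ...
 */
#include <stdio.h>

/* Most Linux Fortran compilers append an underscore to external names. */
extern void fsum_(double *a, int *n, double *s);

int main(void)
{
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    int n = 4;
    double s = 0.0;
    fsum_(a, &n, &s);    /* all arguments passed by reference */
    printf("sum = %f\n", s);
    return 0;
}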
If your executable contains a large amount of static data, LC recommends
that you compile with the -mcmodel=medium option.
This flag allows the Data and .BSS (static variables) sections of your
executable to extend beyond a default 2 GB limit.
True for all compilers (see the respective compiler man page).
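For example, a program like the sketch below (the array size is illustrative) declares roughly 2.4 GB of static data and will fail to link without the flag:
#include <stdio.h>

/* 300,000,000 doubles = ~2.4 GB of static data, beyond the default
   2 GB limit, so build with, e.g., icc -mcmodel=medium big.c */
static double big[300000000];

int main(void)
{
    big[0] = 42.0;
    printf("big[0] = %f\n", big[0]);
    return 0;
}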
Compiler Documentation and Man Pages:
Man Pages:
Compiler man pages are for the default version of the compiler.
If you want to see the man pages for a different compiler version, load your
environment with the compiler version of interest with
the use package command first. For example:
use icc-12.1.339. Then issue the man command.
Additionally the info utility can be used to
view man pages and info docs.
Documentation:
Intel: compiler docs are included in the installation directory under
/usr/local/tools/compilername.
Otherwise, see Intel's web pages.
PGI: compiler docs are included in the installation directory under
/usr/local/tools/pgi. Otherwise, see PGI's web
pages.
Intel Compilers
General Information:
The Intel compilers are specifically designed to take advantage of
Intel chip architectures.
For Opteron clusters, LC supports the Intel compilers with the
caveat that they have not been rigorously tested on this architecture.
Installed under /usr/local/tools/icc* and
/usr/local/tools/ifort*.
There can be several versions of the Intel compilers available.
To see what is currently available on the machine you are using, issue
the use -l compilers command.
To select a particular version,
use the use package-name command.
Full compiler documentation is available in the installation directory.
Complete man pages are also available for each compiler version after you
load the package of choice as described above.
GNU compatibility: Objects created with the Intel C/C++ compiler are
binary compatible with GNU gcc. Also, the Intel C/C++ compiler supports
many of the language extensions of
gcc and g++. See the Intel C++ Compiler User's Guide for details.
Compiler Invocation Commands:
icc        serial/OpenMP C
icpc       serial/OpenMP C++
ifort      serial/OpenMP Fortran 77 and 90
mpiicc     script for C with MPI
mpiicpc    script for C++ with MPI
mpiifort   script for Fortran with MPI
Note: The Intel C and C++ compiler are actually the same compiler - the
Intel C++ compiler product. The invocation command (icc or icpc)
determines how the compiler behaves. Be sure to use the appropriate
invocation command for your source - don't use icc for C++ or icpc
for C.
Common / Useful Options:
A few useful compiler options are shown below; some apply only to C/C++
or only to Fortran (see the compiler man pages). Interested users will
definitely want to consult the relevant Intel documentation.
Option
Description
-I, -L, -l, -lm, -c, -o
As usual
-ansi_alias
-no-ansi_alias
Can help performance. Directs the compiler to assume the
following:
-Arrays are not accessed out of bounds.
-Pointers are not cast to non-pointer types, and vice-versa.
-References to objects of two different scalar types cannot alias.
If your program does not satisfy one of the above conditions, this flag
may lead the compiler to generate incorrect code.
For Fortran, conformance is according to Fortran 95 Standard type
aliasability rules.
C/C++ Default = -no-ansi_alias (off)
Fortran Default = -ansi_alias (on)
-assume keyword -assume buffered_io
Specifies assumptions made by the compiler. One option that may improve
I/O performance is buffered_io, which causes sequential file
I/O to be buffered rather than being written to disk immediately.
See the ifort man page for details.
-auto
-automatic
-nosave
-save
-noauto
-noautomatic
Places variables, except those declared as SAVE, on the run-time stack.
The default is -auto_scalar (local scalar of types INTEGER, REAL,
COMPLEX, or LOGICAL are automatic). However, if you specify -recursive
or -openmp, the default is -auto.
Places variables, except those declared as AUTOMATIC, in static memory.
However, if you specify -recursive or -openmp, the default is -auto.
-autodouble
Defines real variables to be REAL(KIND=8). Same as specifying -r8.
-convert keyword
Specifies the format for unformatted files, such as big endian, little
endian, IBM 370, Cray, etc.
-Dname[=value]
Defines a macro name and associates it with a specified
value. Equivalent to a #define preprocessor directive.
-fast
Shorthand for several combined optimization options:
-O3, -ipo, -static
-fpp
-cpp
Invoke the Fortran preprocessor. -fpp and -cpp are equivalent.
-fpe<n> (Fortran)
Specifies the run-time floating-point exception handling behavior. See the compiler man page for details.
-g
Build with debugging symbols. Note that -g does not imply -O0 in the
Intel compilers; -O0 must be specified explicitly to turn all optimizations
off.
-help
Print compiler options summary
-ip
Enable single-file interprocedural optimizations.
-ipo
Enable multi-file interprocedural optimizations.
-mcmodel=medium
Required if executable is greater than 2 GB. The program is linked in the lower 2 GB of the address space but symbols can be located anywhere in the address space. Programs can be statically or dynamically linked, but building of shared libraries are not supported with the medium model.
-mp
'Maintain precision' - favor conformance to IEEE 754 standards for
floating-point arithmetic.
-mp1
Improve floating-point precision - less speed impact than -mp.
-O0
Turn off optimizer - recommended if using -g for debugging.
-O, -O1, -O2, -O3
Optimization levels. (O,O1,O2 are essentially equivalent). -O3 is
the most aggressive optimization level. Note that Intel compilers
perform optimization by default.
PGI Compilers
General Information:
The PGI C, C++ and Fortran compilers, as well as the PGI debugger
and profiler tools, are available on LC's Linux clusters.
The PGI compilers are placed in your PATH and MANPATH environment
variables by default.
Installed under /usr/local/tools/pgi*.
There can be several versions of the PGI compilers available.
To see what is currently available on the machine you are using, issue
the use -l compilers command.
To select a particular version,
use the use package-name command.
Full PGI documentation is available in the
doc install directory. Complete man pages are
also available for every PGI tool after you load the version of choice
as described above.
The PGI compilers support SSE instructions.
The PGI compilers provide a compiler option for unmasking
floating-point exceptions. See the -Ktrap option in the table below.
Compiler Invocation Commands:
pgcc                  serial/OpenMP C
pgCC                  serial/OpenMP C++
pgf77, pgf90          serial/OpenMP Fortran 77 and 90
mpipgcc               script for C with MPI
mpipgCC               script for C++ with MPI
mpipgf77, mpipgf90    scripts for Fortran 77 and 90 with MPI
Common / Useful Options:
A few useful compiler options are shown below. Interested users will
definitely want to consult the relevant PGI documentation.
Option
Description
-I, -L, -l, -lm, -c, -o
As usual
-fast
Turn on optimizations, including vectorization
-fpic
Instructs the compiler to generate position independent code which
can be used to create shared object files (dynamically linked
libraries).
-help
Display help
-Kieee
Force IEEE 754 arithmetic
-Ktrap=[option]
Unmask selected FPU exceptions. Options include
fp,inv,denorm,divz,ovf,unf,inexact
-lpthread
Link with pthreads library.
-mcmodel=medium
Required if executable is greater than 2 GB. The program is linked in the lower 2 GB of the address space but symbols can be located anywhere in the address space. Programs can be statically or dynamically linked, but building of shared libraries are not supported with the medium model.
-mp
Turn on OpenMP
-Mvect=[prefetch,sse]
Enable prefetch, SSE
-Mlist
Create a listing file
-Minfo=all
Generate an optimization/vectorization report
-Mipa
Enable interprocedural analysis. See man page for details.
-O, -O1, -O2, -O3/O4
Optimization levels. The default is -O1.
-pc32
-pc64
-pc80
Set precision of FPU significand to 32, 64, or 80 bits respectively
-p, -pg
Enable gprof-style sample-based profiling
-V
Display version information
-v
Verbose mode
-w
Suppress warning messages
Compilers
GNU Compilers
General Information:
The GNU C, C++ and Fortran compilers are available on LC's Linux clusters.
The GNU compilers are placed in your PATH and MANPATH environment
variables by default.
Installed under /usr/local/tools/gcc*.
There can be several versions of the GNU compilers available.
To see what is currently available on the machine you are using, issue
the use -l compilers command.
To select a particular version,
use the use package-name command.
The Fortran compilers support all of the C compiler options, plus
options specific to Fortran.
The GNU compilers support SSE instructions. Additionally, the GNU C/C++
compiler supports Intel's AVX instruction set.
LC's GNU compilers on Linux systems are versions provided by RedHat, not vanilla versions available from GNU. They may perform differently or have different features than the vanilla GNU builds. Builds of GNU's versions of these compilers tend not to work on LC's Linux systems because of incompatibilities with binutils.
The GNU compilers offer over 1,200 options! A few useful compiler options are shown below. Interested users will definitely want to consult relevant man page(s) and GNU documentation.
Option
Description
-I, -L, -l, -lm, -c, -o
As usual
-ansi
Adhere to ANSI rules.
-fopenmp
Compile with OpenMP support.
-fmem-report -ftime-report -Q
Display to stdout some statistics (memory allocations, time used, etc.) after the compilation is completed.
-ftree-vectorize
-ftree-vectorizer-verbose=0-6
Turn on vectorization (use with -O1, -O2 or -O3). The verbose option generates a vectorization report - levels 0 (none) to 6 (max info).
-g -gstring
Produce debugging information. string specifies a particular type of debugging information/format
-lacml
Link with the AMD Core Math Library for optimized BLAS, LAPACK and FFT
routines.
-mavx
Utilize the advanced vector extension (AVX) streaming SIMD instructions.
Sandy Bridge processor only. Not available with GNU Fortran.
-mcmodel=medium
Required if executable is greater than 2 GB. The program is linked in the lower 2 GB of the address space but symbols can be located anywhere in the address space. Programs can be statically or dynamically linked, but building of shared libraries are not supported with the medium model.
-mtune=cpu-type -march=cpu-type
Architecture specific tuning options - see man page. Use opteron for LC's Opteron clusters.
-O0, -O, -O1, -O2, -O3
Levels of optimization. -O0 is no optimization (default). -O and -O1 are the same. -O2 and -O3 are increasingly aggressive optimizations.
-p, -pg
Generate extra code to write profile information suitable for the analysis program gprof.
-pthread
Compile with pthreads support.
-v
Verbose mode. Display compiler version, paths, libraries, etc. information.
-Wstring -Wno-string
Turn on/off warning messages, where string describes the message type (see the man page)
Linux Clusters Overview Exercise 1
Getting Started
Overview:
Login to an LC cluster using your workshop username and OTP token
Copy the exercise files to your home directory
Familiarize yourself with the cluster's configuration
MPI
MVAPICH:
MPI-1 implementation that also includes support for MPI-I/O
Based on MPICH-1.2.7 MPI library from Argonne National Laboratory
Not thread-safe. All MPI calls should be made by the master thread in
a multi-threaded MPI program.
See /usr/local/docs/mpi.mvapich.basics for LC
usage details.
MVAPICH2
Multiple versions available
MPI-2 and MPI-3 implementations based on MPICH MPI library from Argonne
National Laboratory. Versions 1.9 and later implement MPI-3 according to the
developer's documentation.
Not currently the default - requires the "use" command to load the
selected dotkit package:
use -l mvapich           (list available packages)
use mvapich2-intel-2.1   (use the package of interest)
Thread-safe
See /usr/local/docs/mpi.mvapich2.basics for LC
usage details.
MPI executables are launched using the SLURM srun command
with the appropriate options. For example, to launch an 8-process MPI job
split across two different nodes in the pdebug pool:
srun -N2 -n8 -ppdebug a.out
The srun command is discussed in detail in the
Running Jobs
section of the Linux Clusters Overview tutorial.
Open MPI:
Be sure to load the same Open MPI dotkit package that you used to
build your executable. If you are running a batch job, you will need
to load the dotkit package in your batch script.
Launching an Open MPI job is done differently
than with MVAPICH MPI - the mpiexec command
is required. For example, to run a 48-process MPI job:
mpiexec -n 48 a.out
LC-developed MPI compiler wrapper scripts are used to compile MPI programs.
They automatically perform some error checks, include the appropriate
MPI #include files, link to the necessary MPI libraries, and pass options to
the underlying compiler.
For MVAPICH2 and Open MPI, you must first load the desired
dotkit package with the use
command. For example:
use -l openmpi          (list available packages)
use openmpi-gnu-1.8.4   (use the package of interest)
Failing to do this will result in getting the MVAPICH 1.2 implementation.
For additional information:
See the man page (if it exists)
Issue the script name with the -help option
View the script yourself directly
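As a quick sanity check of any of these wrapper scripts, a minimal MPI program such as the sketch below can be compiled (for example, mpicc hello.c -o hello) and launched with srun:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */
    printf("Hello from task %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}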
MPI Build Scripts
Implementation   Language   Script Name              Underlying Compiler
MVAPICH 1.2      C          mpicc                    gcc - GNU
                            mpigcc                   gcc - GNU
                            mpiicc                   icc - Intel
                            mpipgcc                  pgcc - PGI
                 C++        mpiCC                    g++ - GNU
                            mpig++                   g++ - GNU
                            mpiicpc                  icpc - Intel
                            mpipgCC                  pgCC - PGI
                 Fortran    mpif77                   g77 - GNU
                            mpigfortran              gfortran - GNU
                            mpiifort                 ifort - Intel
                            mpipgf77                 pgf77 - PGI
                            mpipgf90                 pgf90 - PGI
MVAPICH2         C          mpicc                    C compiler of dotkit package loaded
                 C++        mpicxx                   C++ compiler of dotkit package loaded
                 Fortran    mpif77                   Fortran77 compiler of dotkit package loaded
                            mpif90                   Fortran90 compiler of dotkit package loaded
Open MPI         C          mpicc                    C compiler of dotkit package loaded
                 C++        mpiCC, mpic++, mpicxx    C++ compiler of dotkit package loaded
                 Fortran    mpif77                   Fortran77 compiler of dotkit package loaded
                            mpif90                   Fortran90 compiler of dotkit package loaded
Level of Thread Support
MPI libraries vary in their level of thread support:
MPI_THREAD_SINGLE - Level 0: Only one thread will execute.
MPI_THREAD_FUNNELED - Level 1:
The process may be multi-threaded, but only
the main thread will make MPI calls - all MPI calls are funneled
to the main thread.
MPI_THREAD_SERIALIZED - Level 2:
The process may be multi-threaded, and
multiple threads may make MPI calls, but only one at a time. That is,
calls are not made concurrently from two distinct threads as all MPI calls
are serialized.
MPI_THREAD_MULTIPLE - Level 3:
Multiple threads may call MPI with no restrictions.
A simple C language example for determining thread level support is shown below.
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
int provided, claimed;
/*** Select one of the following
MPI_Init_thread( 0, 0, MPI_THREAD_SINGLE, &provided; );
MPI_Init_thread( 0, 0, MPI_THREAD_FUNNELED, &provided; );
MPI_Init_thread( 0, 0, MPI_THREAD_SERIALIZED, &provided; );
MPI_Init_thread( 0, 0, MPI_THREAD_MULTIPLE, &provided; );
***/
MPI_Init_thread(0, 0, MPI_THREAD_MULTIPLE, &provided; );
MPI_Query_thread( &claimed; );
printf( "Query thread level= %d Init_thread level= %d\n", claimed, provided );
MPI_Finalize();
}
Sample output:
Query thread level= 3 Init_thread level= 3
Running Jobs
Overview
Big Differences:
LC's Linux clusters can be divided into two types: those having a
high speed interconnect and those that don't. There are significant
differences that arise from this.
Clusters with an interconnect (most Linux clusters):
Designated as a parallel resource. A single job can span many nodes.
Compute nodes are NOT shared with other users or jobs. When your job runs,
the allocated nodes are dedicated to your job alone.
Clusters without an interconnect (aztec, inca, rzcereal):
Designated for serial and single-node parallel jobs only. A job cannot
span more than one node.
Multiple users and their jobs can run on the same node simultaneously,
which can result in competition for resources such as CPU and memory.
Usage differences between these two different types of clusters will
be noted as relevant in the remainder of this tutorial.
Job Limits:
For all production clusters, there are defined job limits which vary from
cluster to cluster. The primary job limits apply to:
How many nodes/CPUs a job may use
How long a job may run
How many jobs may be actively run simultaneously per user
How many nodes a user may use across all running jobs
Most job limits are enforced by the batch system, such as Moab or SLURM.
Some job limits are enforced by a "good neighbor" policy
The easiest way to determine the job limits for a particular machine is to login
to that machine and use the command:
news job.lim.machinename
where machinename is the name of the machine you are logged
into.
Job limits are also documented on LC web pages:
On the LC "Machine Status" web pages (LC internal), click on the machine
name of interest and then look for the "Batch Limits" link/section.
Or go to mylc.llnl.gov and click on any machine name in the
"My Accounts" section.
Further discussion, and a summary table of job limits for all production
machines are available in the Queue Limits section of the Moab tutorial.
Running Jobs
Batch Versus Interactive
Interactive Jobs:
Typically run in the pdebug queue. Some clusters
may have additional queues permitting interactive jobs.
Interactive jobs have lower limits:
Shorter time limit
Fewer nodes permitted
Should NOT be run on login nodes
Can be launched from the command line using the srun
command (covered later), or submitted as a job script to the
pdebug queue using the msub command (below).
May also be submitted to the pbatch queue
using the mxterm command.
Batch Jobs:
Most LC production Linux clusters use the Tri-lab Moab workload scheduler as
the top-level batch system, with SLURM underneath as the local resource
manager.
This section only provides a quick summary of batch usage on
Linux clusters. For details, see the Moab Tutorial.
Batch jobs are typically run in the larger pbatch partition.
Some clusters may have additional queues permitting batch jobs.
All batch jobs must be submitted in the form of a job control script
with the msub command. Examples:
msub myjobscript
msub myjobscript -q ppdebug -a mic
msub myjobscript -l walltime=45:00
Example Moab job submission job control script:
#!/bin/csh
##### Example Moab Job Control Script
##### These lines are for Moab
#MSUB -l nodes=16
#MSUB -l partition=zin
#MSUB -l walltime=2:00:00
#MSUB -A phys
#MSUB -q pbatch
#MSUB -m be
#MSUB -V
#MSUB -o /p/lscratch3/joeuser/par_solve/myjob.out
##### These are shell commands
# Display job information for possible diagnostic use
date
echo "Job id = $SLURM_JOBID"
hostname
sinfo
squeue
# Run info
cd /p/lscratch3/joeuser/par_solve
srun -n256 a.out
echo 'Done'
After successfully submitting a job, you may then check its progress and
interact with it (hold, release, alter, kill) by means of other batch
commands briefly listed below.
Interactive debugging of batch jobs is possible (covered later).
Note that srun options must precede your executable.
Interactive use example, from the login node command line.
Specifies 2 nodes (-N), 16 tasks (-n) and the interactive pdebug partition (-p):
% srun -N2 -n16 -ppdebug myexe
Batch use example requesting 16 nodes and 256 tasks (assumes nodes have 16
cores):
First create a job script that requests nodes via
#MSUB -l nodes and uses srun
to specify the number of tasks and launch the job.
#!/bin/csh
#MSUB -l nodes=16
#MSUB -l partition=zin
#MSUB -l walltime=2:00:00
#MSUB -q pbatch
# Run info and srun job launch
cd /p/lscratch3/joeuser/par_solve
srun -n256 a.out
echo 'Done'
Then submit the job script from the login node command line:
% msub myjobscript
Primary differences between batch and interactive usage:
Where used - Interactive: from the login node command line. Batch: in a
batch script.
Partition - Interactive: requires specification of an interactive
partition, such as pdebug, with the -p flag. Batch: pbatch is the default.
Scheduling - Interactive: if there are available interactive nodes, the
job will run immediately; otherwise it will queue up (FIFO) and wait until
there are enough free nodes to run it. Batch: the batch scheduler handles
when to run your job, regardless of the number of nodes available.
More Examples:
srun -n64 -ppdebug my_app
    64-process job run interactively in the pdebug partition
srun -N64 -n512 my_threaded_app
    512-process job using 64 nodes; assumes pbatch partition
srun -N4 -n32 -c2 my_threaded_app
    4-node, 32-process job with 2 cores (threads) per process; assumes pbatch partition
srun -N8 my_app
    8-node job with a default value of one task per node (8 tasks); assumes pbatch partition
srun -n128 -o my_app.out my_app
    128-process job that redirects stdout to file my_app.out; assumes pbatch partition
srun -n32 -ppdebug -i my.inp my_app
    32-process interactive job; each process accepts input from a file called my.inp instead of stdin
Note the interaction of srun's -N and -n flags. For example, using 4 nodes
in batch (#MSUB -l nodes=4), each of which has 16 cores: srun -n64 places
16 tasks on each node, while srun -N4 -n32 places 8 tasks on each node.
srun options:
srun is a powerful command
with ~100 options affecting a wide range of job parameters.
Many srun options may be set via ~60 SLURM environment
variables. For example, SLURM_NNODES behaves like the -N option.
A short list of common srun options appears below; see the srun man page
for details on options and environment variables.
Option
Description
--auto-affinity=[args]
Specifies how to bind tasks to cpus.
-c [#cpus/task]
The number of CPUs used by each process. Use this option if each
process in your code spawns multiple POSIX or OpenMP threads.
-d
Specify a debug level - integer value between 0 and 5
-i [file]
-o [file]
Redirect input/output to file specified
-I
Allocate CPUs immediately or fail. By default, srun blocks until
resources become available.
-J
Specify a name for the job
-l
Label - prepend task number to lines of stdout/err
-m block|cyclic
Specifies whether to use block (the default) or cyclic distribution
of processes over nodes
--multi-prog config_file
Run a job with different programs, with or without
different arguments, for each task.
Intended for MPMD (multiple program multiple data) model MPI programs.
-n [#processes]
Number of processes that the job requires
-N [#nodes]
Number of nodes on which to run job
-O
Overcommit - srun will refuse to allocate more than one process per CPU
unless this option is also specified
-p [partition]
Specify a partition (cluster) on which to run job
-s
Print usage stats as job exits
-v -vv -vvv
Increasing levels of verbosity
-V
Display version information
Clusters Without an Interconnect - Additional Notes:
The aztec, inca and rzcereal clusters fall into this category.
Don't forget that multiple users and jobs can run on a single node.
For this reason, it is very important that you tell the Moab
scheduler how many cores your job requires:
For batch jobs, be sure to include in your job script:
#MSUB -l ttc=#cores
For interactive jobs, be sure to use the srun -c
flag (described in the srun options table above)
MPI jobs: these are started with the srun command as
described above, but because there is no interconnect, jobs are limited to a
single node and communications are done in shared memory.
Pthreads and OpenMP jobs can be run as usual - keeping in mind the required
number of cores per node, and that a node may be shared with other users.
Interactive srun jobs launched from the command line
should normally be terminated with a SIGINT (CTRL-C):
The first CTRL-C will report the state of the tasks
A second CTRL-C within one second will terminate the tasks
If you started your job with srun's -q option, only
one CTRL-C is required.
WARNING: Do not use kill -9 to stop srun, as
it may not stop the parallel processes it spawned. Because these processes are
running on other nodes, ps -ef
will not show them after you terminate the
srun process, but they will still be there, using resources.
Batch:
For batch jobs, the mjobctl -c
and canceljob commands can be used to terminate
batch jobs. See the Moab tutorial for more information and examples.
Only a brief summary is provided here, and most commands have a number of
options. Please see the
Monitoring Jobs
section in the Moab tutorial for details and for additional commands.
showq - A Moab command
that displays all running, idle and blocked jobs across all
machines in a Moab grid. Options exist to filter a subset of these jobs. The
example below shows only running jobs on the cab cluster.
% showq -r -p cab
active jobs------------------------
JOBID S PAR EFFIC XFACTOR Q USERNAME ACCNT MHOST NODES REMAINING STARTTIME
892013 R cab 0.00 6.0 no x2wwwta1 hph cab167 4 00:00:16 Wed May 28 17:23:38
892014 R cab 0.00 6.0 no x2wwwta1 hph cab320 4 00:07:58 Wed May 28 17:31:20
892016 R cab 0.00 6.0 no deer456x hph cab225 4 00:07:58 Wed May 28 17:31:20
892007 R cab 0.00 6.0 no uu45r99b hph cab159 4 00:07:58 Wed May 28 17:31:20
...
[ output omitted here ]
...
897911 R cab 0.00 1.0 no g99y58 axvnv cab273 4 15:13:50 Thu May 29 08:37:12
897914 R cab 0.00 1.0 no g99y58 axvnv cab426 4 15:19:04 Thu May 29 08:42:26
897934 R cab 0.00 1.0 no l8nk6 zpinch cab56 13 15:20:07 Thu May 29 08:43:29
897935 R cab 0.00 1.0 no l8nk6 zpinch cab68 13 15:38:34 Thu May 29 09:01:56
897921 R cab 0.00 1.0 no be432min aprod cab135 16 15:47:10 Thu May 29 09:10:32
100 active jobs 19088 of 19648 processors in use by local jobs (97.15%)
1225 of 1228 nodes active (99.76%)
Total jobs: 100
checkjob - A Moab
command that displays detailed job state information and diagnostic output for a
selected job.
ju - An LC command
that shows partitions and job usage in a concise format. In the example below,
some output has been truncated due to its width.
% ju
Partition total down used avail cap Jobs
pdebug 64 0 55 9 86% bbnliu-16, pmorris-1, rreed-8....
pbatch 1048 5 818 225 78% ggk-1, vo4-320, griinman-25, ...
mjstat -
An LC command showing a cluster's running job queue. For example:
% mjstat
Scheduling pool data:
-----------------------------------------------------------
Pool Memory Cpus Total Usable Free Other Traits
-----------------------------------------------------------
pdebug 2000Mb 8 16 16 7 (null)
pbatch* 2000Mb 8 255 254 16 (null)
Running job data:
-------------------------------------------------------------------
JobID User Nodes Pool Status Used Master/Other
-------------------------------------------------------------------
15936 a3brg2 8 pbatch CG 6:00:42 sierra44
15662 turh 1 pbatch CG 11:48:06 sierra24
16316 g6asco 8 pbatch PD 0:00 (JobHeld)
14353 r4dd 16 pbatch PD 0:00 (JobHeld)
15922 turh 8 pbatch R 4:49:47 sierra57
...
[ output omitted here ]
...
164331 je44g 2 pdebug R 18:20 sierra8
164332 je44g 2 pdebug R 15:57 sierra16
sinfo - SLURM command for
displaying configuration and usage information. The "infinite" time limit for
pbatch is set because Moab is used to control the actual job time limit,
which may vary.
% sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
pdebug up 30:00 9 alloc cab[11-19]
pdebug up 30:00 7 idle cab[4-10]
pbatch* up infinite 7 drain* cab[21,154,793,845,856,858,862]
pbatch* up infinite 1 drain cab340
pbatch* up infinite 792 alloc cab[20,22-153,155-275,300-339,341-551,556-575,588-792,794-839,844,
846-855,857,859-861,863]
squeue - SLURM command
for displaying information about running jobs. In the example below, some output
has been truncated due to its width.
% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
31250 pbatch RSRMV_3. rfffdder R 9:16:52 32 sierra[173-181...
30677 pbatch relax_Va a4ebb2 R 9:16:38 40 sierra[20-27,58-65...
...
[ output omitted here ]
...
30724 pbatch 6size.0. bccce R 4:29:33 64 sierra[28-38,87-98,...
29170 pbatch ttest5a. re5uuh R 4:23:02 16 sierra[66-81]
Running Jobs
Optimizing CPU Usage
Clusters with an Interconnect:
Fully utilizing the cores on a node requires that you use the right
combination of srun and Moab options, depending upon
what you want to do and which type of machine you are using.
MPI only: for example, if you are running on a cluster that has
16 cores per node, and you want your job to use all 16 cores on 4 nodes (16 MPI
tasks per node), then you would do something like:
Interactive:
srun -n64 -ppdebug a.out
Moab batch:
#MSUB -l nodes=4
srun -n64 a.out
MPI with Threads:
If your MPI job uses POSIX or OpenMP threads within each node, you will
need to calculate how many cores will be required in addition to the
number of tasks.
For example, running on a cluster having 16 cores per node, an 8-task
job where each task creates 4 OpenMP threads, would need a total of 32 cores, or
2 nodes: 8 tasks * 4 threads / 16 cores/node = 2 nodes
Interactive:
srun -N2 -n8 -ppdebug a.out
Moab batch:
#MSUB -l nodes=2
srun -N2 -n8 a.out
Don't forget that the default MPI on LC's Linux clusters is not
thread-safe, so the master thread must perform all MPI calls.
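A minimal sketch of this "funneled" pattern (the reduction computed here is illustrative): threads work inside the OpenMP parallel region, and only the master thread, outside that region, makes MPI calls.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, provided;
    double local = 0.0, total = 0.0;

    /* Request funneled support: only the main thread will call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP threads compute in parallel; no MPI calls in this region. */
    #pragma omp parallel reduction(+:local)
    local += omp_get_thread_num() + 1.0;   /* stand-in for real work */

    /* Back on the master thread only: combine results across tasks. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}
This could be built with, for example, mpicc -fopenmp hybrid.c (the OpenMP flag varies by compiler) and launched with srun as shown above.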
You can include multiple srun commands within your batch
job command script. For example, suppose that you were conducting a
scalability run on a 16-core/node Linux cluster.
You could allocate the maximum number of nodes
that you would use with #MSUB -l nodes= and then have a
series of srun commands that use varying numbers of nodes:
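For example, a sketch (node and task counts are illustrative, assuming 16 cores per node):
#MSUB -l nodes=8
srun -N1 -n16 a.out
srun -N2 -n32 a.out
srun -N4 -n64 a.out
srun -N8 -n128 a.out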
The issues for CPU utilization are different for clusters without a
switch for several reasons:
Jobs can be serial
Nodes may be shared with other users
Heavy utilization
Jobs are limited to one node
On these systems, the more important issue is overutilization of cores
rather than underutilization.
IMPORTANT: For batch jobs, you need to specify how many cores your
job will require. The way to do this is with the Moab ttc= option:
#MSUB -l ttc=#cpus
This allows Moab to effectively schedule the node
for other jobs if you don't use all of a node's cores.
This is particularly important for MPI jobs and threaded codes.
Failure to do this can/will result in:
The batch scheduler assuming by default that your job requires only
one core.
Other jobs being scheduled on the node you're using because the
scheduler thinks your job only needs one core.
Poor performance and jobs taking longer to run than expected because
of oversubscription. Jobs may be killed because they will take longer
than the time limit specified.
Upsetting other users because you've oversubscribed the node.
Wasting the time and effort of LC support staff because of upset
users calling about unexpected job behavior.
Important notes for Aztec and Inca users: Please see
"Running on the Aztec
and Inca Clusters" section of the Moab tutorial for important
information about running MPI jobs on these clusters.
Running Jobs
Memory Considerations
64-bit Architecture Memory Limit:
Because LC's Linux clusters employ a 64-bit architecture, 16 exabytes of
memory can be addressed - which is about 4 billion times more than 4 GB limit
of 32-bit architectures. By current standards, this is virtually unlimited
memory.
In reality, machines are usually configured with only GBs of memory, so any
address access that exceeds physical memory will result (on most systems)
in paging and degraded performance.
However, LC machines are an exception to this because they have no
local disk - see below.
LC's Diskless Linux Clusters:
LC's Linux clusters are configured with diskless compute nodes. This has
very important implications for programs that exceed physical memory. For
example, most compute nodes have 16-64 GB of physical memory.
Because compute nodes don't have disks, there is no virtual (swap) memory,
which means there is no paging. Programs that exceed physical memory will
terminate with an OOM (out of memory) error and/or segmentation fault.
Compiler, Shell, Pthreads and OpenMP Limits:
Compiler data structure limits are in effect, but may be handled
differently by different compilers.
Shell stack limits: most are set to "unlimited" by default.
Pthreads stack limits apply, and may differ between compilers.
OpenMP stack limits apply, and may differ between compilers.
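As a sketch, thread stack sizes might be raised as follows (the value
shown is illustrative only; OMP_STACKSIZE is the portable OpenMP 3.0
name, and some compilers also honor their own variables, such as Intel's
KMP_STACKSIZE):

    ##### csh/tcsh: remove the shell's own stack limit
    limit stacksize unlimited
    ##### Request a 512 MB stack for each OpenMP thread
    setenv OMP_STACKSIZE 512M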
MPI Memory Use:
All MPI implementations require additional memory use. This varies
between MPI implementations and between versions of any given
implementation.
The amount of memory used increases with the number of MPI tasks.
Determining how much memory the MPI library uses can be accomplished
by using various tools, such as TotalView's Memscape feature.
Large Static Data:
If your executable contains a large amount of static data, LC recommends
that you compile with the -mcmodel=medium option.
This flag allows the Data and .BSS sections of your executable to extend
beyond a default 2 GB limit.
True for all compilers (see the respective compiler man page).
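For example, a hypothetical Fortran code with large static arrays might
be compiled as shown below (with the Intel compilers, -mcmodel=medium
also requires linking against the shared Intel libraries via
-shared-intel; check your compiler's man page for specifics):

    ifort -mcmodel=medium -shared-intel bigdata.f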
Opterons Only: NUMA and Infiniband Adapter Considerations:
Each Opteron node has 4 Socket F processors - either dual-core or quad-core.
Each Socket F is directly connected to its own local DIMMs with a bandwidth
of 10.7 GB/s.
Shared memory access to the DIMMs of other sockets occurs over the
HyperTransport links at 8 GB/s (4 GB/s each direction).
To improve performance, LC "pins" MPI tasks to a specific CPU. This
promotes memory locality and reduction of shared memory accesses
across multiple HyperTransport links to DIMMs on other sockets.
Only 1 socket is directly connected to the Infiniband adapter. This
means that communications from other sockets need to traverse more than
one HyperTransport link to reach the IB adapter, with the possibility of
degraded bandwidth - unofficial tests suggest up to a 30% reduction.
Running Jobs
Vectorization and Hyper-threading
Vectorization:
All of LC's Linux clusters and their compilers (Intel, PGI, GNU)
support Intel's Streaming SIMD Extensions (SSE, SSE2, SSE3...)
vector instructions.
The primary purpose of these instructions is to increase CPU
throughput by performing operations on vectors of data elements,
rather than on single data elements. For example:
Sandy Bridge-EP (TLCC2) processors, and later models, have 256-bit vector
registers designed to take advantage of Intel's Advanced Vector
Extension (AVX) instructions.
AVX instructions operate on 8 single precision or
4 double precision operands.
The vector registers on older processors are 128-bits, and can hold four
single precision, or two double precision operands.
To take advantage of the potential performance improvements offered
by vectorization, all you need to do is compile with the appropriate
compiler flags. Some recommendations are shown in the table below.
Compiler   Vector Flag       AVX Flag              Reporting
Intel      -vec (default)    -avx                  -opt_report -vec_report
PGI        -fast             n/a                   -Minfo=all
GNU        -O3               -mavx (C/C++ only)    -ftree-vectorizer-verbose=1
Vectorization is performed on eligible loops. Note that
not all loops can be vectorized. A number of factors can prevent the
compiler from vectorizing a loop, such as:
Function calls
I/O operations
GOTOs in or out of the loop
Recurrences
Data dependence, such as needing a value from a previous
loop iteration
Complex coding (difficult loop analysis)
To view/confirm that a loop has been vectorized, you can
generate an assembler file and look at the instructions used.
For example:
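A sketch of how this might be done (myloop.c is a hypothetical source
file):

    ##### Generate an assembler file (myloop.s) instead of an object file
    icc -O2 -S myloop.c
    ##### Look for packed vector instructions, e.g. mulpd/addpd for SSE2
    ##### or the "v"-prefixed forms (vmulpd, vaddpd) for AVX
    grep -E 'mulpd|addpd|vmulpd|vaddpd' myloop.s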
On Intel processors, hyper-threading enables 2 hardware threads per
core. Hyper-threading is Intel's terminology for a more general technique
known as simultaneous multi-threading (SMT), which enables more than one
hardware thread to run concurrently on one core.
LC's Linux clusters have Intel's hyper-threading feature enabled in the
hardware, but it is turned "off" by default. To use this feature, you
must explicitly turn hyper-threading on using the SLURM
--enable-hyperthreads option of the srun command.
Example: on a 16-core node, check to see if hyper-threading is
turned on - it isn't by default, so only 16 cores are shown. Then
execute the same command using srun's
--enable-hyperthreads flag. Now 32 "cores" appear.
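A sketch of what this check might look like (the exact commands of the
original example are not shown; pdebug partition assumed):

    ##### Count the logical CPUs visible to a job step - 16 are reported
    srun -N1 -ppdebug grep -c "^processor" /proc/cpuinfo
    ##### Repeat with hyper-threads enabled - now 32 are reported
    srun -N1 -ppdebug --enable-hyperthreads grep -c "^processor" /proc/cpuinfo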
Note: If using mxterm to acquire your job's nodes, you
will need to enable hyperthreads by using the
srun --enable-hyperthreads flag when you launch your job
in the mxterm window.
Hyper-threading benefits some codes more than others. Tests performed
on some LC codes (pF3D, IMC, Ares) showed improvements in the 10-30% range.
Your mileage may vary.
mpifind - reports on how processes and threads are bound to cores.
Documentation at https://lc.llnl.gov/confluence/display/TLCC2/mpifind
(requires authentication).
Example output (not reproduced here) for a 4-process job, each process
with 4 threads, compares the bindings with OMP_PROC_BIND unset versus
OMP_PROC_BIND=TRUE.
Debuggers
This section only touches on selected highlights. For more information,
users will need to consult the relevant documentation mentioned below.
Also, please consult the "Supported Software and Computing Tools" web
page located at
computing.llnl.gov/code/content/software_tools.php.
TotalView:
TotalView is probably the most widely used debugger for parallel programs.
It can be used with C/C++ and Fortran programs and supports all common
forms of parallelism, including pthreads, OpenMP, MPI, accelerators and
GPUs.
Starting TotalView for serial codes: simply issue the command:
totalview myprog
Starting TotalView for interactive parallel jobs:
Some special command line options are required to run a parallel job
through TotalView under SLURM. You need to run srun under
TotalView,
and then specify the -a flag followed by 1) srun options, 2) your
program, and 3) your program flags (in that order). The general
syntax is:
totalview srun -a -n #processes -p pdebug myprog [prog args]
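For instance, to debug a hypothetical 16-task job in the pdebug
partition (myprog is an illustrative program name):

    totalview srun -a -n16 -ppdebug myprog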
To debug an already running interactive parallel job, simply issue the
totalview command and then attach
to the srun process that started the job.
Debugging batch jobs is covered in LC's
TotalView tutorial and in the "Debugging in Batch"
section below.
DDT stands for "Distributed Debugging Tool", a product of Allinea Software
Ltd.
DDT is a comprehensive graphical debugger designed specifically for
debugging complex parallel codes. It is supported on a variety of
platforms for C/C++ and Fortran, and can be used to debug multi-process
MPI programs and multi-threaded programs, including OpenMP.
Currently, LC has a limited number of fixed and floating licenses for OCF and
SCF Linux machines.
Local copies of documentation under /usr/local/docs/ddt
STAT - Stack Trace Analysis Tool:
The Stack Trace Analysis Tool gathers and merges stack traces from a
parallel application's processes.
STAT is particularly useful for debugging hung programs.
It produces call graphs: 2D spatial and 3D spatial-temporal.
The 2D spatial call prefix tree represents a single snapshot of the
entire application.
The 3D spatial-temporal call prefix tree represents a series of
snapshots from the application taken over time.
In these graphs, the nodes are labeled by function names. The directed
edges, showing the calling sequence from caller to callee, are labeled by
the set of tasks that follow that call path. Nodes that are visited by the
same set of tasks are assigned the same color, giving a visual reference to
the various equivalence classes.
This tool should be in your default path as:
/usr/local/bin/stat-gui - GUI
/usr/local/bin/stat-cl - command line
/usr/local/bin/stat-view - viewer for DOT format
output files
Several other common debuggers are available on LC Linux clusters, though
they are not recommended for parallel programs when compared to TotalView
and DDT.
DDD: GNU DDD debugger is a graphical front-end for command-line debuggers
such as GDB, DBX, WDB, Ladebug, JDB, XDB, the Perl debugger, the bash
debugger, or the Python debugger. Documentation:
http://www.gnu.org/software/ddd
Debugging in Batch: mxterm:
Debugging batch parallel jobs on LC production clusters
is fairly straightforward. The main idea is to submit a batch
job that gets your partition allocated and running.
Once you have your partition, you can log in to any of the nodes within
it and then start running as though you're in the interactive pdebug
partition.
For convenience, LC has developed a utility called
mxterm which makes the process even easier.
How to use mxterm:
If you are on a Windows PC, start your X11 application (such as
X-Win32)
Make sure you enable X11 tunneling for your ssh session
ssh and login to your cluster
Issue the command as follows:
mxterm #nodes #tasks #minutes
Where:
#nodes = number of nodes your job requires
#tasks = number of tasks your job requires
#minutes = how long you need to keep your partition for debugging
This will submit a batch job for you that will open an xterm when it
starts to run.
After the xterm appears, cd to the directory with
your source code and begin your debug session.
This utility does not have a man page; however, you can view the usage
information by simply typing the name of the command.
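For example, to request 2 nodes and 32 tasks for 30 minutes (the values
are illustrative only):

    mxterm 2 32 30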
Core Files:
Your shell's core file size setting may limit the size of a core file
so that it is inadequate for debugging, especially with TotalView.
To check your shell's limit settings, use either the
limit (csh/tcsh) or
ulimit -a (sh/ksh/bash) command. For example:
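A sketch of the two commands (output varies by shell and system):

    ##### csh/tcsh: show the current core file size limit
    limit coredumpsize
    ##### sh/ksh/bash: show all limits, including core file size
    ulimit -a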
To override your default core file size setting, use one of the following
commands:
csh/tcsh:      unlimit   -or-   limit coredumpsize 64
sh/ksh/bash:   ulimit -c 64
Some users with many-process jobs actually don't want core files, or
only want small core files, because normal core files can fill up their
disk space. The limit (csh/tcsh) or ulimit -c (sh/ksh/bash) commands can
be used as shown above to set smaller or zero sizes.
A Few Additional Useful Debugging Hints:
Add the sinfo and
squeue commands to your
batch scripts to assist in diagnosing problems. In particular, these
commands will document which nodes your job is using.
Also add the -l option to your
srun command so that
output statements are prepended with the task number that created them.
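A sketch of how these hints might appear in a batch script (a.out and
the node/task counts are hypothetical):

    #!/bin/csh
    #MSUB -l nodes=2
    ##### Document the state of the nodes and queue for later diagnosis
    sinfo
    squeue
    ##### Prefix each output line with the task number that wrote it
    srun -l -n32 a.out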
Be sure to check the exit status of all I/O operations when reading
or writing files in Lustre. This will allow you to detect any I/O problems
with the underlying OST servers.
If you know/suspect that there are problems with particular nodes,
you can use the srun -x option to skip these nodes.
For example:
srun -N12 -x "cab40 cab41" -ppdebug myjob
Tools
We Need a Book!
The subject of "Tools" for Linux cluster applications is far too broad and
deep to cover here. Instead, a few pointers are being provided for those
who are interested in further research.
The first place to check is LC's "Supported Software and Computing Tools"
web pages at:
computing.llnl.gov/code/content/software_tools.php
for what may be available here. Some example tools are listed below.
Memory Correctness Tools:
Memcheck
Valgrind's Memcheck tool detects a comprehensive set of memory errors, including reads and writes of unallocated or freed memory and memory leaks.
TotalView
Allows you to stop execution when heap API problems occur, list memory leaks, paint allocated and deallocated blocks, identify dangling pointers, hold onto deallocated memory, graphically browse the heap, identify the source line and stack backtrace of an allocation or deallocation, summarize heap use by routine, filter and dump heap information, and review memory usage by process or by library.
memP
The memP tool provides heap profiling through the generation of two reports: a summary of the heap high-water-mark across all processes in a parallel job as well as a detailed task-specific report that provides a snapshot of the heap memory currently in use, including the amount allocated at specific call sites.
Intel Inspector
Primarily a thread correctness tool, but memory debugging features are included.
Profiling, Tracing and Performance Analysis:
Open|SpeedShop
Open|SpeedShop is a comprehensive performance tool set with a unified look and feel that covers most important performance analysis steps. It offers various different interfaces, including a flexible GUI, a scripting interface, and a Python class. Supported experiments include profiling using PC sampling, inclusive and exclusive execution time analysis, hardware counter support, as well as MPI, I/O, and floating point exception tracing. All analysis is applied on unmodified binaries and can be used on codes with MPI and/or thread parallelism.
TAU
TAU is a robust profiling and tracing tool from the University of Oregon that includes support for MPI and OpenMP. TAU provides an instrumentation API, but source code can also be automatically instrumented and there is support for dynamic instrumentation as well. TAU is generally viewed as having a steep learning curve, but experienced users have applied the tool with good results at LLNL. TAU can be configured with many feature combinations. If the features you are interested in are not available in the public installation, please request the appropriate configuration through the hotline. TAU developer response is excellent, so if you are encountering a problem with TAU, there is a good chance it can be quickly addressed.
HPCToolkit
HPCToolkit is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the largest supercomputers. It uses low overhead statistical sampling of timers and hardware performance counters to collect accurate measurements of a program's work, resource consumption, and inefficiency and attributes them to the full calling context in which they occur.
HPCToolkit works with C/C++/Fortran applications that are either statically or dynamically linked. It supports measurement and analysis of serial codes, threaded codes (pthreads, OpenMP), MPI, and hybrid (MPI + threads) parallel codes.
mpiP
A lightweight MPI profiling library that provides time spent in MPI functions by callsite and stacktrace. This tool is developed and maintained at LLNL, so support and modifications can be quickly addressed. New run-time functionality can be used to generate mpiP data without relinking through the srun-mpip and poe-mpip scripts on Linux and AIX systems.
gprof
Displays call graph profile data. The gprof command is useful in identifying how a program consumes CPU resources. Gprof does simple function profiling and requires that the code be built and linked with -pg. For parallel programs, in order to get a unique output file for each process, you will need to set the undocumented environment variable GMON_OUT_PREFIX to some non-null string. For example:
setenv GMON_OUT_PREFIX 'gmon.out.'`/bin/uname -n`
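A sketch of typical gprof usage for a parallel code (myprog.c, the MPI
compiler wrapper, and the task count are illustrative only):

    ##### Build and link with profiling enabled
    mpicc -pg -o myprog myprog.c
    ##### Give each process a unique profile file; glibc appends the PID
    setenv GMON_OUT_PREFIX 'gmon.out.'`/bin/uname -n`
    srun -n4 myprog
    ##### Analyze the resulting profile files (gprof sums multiple files)
    gprof myprog gmon.out.*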
pgprof
PGI profiler - pgprof is a tool which analyzes data generated during execution of specially compiled programs. This information can be used to identify which portions of a program will benefit the most from performance tuning.
PAPI
Portable hardware performance counter library.
PapiEx
A PAPI-based performance profiler that measures hardware performance events of an application without having to instrument the application.
VTune Amplifier
The Intel VTune Amplifier tool is a performance analysis tool for finding hotspots in serial and multithreaded codes. Note the installation on LC machines does not include the advanced hardware analysis capabilities.
Intel Profiler
The Intel Profiler tool is built into the Intel compiler along with a simple GUI to display the collected results.
Vampir / Vampirtrace
Full featured trace file visualizer and library for generating trace files for parallel programs.
Beyond LC:
Beyond LC, the web offers endless opportunities for discovering tools
that aren't available here.
In many cases, users can install tools in their own directories if LC doesn't
have/support them.
Linux Clusters Overview Exercise 2
Parallel Programs and More
Overview:
Login to an LC workshop cluster, if you are not already logged in
Build and run parallel MPI programs
Build and run parallel Pthreads programs
Build and run parallel OpenMP programs
Build and run parallel benchmarks
Build and run an MPI message passing bandwidth test
Photos/Graphics: Have been created by the author,
created by other LLNL employees, obtained from non-copyrighted sources,
or used with the permission of authors from other presentations and
web pages.