This tutorial is intended to be an introduction to using LC's Linux
clusters. It begins by providing a brief historical background of Linux clusters
at LC, noting their success and adoption as a production, high performance
computing platform. The primary hardware components of LC's Linux clusters
are then presented, including the various types of nodes, processors and switch interconnects. The detailed hardware configuration for each of LC's production Linux clusters completes the hardware related information.
After covering the hardware related topics, software topics are discussed,
including the LC development environment, compilers, and how to run both batch and interactive parallel jobs. Important issues in each of these areas are
noted. Available debuggers and performance related tools/topics are briefly discussed, however detailed usage is beyond the scope of this tutorial.
A lab exercise using one of LC's Linux clusters follows the presentation.
Level/Prerequisites: This tutorial is intended for those who are new to developing parallel programs in LC's Linux cluster environment. A basic understanding of parallel programming in C or Fortran is required. The material covered by the following tutorials would also be helpful:
EC3501: Livermore Computing Resources and Environment
EC4045: Moab and SLURM
Background of Linux Clusters at LLNL
The Linux Project:
LLNL first began experimenting with Linux clusters in 1999-2000 in a
partnership with Compaq and Quadrics to port Quadrics software to
Alpha Linux.
The Linux Project was started for several reasons:
Cost: price-performance analysis demonstrated that near-commodity
hardware in clusters running Linux could be more cost-effective
than proprietary solutions;
Focus: the decreasing importance of high-performance computing
(HPC) relative to commodity purchases was making it more difficult to
convince proprietary systems vendors to implement HPC specific
solutions;
Control: it was believed that by controlling the OS in-house,
Livermore Computing could better support its customers;
Community: the platform created could be leveraged by the general
HPC community.
The objective of this effort was to apply LC's scalable systems strategy
(the "Livermore Model") to commodity hardware running the open source
Linux OS:
Based on SMP compute nodes attached to a high-speed, low-latency
interconnect.
Uses OpenMP to exploit SMP parallelism within a node and MPI
to exploit parallelism between nodes.
Provides a parallel file system with a POSIX interface.
Application toolset: C, C++ and Fortran compilers, scalable
MPI/OpenMP GUI debugger, performance analysis tools.
The first Linux cluster implemented by LC was LX, a Compaq Alpha Linux
system with no high-speed interconnect.
The first production Alpha cluster targeted to implement the full
Livermore Model was Furnace, a 64-node system with two EV68
processors per node and a QsNet interconnect. However...
Compaq announced the eventual discontinuation of the Alpha server line.
The Intel Pentium 4, with favorable SPECfp performance, was released just
as Furnace was delivered.
This prompted Livermore to shift to an Intel IA32-based model for its Linux
systems in July 2001.
Furnace's interconnect was allocated to the IA32-based PCR clusters (below)
instead. Furnace then operated as a loosely coupled cluster until it was
decommissioned in October 2003.
PCR Clusters:
August 2001: The Parallel Capacity Resource (PCR) clusters were purchased
from Silicon Graphics and Linux NetworX. Consisted of:
Adelie: 128-node production cluster
Emperor: 88-node production cluster
Dev: 26-node development cluster
Each PCR compute node had two 1.7-GHz Intel Pentium 4 CPUs; the clusters
used a QsNet Elan3 interconnect.
Parallel file system was not implemented at that time - instead
dedicated BlueArc NFS servers were used.
SCF resource only
In July 2002, the 16-node Pengra cluster was procured for the OCF to
provide a less restrictive development environment for PCR-related work.
The success of the PCR clusters was followed by the purchase of the
Multiprogrammatic Capability Resource (MCR) cluster in July, 2002 from
Linux NetworX.
1,152-node cluster comprised of dual-processor, 2.4 GHz Intel Xeon nodes
MCR's procurement was intended to significantly increase the resources
available to Multiprogrammatic and Institutional Computing (M&IC) users.
MCR's configuration included the first production implementation of the
Lustre parallel file system, an integral part of the "Livermore Model".
Debuted as #5 on the Top500 Supercomputers list in November, 2002, and then
peaked at #3 in June, 2003.
Convinced of the success of this path, LC implemented several
other IA-32 Linux clusters simultaneously with, or after, the MCR Linux
cluster:
System   Network   Nodes   CPUs/Cores   Gflops
ALC      OCF         960        1,920    9,216
LILAC    SCF         768        1,536    9,186
ACE      SCF         160          320    1,792
SPHERE   OCF          96          192    1,075
GVIZ     SCF          64          128      717
ILX      OCF          67          134      678
PVC      OCF          64          128      614
Which Led To Thunder...
In September, 2003 the RFP for LC's first IA-64 cluster was released.
The proposal from California Digital Corporation, a small local company,
was accepted.
1024 node system comprised of 4-CPU Itanium 2 "Madison Tiger4" nodes
Thunder debuted as #2 on the Top500 Supercomputers list in June, 2004.
Peak performance was 22.9 Tflops.
In early 2006, LC launched its Opteron/Infiniband Linux cluster procurement
with the release of the Peloton RFP.
Appro (www.appro.com) was awarded the contract in June 2006.
Peloton clusters were built in 5.5 Tflop "scalable units" (SU) of ~144 nodes
All Peloton clusters used AMD dual-core Socket F Opterons:
8 cpus per node
2.4 GHz clock
Option to upgrade to 4-core Opteron "Deerhound" later (not taken)
The six Peloton systems represented a mix of resources: OCF, SCF,
ASC, M&IC, Capability and Capacity:
System   Network   Nodes   CPUs/Cores   Tflops
Atlas    OCF       1,152        9,216     44.2
Minos    SCF         864        6,912     33.2
Rhea     SCF         576        4,608     22.1
Zeus     OCF         288        2,304     11.1
Yana     OCF          83          640      3.1
Hopi     SCF          76          608      2.9
The last Peloton clusters were retired in June 2012.
[Photo: Atlas Peloton cluster]
And Then, TLCC and TLCC2:
In July, 2007 the Tri-laboratory Linux Capacity Cluster (TLCC) RFP was
released.
The TLCC procurement represented the first time that the Department of
Energy/National Nuclear Security Administration (DOE/NNSA) awarded a
single purchase contract covering all three national defense
laboratories: Los Alamos, Sandia, and Livermore.
The TLCC architecture is very similar to the Peloton architecture: Opteron
multi-core processors with an Infiniband interconnect. The primary
difference is that TLCC clusters are quad-core instead of dual-core.
TLCC clusters were/are:
System   Network   Nodes   CPUs/Cores   Tflops
Juno     SCF       1,152       18,432    162.2
Hera     OCF         864       13,824    127.2
Eos      SCF         288        4,608     40.6
In June 2011, the TLCC2 procurement was announced as a follow-on to the
successful TLCC systems.
The TLCC2 systems are LC's current "capacity" compute platform, consisting of
multiple Intel Xeon E5-2670 (Sandy Bridge EP), QDR Infiniband based clusters:
System   Network   Nodes   CPUs/Cores   Tflops
Zin      SCF       2,916       46,656    961.1
Cab      OCF-CZ    1,296       20,736    426.0
Rzmerl   OCF-RZ      162        2,592     53.9
Pinot    SNSI        162        2,592     53.9
And Now? What Next?
LC has continued to procure Linux clusters for various purposes, which
follow the same basic architecture as the TLCC2 systems:
64-bit, x86 architecture (Intel)
multi-core, multi-socket
QDR Infiniband interconnect
As of June 2015, follow-on TLCC-type procurements are in progress.
They will be called Commodity Technology Systems (CTS), and are
expected to begin arriving in mid-2016.
[Photos: Juno and Eos TLCC clusters; Zin TLCC2 cluster]
Cluster Configurations and Scalable Units
Basic Components:
Currently, LC has several types of production Linux clusters, based on
the AMD Opteron and Intel Xeon processor architectures covered in the
hardware sections of this tutorial.
All of LC's Linux clusters differ in their configuration details; however,
they share the same basic hardware building blocks:
Nodes
Frames / racks
High speed Infiniband interconnect (most clusters)
Other hardware (file systems, management hardware, etc.)
Nodes:
The basic building block of a Linux cluster is the node. A node is
essentially an independent computer. Key features:
Self-contained, diskless, multi-core computer.
Low form factor - cluster nodes are very thin in order to
save space.
Rack mounted - nodes are mounted compactly in a drawer fashion to
facilitate maintenance and reduce footprint.
Remote Management - There is no keyboard, mouse, monitor or other
device typically used to interact with a computer. All node management
occurs over the network from a "management" node.
Nodes are typically configured into 4 types, according to their function:
Compute nodes - Nodes that run user jobs. The majority of nodes
in a cluster. Compute nodes are typically split into one of
two partitions: batch or interactive/debug.
Login nodes - One or more per cluster. This is where you login
for access to the compute nodes. Login nodes are also used
to build your applications and control cluster jobs.
Gateway (I/O) nodes - These nodes are dedicated
fileservers. They connect the compute nodes to essential file systems
which are mounted on disk storage devices, such as Lustre OSTs.
The number of these nodes varies per cluster.
Administrative/management nodes - Used by system administrators to
manage the entire cluster. Not accessible to users.
[Photos: single compute node - TLCC2; single compute node - TLCC]
Frames / Racks:
Frames are the physical cabinets that hold most of a cluster's components:
Nodes of various types
Switch components
Other network components
Parallel file system disk resources (usually in separate racks)
Vary in size/appearance between the different Linux clusters at LC.
Power and console management - frames include hardware and software that
allow system administrators to perform most tasks remotely.
[Photos: frames - TLCC2; frames - TLCC]
Scalable Unit:
The basic building block of LC's production Linux clusters is called a
"Scalable Unit" (SU). An SU consists of:
Nodes (compute, login, management, gateway)
First stage switches that connect to each node directly
Miscellaneous management hardware
Frames sufficient to house all of the hardware
Additionally, a second stage switch is needed
for every 2 SUs in a multi-SU cluster.
The number of nodes in an SU depends upon the type of first and
second stage switches being used.
[Photos: Xeon 5600 node with motherboard; Xeon 5500/5600 motherboard diagram; Sandy Bridge EP node with motherboard]
AMD Opteron Hardware Overview
AMD Opteron Processor
Basics:
The AMD Opteron debuted in April, 2003 as the first 64-bit architecture
compatible with the industry-standard x86 instruction set.
Built on AMD64 technology (formerly code named "Hammer") which extends full
backward compatibility for 32-bit x86 software. Also enables
64-bit computing for non-x86 applications.
Initial offering was a single-core processor designed with multi-core
in mind. Dual-core processor became available in April, 2005 and a
quad-core processor in 2007.
Multi-core Opterons are used to build 2-, 4-, 8- and 16-way SMPs.
Target market: multi-processor servers and workstations.
AMD also manufactures multi-core 64-bit processors (e.g., Athlon) for
other markets.
Most of LC's Opteron based clusters are built from quad-core
processors.
Because Opterons are x86 based, they are "little endian" like the
Intel IA32 processor.
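A quick way to confirm byte order (a generic C sketch, not specific to LC systems) is to examine the first byte of a multi-byte integer; on little-endian x86 hardware it holds the least significant byte:
#include <stdio.h>
int main(void)
{
    unsigned int x = 1;
    /* On a little-endian machine, the least significant byte is stored first. */
    unsigned char *p = (unsigned char *)&x;
    printf("This machine is %s-endian\n", (*p == 1) ? "little" : "big");
    return 0;
}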
64-bit architecture providing full 64-bit memory addressing
Clockspeed: 2.2 - 2.3 GHz
Floating point units: 4 results per clock cycle
Caches:
Cache                      Quad-core
Dedicated L1 Instruction   64 KB, 64-byte line, 2-way associative
Dedicated L1 Data          64 KB, 64-byte line, 2-way associative
Dedicated L2               512 KB
Shared L3                  2 MB
Direct Connect Architecture:
No front side bus - directly connects multiple processor cores,
memory controller and I/O to the Socket F processor.
Helps eliminate the bottlenecks
inherent in a traditional front-side bus.
Integrated (on-die) memory controller. CPU-memory bandwidth is 10.7 GB/s
per Socket F with DDR2-667 DIMMs.
HyperTransport interconnects (3) directly connect sockets to I/O subsystems,
other chipsets, and other sockets (off-chip CPUs).
Provide 8 GB/s bandwidth per
link (4 GB/s each direction). Aggregate bandwidth of 24 GB/s
per processor (shared by all cores).
Full support for Intel's SIMD vectorization instructions (SSE, SSE2, SSE3...)
Reduced power consumption:
Energy efficient DDR2 memory
AMD dynamic power management technology
AMD Virtualization: disparate applications can coexist on same system. Each
application's environment (operating system, middleware, communications,
etc.) is represented as a virtual machine.
Design features common to most modern processors:
Enhanced branch prediction capabilities
Speculative / out-of-order execution
Superscalar
Pipelined architecture
Chipsets/Motherboards:
As with other vendor processors, AMD Opterons are commonly combined with
other components to make a complete system.
Motherboards for Opteron systems are manufactured by a number of vendors.
LC's machines use the SuperMicro H8QM8-2 motherboard for both dual-core
and quad-core machines.
Ports/controllers for SCSI, SATA, IDE, USB, keyboard, mouse, serial,
parallel, etc. (most of these not used by LC machines)
Infiniband Interconnect Overview
Infiniband Interconnect
The high speed interconnect for LC's Linux clusters is Infiniband.
The type of Infiniband hardware varies by cluster. In general:
Most Intel Xeon clusters use 4x QDR QLogic switches and adapters
A few older clusters use 4x DDR Voltaire switches
and Mellanox adapters.
4x Infiniband:
4x = 4 times the base Infiniband link rate of 2.5 Gbits/sec,
which equals 10 Gbits/sec, full duplex.
SDR = Single Data Rate - 10 Gbits/sec (1.25 GB/s)
DDR = Double Data Rate - 20 Gbits/sec (2.5 GB/s)
QDR = Quad Data Rate - 40 Gbits/sec (5 GB/s)
Primary components:
Adapter Card:
Communications processor packaged on network PCI Express adapter card.
Remote Direct Memory Access (RDMA) improves communication bandwidth by
off-loading communications from the CPU.
Provides the interface between a node and a two-stage network.
Copper cabling connects each adapter card to a first stage
switch, in most cases. A few LC clusters (ansel, catalyst)
without a second stage switch use
optic fiber to connect to a single core switch.
QLogic QDR or Mellanox 4x DDR
[Photos: QLogic IB adapter; Mellanox IB adapter]
1st Stage Switch:
QLogic QDR 36-port: 18 ports connect to adapters in nodes and 18
ports connect to a second stage switch.
Voltaire DDR 24-port: 12 ports connect to adapters in nodes and 12
ports connect to a second stage switch.
2nd Stage Switch:
QLogic QDR 18-864 port: all used ports connect to first stage switches
via fiber optic cabling.
Voltaire DDR 288-port: all ports connect to first stage switches
via copper cabling.
[Photos: QLogic 1st and 2nd stage switches (back); Voltaire 1st stage switches (front, back); Voltaire 2nd stage switch (front, back)]
Topology:
Two-stage, federated, bidirectional, fat-tree.
The number of second stage switches depends upon the number
of scalable units (SUs) that comprise the cluster and the type of switch
hardware used.
Examples:
Juno (8 SUs): 1152-way interconnect
Sierra (12 SUs): 1944-way interconnect
Performance:
The inter-node bandwidth measurements below were taken on live, heavily
loaded LC machines, using a simple MPI non-blocking test code with one
task on each of two nodes. Not all systems are represented. Your mileage
may vary.
System Type                                                  Latency   Bandwidth
Older clusters with DDR Voltaire/Mellanox (Juno)             ~3-5 us   ~2.4 GB/sec
Intel Xeon clusters with QDR QLogic (Muir, Sierra, Ansel)    ~1-2 us   ~4.1 GB/sec
Intel Xeon clusters with QDR QLogic (TLCC2)                  ~1 us     ~5.0 GB/sec
Software and Development Environment
This section only provides a summary of the software and development
environment for LC's Linux clusters. Please see the
Introduction to LC Resources tutorial for details.
TOSS Operating System:
All LC Linux clusters use TOSS (Tri-Laboratory Operating System Stack).
The primary components of TOSS include:
Red Hat Enterprise Linux (RHEL) distribution with modifications to
support targeted HPC hardware and cluster computing
RHEL kernel optimized for large scale cluster computing
OpenFabrics Enterprise Distribution InfiniBand software stack including
MVAPICH and OpenMPI libraries
Moab workload manager and SLURM resource manager
Integrated Lustre and Panasas parallel file system software
Scalable cluster administration tools
Cluster monitoring tools
C, C++ and Fortran90 compilers (GNU, Intel, PGI)
Testing software framework for hardware and operating system
validation
Batch Systems:
SLURM
LC's Simple Linux Utility for Resource Management
SLURM is the native job scheduling system on each cluster
The SLURM resource manager on one cluster does not communicate with
the SLURM resource managers on other clusters
File Systems:
Lustre parallel file systems - shared by all users on a cluster or
multiple clusters; usually mounted by multiple clusters. Lustre is
discussed in the Parallel File Systems section of
the Introduction to Livermore Computing Resources tutorial.
/nfs/tmp# - large, globally mounted NFS file systems shared across all
machines by all users; not backed up; purged; quotas in effect.
/var/tmp, /usr/tmp, /tmp - different names for the same file system;
locally mounted (non-NFS); moderate size; not backed up; purged;
shared by all users on a given node.
Archival HPSS storage - accessed by "ftp storage".
Virtually unlimited file space; not backed up or purged. More info:
computing.llnl.gov/jobs/#sbp - see the section on storage.
LC provides the Dotkit system as a convenient, uniform way to select among
multiple versions of software installed on the LC systems.
Dotkit provides a simple user interface that is standard across UNIX shells.
Dotkit is most often used to setup your environment for a particular
compiler or application that isn't in your default environment.
Using Dotkit:
List available packages: use -l
Load a package: use package_name
Unload a package: unuse package_name
List loaded packages: use
Search available packages: use -l keyword
Read package help info: use -h package_name
Display package contents: use -hv package_name
Show default packages: dpkg-defaults (look for asterisks)
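For example, to switch to a specific GNU compiler version (the package name here, gcc-4.9.2p, is one of those shown in the compiler listing later in this tutorial):
% use -l compilers        (list available compiler packages)
% use gcc-4.9.2p          (load the chosen package)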
Development Environment Group supported software includes compilers, libraries, debugging, profiling, trace
generation/visualization, performance analysis tools, correctness tools, and several utilities:
https://computing.llnl.gov/?set=code&page=software_tools
LINMath, the Livermore Interactive Numerical Mathematical Software Access Utility, is a Web-based access utility for math library software. The LINMath Web site also has pointers to packages available from external sources:
http://www-lc.llnl.gov/linmath/
Center for Applied Scientific Computing (CASC) Software
A wide range of software available for download from LLNL's CASC. Includes mathematical software, language tools, PDE software frameworks, visualization, data analysis, program analysis, debugging, and benchmarks:
https://computation.llnl.gov/casc/software.php
Man Pages:
Linux man pages (and SLURM man pages) are located in
/usr/share/man.
If you have problems bringing up the man page for a Linux command, check
your MANPATH variable for /usr/share/man.
Compilers
General Information
Available Compilers and Invocation Commands:
The table below summarizes compiler availability and invocation commands
on LC Linux clusters.
Note that parallel compiler commands are actually LC scripts that
ultimately invoke the corresponding serial compiler.
Compiler   Language   Serial Command   Parallel Command
Intel      C          icc              mpiicc
           C++        icpc             mpiicpc
           Fortran    ifort            mpiifort
PGI        C          pgcc             mpipgcc
           C++        pgCC             mpipgCC
           Fortran    pgf77, pgf90     mpipgf77, mpipgf90
GNU        C          gcc              mpicc, mpigcc
           C++        g++              mpiCC, mpig++
           Fortran    g77, gfortran    mpif77, mpif90, mpigfortran
Versions, Defaults and Paths:
LC usually maintains multiple versions of each compiler on a cluster. To
see all available versions, login and use the command
use -l compiler. Example (truncated) below:
% use -l gnu
compilers/gnu ----------
gcc-4.4.6 - GNU Compiler Collection (gcc/g++/gfortran)
gcc-4.6.1 - GNU Compiler Collection (gcc/g++/gfortran)
gcc-4.7.1p - Pure GNU Compilers (gcc/g++/gfortran) provided AS IS
gcc-4.8.2p - Pure GNU Compilers (gcc/g++/gfortran) provided AS IS
gcc-4.9.0p - Pure GNU Compilers (gcc/g++/gfortran) provided AS IS
gcc-4.9.2p - Pure GNU Compilers (gcc/g++/gfortran) provided AS IS
gcc - GNU Compiler Collection (gcc/g++/gfortran)
...
[ output omitted here ]
...
mpi/openmpi ----------
openmpi-gnu-1.4.3 - Open MPI v1.4.3 for GNU compilers
openmpi-gnu-1.6.5 - Open MPI v1.6.5 for GNU compilers
openmpi-gnu-1.8.3 - Open MPI v1.8.3 for GNU compilers
openmpi-gnu-1.8.4 - Open MPI v1.8.4 for GNU compilers
openmpi-gnu-debug-1.4.3 - Open MPI v1.4.3 for GNU compilers (debug)
openmpi-gnu-debug-1.6.5 - Open MPI v1.6.5 for GNU compilers (debug)
openmpi-gnu-debug-1.8.3 - Open MPI v1.8.3 for GNU compilers (debug)
openmpi-gnu-debug-1.8.4 - Open MPI v1.8.4 for GNU compilers (debug)
openmpi-gnu-debug - Open MPI v1.4.3 for GNU compilers (debug)
To see the default compiler version for LC machines, consult the
Compilers Installed on LC Platforms web page.
Alternately, use the dpkg-defaults command to list
software versions; defaults are denoted by an asterisk.
To determine the actual version you are using, issue the compiler invocation
command with its "version" option. For example:
Compiler   Option           Example
Intel      -V               ifort -V
PGI        -V               pgf90 -V
GNU        -v, --version    g++ --version
To use a specific version of a compiler other than the default do
the following:
Login to the machine where you will be working
Issue the use -l compilers
command to list all available compiler versions.
See what is available and select one, noting its package/dotkit name
Issue the command: use package-name
Floating-point Exceptions:
The IEEE floating point standard defines several exceptions (FPEs)
that occur
when the result of a floating point operation is unclear or undesirable:
overflow: an operation's result is too large to be represented as a
float. Can be trapped, or else returned as a +/- infinity.
underflow: an operation's result is too small to be represented as a
normalized float. Can be trapped, or else represented
as a denormalized float (zero exponent with non-zero fraction) or zero.
divide-by-zero: attempting to divide a float by zero. Can be trapped,
or else returned as a +/- infinity.
inexact: result was rounded off. Can be trapped or returned
as rounded result.
invalid: an operation's result is ill-defined, such as 0/0 or the
sqrt of a negative number. Can be trapped or returned as NaN (not a
number).
By default, the Opteron and Xeon processors used at LC mask/ignore
FPEs. Programs that encounter FPEs will not terminate
abnormally, but instead, will continue execution
with the potential of producing wrong results.
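A small C sketch (generic, not LC-specific; link with -lm if needed) illustrating this default masked behavior - each operation below raises an FPE, yet the program runs to completion and simply prints inf, nan, or a denormalized value:
#include <stdio.h>
#include <math.h>
int main(void)
{
    /* volatile prevents the compiler from folding these at compile time */
    volatile double huge = 1e308, tiny = 1e-308, zero = 0.0, neg = -1.0;
    printf("overflow:       %g\n", huge * 10.0);   /* inf */
    printf("underflow:      %g\n", tiny / 1e10);   /* denormalized value */
    printf("divide by zero: %g\n", 1.0 / zero);    /* inf */
    printf("invalid:        %g\n", sqrt(neg));     /* nan */
    return 0;
}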
Compilers differ in their ability to handle FPEs:
Intel's Fortran compiler offers some means of controlling FPE handling
through the -fpe* options.
Intel's C/C++ compiler provides the -fp-trap*
options.
PGI compilers offer the convenient -Ktrap* options to
unmask FPEs.
GNU Fortran: offers the -ffpe-trap=* options
GNU C/C++ compilers: using the feenableexcept()
routine may help. See the man page for details. Simple example below:
#define _GNU_SOURCE   /* feenableexcept() is a GNU extension */
#include <fenv.h>
#include <stdio.h>
int main(int argc, char **argv) {
double zerodiv = 0.0;
feenableexcept(FE_ALL_EXCEPT); // Enable (unmask) all floating point exceptions
double nanval = 0.0/zerodiv;   // Raises FE_INVALID; program aborts with SIGFPE here
printf("Should not get here: zerodiv=%lf, nanval=%lf\n",zerodiv,nanval);
}
Unrelated to compilers, Linux treats integer divides-by-zero as
floating-point exceptions. For example, the following two codes will
behave differently. The integer divide by zero will abort execution,
while the floating point divide by zero will not.
This applies to both Fortran and C/C++.
#include <stdio.h>
int main()
{
int i = 2;
i /= 0;   /* integer divide by zero: raises SIGFPE and aborts */
printf("i = %d\n",i);
}
#include <stdio.h>
int main()
{
float i = 2;
i /= 0;   /* floating-point divide by zero: masked by default, i becomes inf */
printf("i = %f\n",i);
}
Precision, Performance and IEEE 754 Compliance:
Typically, most compilers do not guarantee IEEE 754 compliance
for floating-point arithmetic unless it is explicitly specified by a
compiler flag. This is because compiler optimizations are performed
at the possible expense of precision and performance.
Unfortunately for most programs, adhering to IEEE floating-point
arithmetic adversely affects performance.
If you are not sure whether your application needs this, try
compiling and running your program both with and without it to
evaluate the effects on both performance and precision.
See the relevant compiler documentation for details.
Mixing C and Fortran:
If you are linking C/C++ and FORTRAN code together, and need to explicitly
specify the FORTRAN or C/C++ libraries on the link line, LC provides a
general recommendation and example in the
/usr/local/docs/linux.basics file. See the
"MIXING C AND FORTRAN" section.
All of the other issues involved with mixed language programming apply, such as:
Column-major vs. row-major array ordering
Routine name differences - appended underscores
Arguments passed by reference versus by value
Common blocks vs. extern structs
Memory alignment differences
File I/O - Fortran unit numbers vs. C/C++ file pointers
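For illustration, a sketch of the "appended underscore, pass by reference" conventions noted above when calling Fortran from C. The routine name fsum and its argument list are hypothetical, and name-mangling details vary by compiler:
/* Fortran side (fsum.f90, hypothetical):
 *     subroutine fsum(a, n, s)
 *     integer n
 *     real*8 a(n), s
 *     ...
 */
#include <stdio.h>

/* Most Linux Fortran compilers append an underscore to external names. */
extern void fsum_(double *a, int *n, double *s);

int main(void)
{
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    int n = 4;
    double s = 0.0;
    fsum_(a, &n, &s);    /* all arguments passed by reference */
    printf("sum = %f\n", s);
    return 0;
}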
If your executable contains a large amount of static data, LC recommends
that you compile with the -mcmodel=medium option.
This flag allows the Data and .BSS (static variables) sections of your
executable to extend beyond a default 2 GB limit.
True for all compilers (see the respective compiler man page).
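For example, a program like the sketch below (the array size is illustrative) declares roughly 2.4 GB of static data and will fail to link without the flag:
#include <stdio.h>

/* 300,000,000 doubles = ~2.4 GB of static data, beyond the default
   2 GB limit, so build with, e.g., icc -mcmodel=medium big.c */
static double big[300000000];

int main(void)
{
    big[0] = 42.0;
    printf("big[0] = %f\n", big[0]);
    return 0;
}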
Compiler Documentation and Man Pages:
Man Pages:
Compiler man pages are for the default version of the compiler.
If you want to see the man pages for a different compiler version, load your
environment with the compiler version of interest with
the use package command first. For example:
use icc-12.1.339. Then issue the man command.
Additionally the info utility can be used to
view man pages and info docs.
Documentation:
Intel: compiler docs are included in the installation directory under
/usr/local/tools/compilername.
Otherwise, see Intel's web pages.
PGI: compiler docs are included in the installation directory under
/usr/local/tools/pgi. Otherwise, see PGI's web
pages.
Intel Compilers
General Information:
The Intel compilers are specifically designed to take advantage of
Intel chip architectures.
For Opteron clusters, LC supports the Intel compilers with the
caveat that they have not been rigorously tested on this architecture.
Installed under /usr/local/tools/icc* and
/usr/local/tools/ifort*.
There can be several versions of the Intel compilers available.
To see what is currently available on the machine you are using, issue
the use -l compilers command.
To select a particular version,
use the use package-name command.
Full compiler documentation is available in the installation directory.
Complete man pages are also available for each compiler version after you
load the package of choice as described above.
GNU compatibility: Objects created with the Intel C/C++ compiler are
binary compatible with GNU gcc. Also, the Intel C/C++ compiler supports
many of the language extensions of
gcc and g++. See the Intel C++ Compiler User's Guide for details.
Compiler Invocation Commands:
icc        serial/OpenMP C
icpc       serial/OpenMP C++
ifort      serial/OpenMP Fortran 77 and 90
mpiicc     script for C with MPI
mpiicpc    script for C++ with MPI
mpiifort   script for Fortran with MPI
Note: The Intel C and C++ compiler are actually the same compiler - the
Intel C++ compiler product. The invocation command (icc or icpc)
determines how the compiler behaves. Be sure to use the appropriate
invocation command for your source - don't use icc for C++ or icpc
for C.
Common / Useful Options:
A few useful compiler options are shown below; some apply only to C/C++
or only to Fortran (see the compiler man pages). Interested users will
definitely want to consult the relevant Intel documentation.
Option
Description
-I, -L, -l, -lm, -c, -o
As usual
-ansi_alias
-no-ansi_alias
Can help performance. Directs the compiler to assume the
following:
-Arrays are not accessed out of bounds.
-Pointers are not cast to non-pointer types, and vice-versa.
-References to objects of two different scalar types cannot alias.
If your program does not satisfy one of the above conditions, this flag
may lead the compiler to generate incorrect code.
For Fortran, conformance is according to Fortran 95 Standard type
aliasability rules.
C/C++ Default = -no-ansi_alias (off)
Fortran Default = -ansi_alias (on)
-assume keyword -assume buffered_io
Specifies assumptions made by the compiler. One option that may improve
I/O performance is buffered_io, which causes sequential file
I/O to be buffered rather than being written to disk immediately.
See the ifort man page for details.
-auto
-automatic
-nosave
-save
-noauto
-noautomatic
Places variables, except those declared as SAVE, on the run-time stack.
The default is -auto_scalar (local scalar of types INTEGER, REAL,
COMPLEX, or LOGICAL are automatic). However, if you specify -recursive
or -openmp, the default is -auto.
Places variables, except those declared as AUTOMATIC, in static memory.
However, if you specify -recursive or -openmp, the default is -auto.
-autodouble
Defines real variables to be REAL(KIND=8). Same as specifying -r8.
-convert keyword
Specifies the format for unformatted files, such as big endian, little
endian, IBM 370, Cray, etc.
-Dname[=value]
Defines a macro name and associates it with a specified
value. Equivalent to a #define preprocessor directive.
-fast
Shorthand for several combined optimization options:
-O3, -ipo, -static
-fpp
-cpp
Invoke the Fortran preprocessor. -fpp and -cpp are equivalent.
-fpe<n> (Fortran)
Specifies the run-time floating-point exception handling behavior. See the compiler man page for details.
-g
Build with debugging symbols. Note that -g does not imply -O0 in the
Intel compilers; -O0 must be specified explicitly to turn all optimizations
off.
-help
Print compiler options summary
-ip
Enable single-file interprocedural optimizations.
-ipo
Enable multi-file interprocedural optimizations.
-mcmodel=medium
Required if executable is greater than 2 GB. The program is linked in the lower 2 GB of the address space but symbols can be located anywhere in the address space. Programs can be statically or dynamically linked, but building of shared libraries are not supported with the medium model.
-mp
'Maintain precision' - favor conformance to IEEE 754 standards for
floating-point arithmetic.
-mp1
Improve floating-point precision - less speed impact than -mp.
-O0
Turn off optimizer - recommended if using -g for debugging.
-O, -O1, -O2, -O3
Optimization levels. (O,O1,O2 are essentially equivalent). -O3 is
the most aggressive optimization level. Note that Intel compilers
perform optimization by default.
PGI Compilers
General Information:
The PGI C, C++ and Fortran compilers, as well as the PGI debugger
and profiler tools, are available on LC's Linux clusters.
The PGI compilers are placed in your PATH and MANPATH environment
variables by default.
Installed under /usr/local/tools/pgi*.
There can be several versions of the PGI compilers available.
To see what is currently available on the machine you are using, issue
the use -l compilers command.
To select a particular version,
use the use package-name command.
Full PGI documentation is available in the
doc install directory. Complete man pages are
also available for every PGI tool after you load the version of choice
as described above.
The PGI compilers support SSE instructions.
The PGI compilers provide a compiler option for unmasking
floating-point exceptions. See the -Ktrap option in the table below.
Compiler Invocation Commands:
pgcc                  serial/OpenMP C
pgCC                  serial/OpenMP C++
pgf77, pgf90          serial/OpenMP Fortran 77 and 90
mpipgcc               script for C with MPI
mpipgCC               script for C++ with MPI
mpipgf77, mpipgf90    scripts for Fortran 77 and 90 with MPI
Common / Useful Options:
A few useful compiler options are shown below. Interested users will
definitely want to consult the relevant PGI documentation.
Option
Description
-I, -L, -l, -lm, -c, -o
As usual
-fast
Turn on optimizations, including vectorization
-fpic
Instructs the compiler to generate position independent code which
can be used to create shared object files (dynamically linked
libraries).
-help
Display help
-Kieee
Force IEEE 754 arithmetic
-Ktrap=[option]
Unmask selected FPU exceptions. Options include
fp,inv,denorm,divz,ovf,unf,inexact
-lpthread
Link with pthreads library.
-mcmodel=medium
Required if executable is greater than 2 GB. The program is linked in the lower 2 GB of the address space but symbols can be located anywhere in the address space. Programs can be statically or dynamically linked, but building of shared libraries are not supported with the medium model.
-mp
Turn on OpenMP
-Mvect=[prefetch,sse]
Enable prefetch, SSE
-Mlist
Create a listing file
-Minfo=all
Generate an optimization/vectorization report
-Mipa
Enable interprocedural analysis. See man page for details.
-O, -O1, -O2, -O3/O4
Optimization levels. The default is -O1.
-pc32
-pc64
-pc80
Set precision of FPU significand to 32, 64, or 80 bits respectively
-p, -pg
Enable gprof-style sample-based profiling
-V
Display version information
-v
Verbose mode
-w
Suppress warning messages
Compilers
GNU Compilers
General Information:
The GNU C, C++ and Fortran compilers are available on LC's Linux clusters.
The GNU compilers are placed in your PATH and MANPATH environment
variables by default.
Installed under /usr/local/tools/gcc*.
There can be several versions of the GNU compilers available.
To see what is currently available on the machine you are using, issue
the use -l compilers command.
To select a particular version,
use the use package-name command.
The Fortran compilers support all of the C compiler options, plus
options specific to Fortran.
The GNU compilers support SSE instructions. Additionally, the GNU C/C++
compiler supports Intel's AVX instruction set.
LC's GNU compilers on Linux systems are versions provided by RedHat, not vanilla versions available from GNU. They may perform differently or have different features than the vanilla GNU builds. Builds of GNU's versions of these compilers tend not to work on LC's Linux systems because of incompatibilities with binutils.
The GNU compilers offer over 1,200 options! A few useful compiler options are shown below. Interested users will definitely want to consult relevant man page(s) and GNU documentation.
Option
Description
-I, -L, -l, -lm, -c, -o
As usual
-ansi
Adhere to ANSI rules.
-fopenmp
Compile with OpenMP support.
-fmem-report -ftime-report -Q
Display to stdout some statistics (memory allocations, time used, etc.) after the compilation is completed.
-ftree-vectorize
-ftree-vectorizer-verbose=0-6
Turn on vectorization (use with -O1, -O2 or -O3). The verbose option generates a vectorization report - levels 0 (none) to 6 (max info).
-g -gstring
Produce debugging information. string specifies a particular type of debugging information/format
-lacml
Link with the AMD Core Math Library for optimized BLAS, LAPACK and FFT
routines.
-mavx
Utilize the advanced vector extension (AVX) streaming SIMD instructions.
Sandy Bridge processor only. Not available with GNU Fortran.
-mcmodel=medium
Required if executable is greater than 2 GB. The program is linked in the lower 2 GB of the address space but symbols can be located anywhere in the address space. Programs can be statically or dynamically linked, but building of shared libraries are not supported with the medium model.
-mtune=cpu-type -march=cpu-type
Architecture specific tuning options - see man page. Use opteron for LC's Opteron clusters.
-O0, -O, -O1, -O2, -O3
Levels of optimization. -O0 is no optimization (default). -O and -O1 are the same. -O2 and -O3 are increasingly aggressive optimizations.
-p, -pg
Generate extra code to write profile information suitable for the analysis program gprof.
-pthread
Compile with pthreads support.
-v
Verbose mode. Display compiler version, paths, libraries, etc. information.
-Wstring -Wno-string
Turn on/off warning messages, where string describes the message type (see the man page)
Linux Clusters Overview Exercise 1
Getting Started
Overview:
Login to an LC cluster using your workshop username and OTP token
Copy the exercise files to your home directory
Familiarize yourself with the cluster's configuration
MPI
MVAPICH:
MPI-1 implementation that also includes support for MPI-I/O
Based on MPICH-1.2.7 MPI library from Argonne National Laboratory
Not thread-safe. All MPI calls should be made by the master thread in
a multi-threaded MPI program.
See /usr/local/docs/mpi.mvapich.basics for LC
usage details.
MVAPICH2
Multiple versions available
MPI-2 and MPI-3 implementations based on MPICH MPI library from Argonne
National Laboratory. Versions 1.9 and later implement MPI-3 according to the
developer's documentation.
Not currently the default - requires the "use" command to load the
selected dotkit package:
use -l mvapich           (list available packages)
use mvapich2-intel-2.1   (use the package of interest)
Thread-safe
See /usr/local/docs/mpi.mvapich2.basics for LC
usage details.
MPI executables are launched using the SLURM srun command
with the appropriate options. For example, to launch an 8-process MPI job
split across two different nodes in the pdebug pool:
srun -N2 -n8 -ppdebug a.out
The srun command is discussed in detail in the
Running Jobs
section of the Linux Clusters Overview tutorial.
Open MPI:
Be sure to load the same Open MPI dotkit package that you used to
build your executable. If you are running a batch job, you will need
to load the dotkit package in your batch script.
Launching an Open MPI job is done differently
than with MVAPICH MPI - the mpiexec command
is required. For example, to run a 48-process MPI job:
mpiexec -n 48 a.out
LC-developed MPI compiler wrapper scripts are used to compile MPI programs.
They automatically perform some error checks, include the appropriate
MPI #include files, link to the necessary MPI libraries, and pass options to
the underlying compiler.
For MVAPICH2 and Open MPI, you must first load the desired
dotkit package with the use
command. For example:
use -l openmpi          (list available packages)
use openmpi-gnu-1.8.4   (use the package of interest)
Failing to do this will result in getting the MVAPICH 1.2 implementation.
For additional information:
See the man page (if it exists)
Issue the script name with the -help option
View the script yourself directly
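As a quick sanity check of any of these wrapper scripts, a minimal MPI program such as the sketch below can be compiled (for example, mpicc hello.c -o hello) and launched with srun:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */
    printf("Hello from task %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}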
MPI Build Scripts
Implementation   Language   Script Name              Underlying Compiler
MVAPICH 1.2      C          mpicc                    gcc - GNU
                            mpigcc                   gcc - GNU
                            mpiicc                   icc - Intel
                            mpipgcc                  pgcc - PGI
                 C++        mpiCC                    g++ - GNU
                            mpig++                   g++ - GNU
                            mpiicpc                  icpc - Intel
                            mpipgCC                  pgCC - PGI
                 Fortran    mpif77                   g77 - GNU
                            mpigfortran              gfortran - GNU
                            mpiifort                 ifort - Intel
                            mpipgf77                 pgf77 - PGI
                            mpipgf90                 pgf90 - PGI
MVAPICH2         C          mpicc                    C compiler of dotkit package loaded
                 C++        mpicxx                   C++ compiler of dotkit package loaded
                 Fortran    mpif77                   Fortran77 compiler of dotkit package loaded
                            mpif90                   Fortran90 compiler of dotkit package loaded
Open MPI         C          mpicc                    C compiler of dotkit package loaded
                 C++        mpiCC, mpic++, mpicxx    C++ compiler of dotkit package loaded
                 Fortran    mpif77                   Fortran77 compiler of dotkit package loaded
                            mpif90                   Fortran90 compiler of dotkit package loaded
Level of Thread Support
MPI libraries vary in their level of thread support:
MPI_THREAD_SINGLE - Level 0: Only one thread will execute.
MPI_THREAD_FUNNELED - Level 1:
The process may be multi-threaded, but only
the main thread will make MPI calls - all MPI calls are funneled
to the main thread.
MPI_THREAD_SERIALIZED - Level 2:
The process may be multi-threaded, and
multiple threads may make MPI calls, but only one at a time. That is,
calls are not made concurrently from two distinct threads as all MPI calls
are serialized.
MPI_THREAD_MULTIPLE - Level 3:
Multiple threads may call MPI with no restrictions.
A simple C language example for determining thread level support is shown below.
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
int provided, claimed;
/*** Select one of the following
MPI_Init_thread( 0, 0, MPI_THREAD_SINGLE, &provided; );
MPI_Init_thread( 0, 0, MPI_THREAD_FUNNELED, &provided; );
MPI_Init_thread( 0, 0, MPI_THREAD_SERIALIZED, &provided; );
MPI_Init_thread( 0, 0, MPI_THREAD_MULTIPLE, &provided; );
***/
MPI_Init_thread(0, 0, MPI_THREAD_MULTIPLE, &provided; );
MPI_Query_thread( &claimed; );
printf( "Query thread level= %d Init_thread level= %d\n", claimed, provided );
MPI_Finalize();
}
Sample output:
Query thread level= 3 Init_thread level= 3
Running Jobs
Overview
Big Differences:
LC's Linux clusters can be divided into two types: those having a
high speed interconnect and those that don't. There are significant
differences that arise from this.
Clusters with an interconnect (most Linux clusters):
Designated as a parallel resource. A single job can span many nodes.
Compute nodes are NOT shared with other users or jobs. When your job runs,
the allocated nodes are dedicated to your job alone.
Clusters without an interconnect (aztec, inca, rzcereal):
Designated for serial and single-node parallel jobs only. A job cannot
span more than one node.
Multiple users and their jobs can run on the same node simultaneously,
which can result in competition for resources such as CPU and memory.
Usage differences between these two different types of clusters will
be noted as relevant in the remainder of this tutorial.
Job Limits:
For all production clusters, there are defined job limits which vary from
cluster to cluster. The primary job limits apply to:
How many nodes/CPUs a job may use
How long a job may run
How many jobs may be actively run simultaneously per user
How many nodes a user may use across all running jobs
Most job limits are enforced by the batch system, such as Moab or SLURM.
Some job limits are enforced by a "good neighbor" policy
The easiest way to determine the job limits for a particular machine is to login
to that machine and use the command:
news job.lim.machinename
where machinename is the name of the machine you are logged
into.
Job limits are also documented on LC web pages:
On the LC "Machine Status" web pages (LC internal), click on the machine
name of interest and then look for the "Batch Limits" link/section.
Or go to mylc.llnl.gov and click on any machine name in the
"My Accounts" section.
Further discussion, and a summary table of job limits for all production
machines are available in the Queue Limits section of the Moab tutorial.
Running Jobs
Batch Versus Interactive
Interactive Jobs:
Typically run in the pdebug queue. Some clusters
may have additional queues permitting interactive jobs.
Interactive jobs have lower limits:
Shorter time limit
Fewer nodes permitted
Should NOT be run on login nodes
Can be launched from the command line using the srun
command (covered later), or submitted as a job script to the
pdebug queue using the msub command (below).
May also be submitted to the pbatch queue
using the mxterm command.
Batch Jobs:
Most LC production Linux clusters use the Tri-lab Moab workload scheduler as
the top-level batch system, with SLURM underneath as the local resource
manager.
This section only provides a quick summary of batch usage on
Linux clusters. For details, see the Moab Tutorial.
Batch jobs are typically run in the larger pbatch partition.
Some clusters may have additional queues permitting batch jobs.
All batch jobs must be submitted in the form of a job control script
with the msub command. Examples:
msub myjobscript
msub myjobscript -q ppdebug -a mic
msub myjobscript -l walltime=45:00
Example Moab job submission job control script:
#!/bin/csh
##### Example Moab Job Control Script
##### These lines are for Moab
#MSUB -l nodes=16
#MSUB -l partition=zin
#MSUB -l walltime=2:00:00
#MSUB -A phys
#MSUB -q pbatch
#MSUB -m be
#MSUB -V
#MSUB -o /p/lscratch3/joeuser/par_solve/myjob.out
##### These are shell commands
# Display job information for possible diagnostic use
date
echo "Job id = $SLURM_JOBID"
hostname
sinfo
squeue
# Run info
cd /p/lscratch3/joeuser/par_solve
srun -n256 a.out
echo 'Done'
After successfully submitting a job, you may then check its progress and
interact with it (hold, release, alter, kill) by means of other batch
commands briefly listed below.
Interactive debugging of batch jobs is possible (covered later).
Note that srun options must precede your executable.
Interactive use example, from the login node command line.
Specifies 2 nodes (-N), 16 tasks (-n) and the interactive pdebug partition (-p):
% srun -N2 -n16 -ppdebug myexe
Batch use example requesting 16 nodes and 256 tasks (assumes nodes have 16
cores):
First create a job script that requests nodes via
#MSUB -l nodes and uses srun
to specify the number of tasks and launch the job.
#!/bin/csh
#MSUB -l nodes=16
#MSUB -l partition=zin
#MSUB -l walltime=2:00:00
#MSUB -q pbatch
# Run info and srun job launch
cd /p/lscratch3/joeuser/par_solve
srun -n256 a.out
echo 'Done'
Then submit the job script from the login node command line:
% msub myjobscript
Primary differences between batch and interactive usage:
Where used - Interactive: from the login node command line. Batch: in a
batch script.
Partition - Interactive: requires specification of an interactive
partition, such as pdebug, with the -p flag. Batch: pbatch is the default.
Scheduling - Interactive: if there are available interactive nodes, the
job will run immediately; otherwise it will queue up (FIFO) and wait until
there are enough free nodes to run it. Batch: the batch scheduler handles
when to run your job, regardless of the number of nodes available.
More Examples:
srun -n64 -ppdebug my_app
    64-process job run interactively in the pdebug partition
srun -N64 -n512 my_threaded_app
    512-process job using 64 nodes; assumes pbatch partition
srun -N4 -n32 -c2 my_threaded_app
    4-node, 32-process job with 2 cores (threads) per process; assumes pbatch partition
srun -N8 my_app
    8-node job with a default value of one task per node (8 tasks); assumes pbatch partition
srun -n128 -o my_app.out my_app
    128-process job that redirects stdout to file my_app.out; assumes pbatch partition
srun -n32 -ppdebug -i my.inp my_app
    32-process interactive job; each process accepts input from a file called my.inp instead of stdin
Note the interaction of srun's -N and -n flags. For example, using 4 nodes
in batch (#MSUB -l nodes=4), each of which has 16 cores: srun -n64 places
16 tasks on each node, while srun -N4 -n32 places 8 tasks on each node.
srun options:
srun is a powerful command
with ~100 options affecting a wide range of job parameters.
Many srun options may be set via ~60 SLURM environment
variables. For example, SLURM_NNODES behaves like the -N option.
A short list of common srun options appears below; see the srun man page
for details on options and environment variables.
Option
Description
--auto-affinity=[args]
Specifies how to bind tasks to cpus.
-c [#cpus/task]
The number of CPUs used by each process. Use this option if each
process in your code spawns multiple POSIX or OpenMP threads.
-d
Specify a debug level - integer value between 0 and 5
-i [file]
-o [file]
Redirect input/output to file specified
-I
Allocate CPUs immediately or fail. By default, srun blocks until
resources become available.
-J
Specify a name for the job
-l
Label - prepend task number to lines of stdout/err
-m block|cyclic
Specifies whether to use block (the default) or cyclic distribution
of processes over nodes
--multi-prog config_file
Run a job with different programs, with or without
different arguments, for each task.
Intended for MPMD (multiple program multiple data) model MPI programs.
-n [#processes]
Number of processes that the job requires
-N [#nodes]
Number of nodes on which to run job
-O
Overcommit - srun will refuse to allocate more than one process per CPU
unless this option is also specified
-p [partition]
Specify a partition (cluster) on which to run job
-s
Print usage stats as job exits
-v -vv -vvv
Increasing levels of verbosity
-V
Display version information
Clusters Without an Interconnect - Additional Notes:
The aztec, inca and rzcereal clusters fall into this category.
Don't forget that multiple users and jobs can run on a single node.
For this reason, it is very important that you tell the Moab
scheduler how many cores your job requires:
For batch jobs, be sure to include in your job script:
#MSUB -l ttc=#cores
For interactive jobs, be sure to use the srun -c
flag (described in the srun options table above)
MPI jobs: these are started with the srun command as
described above, but because there is no interconnect, jobs are limited to a
single node and communications are done in shared memory.
Pthreads and OpenMP jobs can be run as usual - keeping in mind the required
number of cores per node, and that a node may be shared with other users.
Interactive srun jobs launched from the command line
should normally be terminated with a SIGINT (CTRL-C):
The first CTRL-C will report the state of the tasks
A second CTRL-C within one second will terminate the tasks
If you started your job with srun's -q option, only
one CTRL-C is required.
WARNING: Do not use kill -9 to stop srun, as
it may not stop the parallel processes it spawned. Because these processes are
running on other nodes, ps -ef
will not show them after you terminate the
srun process, but they will still be there, using resources.
Batch:
For batch jobs, the mjobctl -c
and canceljob commands can be used to terminate
batch jobs. See the Moab tutorial for more information and examples.
Only a brief summary is provided here, and most commands have a number of
options. Please see the
Monitoring Jobs
section in the Moab tutorial for details and for additional commands.
showq - A Moab command
that displays all running, idle and blocked jobs across all
machines in a Moab grid. Options exist to filter a subset of these jobs. The
example below shows only running jobs on the cab cluster.
% showq -r -p cab
active jobs------------------------
JOBID S PAR EFFIC XFACTOR Q USERNAME ACCNT MHOST NODES REMAINING STARTTIME
892013 R cab 0.00 6.0 no x2wwwta1 hph cab167 4 00:00:16 Wed May 28 17:23:38
892014 R cab 0.00 6.0 no x2wwwta1 hph cab320 4 00:07:58 Wed May 28 17:31:20
892016 R cab 0.00 6.0 no deer456x hph cab225 4 00:07:58 Wed May 28 17:31:20
892007 R cab 0.00 6.0 no uu45r99b hph cab159 4 00:07:58 Wed May 28 17:31:20
...
[ output omitted here ]
...
897911 R cab 0.00 1.0 no g99y58 axvnv cab273 4 15:13:50 Thu May 29 08:37:12
897914 R cab 0.00 1.0 no g99y58 axvnv cab426 4 15:19:04 Thu May 29 08:42:26
897934 R cab 0.00 1.0 no l8nk6 zpinch cab56 13 15:20:07 Thu May 29 08:43:29
897935 R cab 0.00 1.0 no l8nk6 zpinch cab68 13 15:38:34 Thu May 29 09:01:56
897921 R cab 0.00 1.0 no be432min aprod cab135 16 15:47:10 Thu May 29 09:10:32
100 active jobs 19088 of 19648 processors in use by local jobs (97.15%)
1225 of 1228 nodes active (99.76%)
Total jobs: 100
checkjob - A Moab
command that displays detailed job state information and diagnostic output for a
selected job.
ju - An LC command
that shows partitions and job usage in a concise format. In the example below,
some output has been truncated due to its width.
% ju
Partition total down used avail cap Jobs
pdebug 64 0 55 9 86% bbnliu-16, pmorris-1, rreed-8....
pbatch 1048 5 818 225 78% ggk-1, vo4-320, griinman-25, ...
mjstat -
An LC command showing a cluster's running job queue. For example:
% mjstat
Scheduling pool data:
-----------------------------------------------------------
Pool Memory Cpus Total Usable Free Other Traits
-----------------------------------------------------------
pdebug 2000Mb 8 16 16 7 (null)
pbatch* 2000Mb 8 255 254 16 (null)
Running job data:
-------------------------------------------------------------------
JobID User Nodes Pool Status Used Master/Other
-------------------------------------------------------------------
15936 a3brg2 8 pbatch CG 6:00:42 sierra44
15662 turh 1 pbatch CG 11:48:06 sierra24
16316 g6asco 8 pbatch PD 0:00 (JobHeld)
14353 r4dd 16 pbatch PD 0:00 (JobHeld)
15922 turh 8 pbatch R 4:49:47 sierra57
...
[ output omitted here ]
...
164331 je44g 2 pdebug R 18:20 sierra8
164332 je44g 2 pdebug R 15:57 sierra16
sinfo - SLURM command for
displaying configuration and usage information. The "infinite" time limit for
pbatch is set because Moab is used to control the actual job time limit,
which may vary.
% sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
pdebug up 30:00 9 alloc cab[11-19]
pdebug up 30:00 7 idle cab[4-10]
pbatch* up infinite 7 drain* cab[21,154,793,845,856,858,862]
pbatch* up infinite 1 drain cab340
pbatch* up infinite 792 alloc cab[20,22-153,155-275,300-339,341-551,556-575,588-792,794-839,844,
846-855,857,859-861,863]
squeue - SLURM command
for displaying information about running jobs. In the example below, some output
has been truncated due to its width.
% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
31250 pbatch RSRMV_3. rfffdder R 9:16:52 32 sierra[173-181...
30677 pbatch relax_Va a4ebb2 R 9:16:38 40 sierra[20-27,58-65...
...
[ output omitted here ]
...
30724 pbatch 6size.0. bccce R 4:29:33 64 sierra[28-38,87-98,...
29170 pbatch ttest5a. re5uuh R 4:23:02 16 sierra[66-81]
Running Jobs
Optimizing CPU Usage
Clusters with an Interconnect:
Fully utilizing the cores on a node requires that you use the right
combination of srun and Moab options, depending upon
what you want to do and which type of machine you are using.
MPI only: for example, if you are running on a cluster that has
16 cores per node, and you want your job to use all 16 cores on 4 nodes (16 MPI
tasks per node), then you would do something like:
Interactive:
srun -n64 -ppdebug a.out
Moab batch:
#MSUB -l nodes=4
srun -n64 a.out
MPI with Threads:
If your MPI job uses POSIX or OpenMP threads within each node, you will
need to calculate how many cores will be required in addition to the
number of tasks.
For example, running on a cluster having 16 cores per node, an 8-task
job where each task creates 4 OpenMP threads, would need a total of 32 cores, or
2 nodes: 8 tasks * 4 threads / 16 cores/node = 2 nodes
Interactive:
srun -N2 -n8 -ppdebug a.out
Moab batch:
#MSUB -l nodes=2
srun -N2 -n8 a.out
Don't forget that the default MPI on LC's Linux clusters is not
thread-safe, so the master thread must perform all MPI calls.
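A minimal sketch of this "funneled" pattern (the reduction computed here is illustrative): threads work inside the OpenMP parallel region, and only the master thread, outside that region, makes MPI calls.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, provided;
    double local = 0.0, total = 0.0;

    /* Request funneled support: only the main thread will call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP threads compute in parallel; no MPI calls in this region. */
    #pragma omp parallel reduction(+:local)
    local += omp_get_thread_num() + 1.0;   /* stand-in for real work */

    /* Back on the master thread only: combine results across tasks. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}
This could be built with, for example, mpicc -fopenmp hybrid.c (the OpenMP flag varies by compiler) and launched with srun as shown above.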
You can include multiple srun commands within your batch
job command script. For example, suppose that you were conducting a
scalability run on a 16-core/node Linux cluster.
You could allocate the maximum number of nodes
that you would use with #MSUB -l nodes= and then have a
series of srun commands that use varying numbers of nodes:
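For example, a sketch (node and task counts are illustrative, assuming 16 cores per node):
#MSUB -l nodes=8
srun -N1 -n16 a.out
srun -N2 -n32 a.out
srun -N4 -n64 a.out
srun -N8 -n128 a.out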
The issues for CPU utilization are different for clusters without a
switch for several reasons:
Jobs can be serial
Nodes may be shared with other users
Heavy utilization
Jobs are limited to one node
On these systems, the more important issue is overutilization of cores
rather than underutilization.
IMPORTANT: For batch jobs, you need to specify how many cores your
job will require. The way to do this is with the Moab ttc= option:
#MSUB -l ttc=#cpus
This allows Moab to effectively schedule the node
for other jobs if you don't use all of a node's cores.
This is particularly important for MPI jobs and threaded codes.
Failure to do this can/will result in:
The batch scheduler assuming by default that your job requires only
one core.
Other jobs being scheduled on the node you're using because the
scheduler thinks your job only needs one core.
Poor performance and jobs taking longer to run than expected because
of oversubscription. Jobs may be killed because they will take longer
than the time limit specified.
Upsetting other users because you've oversubscribed the node.
Wasting the time and effort of LC support staff because of upset
users calling about unexpected job behavior.
Important notes for Aztec and Inca users: Please see
"Running on the Aztec
and Inca Clusters" section of the Moab tutorial for important
information about running MPI jobs on these clusters.
Running Jobs
Memory Considerations
64-bit Architecture Memory Limit:
Because LC's Linux clusters employ a 64-bit architecture, 16 exabytes of
memory can be addressed - which is about 4 billion times more than 4 GB limit
of 32-bit architectures. By current standards, this is virtually unlimited
memory.
In reality, machines are usually configured with only GBs of memory, so any
address access that exceeds physical memory will result (on most systems)
in paging and degraded performance.
However, LC machines are an exception to this because they have no
local disk - see below.
LC's Diskless Linux Clusters:
LC's Linux clusters are configured with diskless compute nodes. This has
very important implications for programs that exceed physical memory. For
example, most compute nodes have 16-64 GB of physical memory.
Because compute nodes don't have disks, there is no virtual (swap) memory,
which means there is no paging. Programs that exceed physical memory will
terminate with an OOM (out of memory) error and/or segmentation fault.
Compiler, Shell, Pthreads and OpenMP Limits:
Compiler data structure limits are in effect, but may be handled
differently by different compilers.
Shell stack limits: most are set to "unlimited" by default.
Pthreads stack limits apply, and may differ between compilers.
OpenMP stack limits apply, and may differ between compilers.
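As a sketch, thread stack sizes might be raised as follows (the value
shown is illustrative only; OMP_STACKSIZE is the portable OpenMP 3.0
name, and some compilers also honor their own variables, such as Intel's
KMP_STACKSIZE):

    ##### csh/tcsh: remove the shell's own stack limit
    limit stacksize unlimited
    ##### Request a 512 MB stack for each OpenMP thread
    setenv OMP_STACKSIZE 512M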
MPI Memory Use:
All MPI implementations require additional memory use. This varies
between MPI implementations and between versions of any given
implementation.
The amount of memory used increases with the number of MPI tasks.
Determining how much memory the MPI library uses can be accomplished
by using various tools, such as TotalView's Memscape feature.
Large Static Data:
If your executable contains a large amount of static data, LC recommends
that you compile with the -mcmodel=medium option.
This flag allows the Data and .BSS sections of your executable to extend
beyond a default 2 GB limit.
True for all compilers (see the respective compiler man page).
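For example, a hypothetical Fortran code with large static arrays might
be compiled as shown below (with the Intel compilers, -mcmodel=medium
also requires linking against the shared Intel libraries via
-shared-intel; check your compiler's man page for specifics):

    ifort -mcmodel=medium -shared-intel bigdata.f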
Opterons Only: NUMA and Infiniband Adapter Considerations:
Each Opteron node has 4 Socket F processors - either dual-core or quad-core.
Each Socket F is directly connected to its own local DIMMs with a bandwidth
of 10.7 GB/s.
Shared memory access to the DIMMs of other sockets occurs over the
HyperTransport links at 8 GB/s (4 GB/s each direction).
To improve performance, LC "pins" MPI tasks to a specific CPU. This
promotes memory locality and reduction of shared memory accesses
across multiple HyperTransport links to DIMMs on other sockets.
Only 1 socket is directly connected to the Infiniband adapter. This
means that communications from other sockets need to traverse more than
one HyperTransport link to reach the IB adapter, with the possibility of
degraded bandwidth - unofficial tests suggest up to a 30% reduction.
Running Jobs
Vectorization and Hyper-threading
Vectorization:
All of LC's Linux clusters and their compilers (Intel, PGI, GNU)
support Intel's Streaming SIMD Extensions (SSE, SSE2, SSE3...)
vector instructions.
The primary purpose of these instructions is to increase CPU
throughput by performing operations on vectors of data elements,
rather than on single data elements. For example:
Sandy Bridge-EP (TLCC2) processors, and later models, have 256-bit vector
registers designed to take advantage of Intel's Advanced Vector
Extension (AVX) instructions.
AVX instructions operate on 8 single precision or
4 double precision operands.
The vector registers on older processors are 128-bits, and can hold four
single precision, or two double precision operands.
To take advantage of the potential performance improvements offered
by vectorization, all you need to do is compile with the appropriate
compiler flags. Some recommendations are shown in the table below.
Compiler   Vector Flag       AVX Flag              Reporting
Intel      -vec (default)    -avx                  -opt_report -vec_report
PGI        -fast             n/a                   -Minfo=all
GNU        -O3               -mavx (C/C++ only)    -ftree-vectorizer-verbose=1
Vectorization is performed on eligible loops. Note that
not all loops can be vectorized. A number of factors can prevent the
compiler from vectorizing a loop, such as:
Function calls
I/O operations
GOTOs in or out of the loop
Recurrences
Data dependence, such as needing a value from a previous
loop iteration
Complex coding (difficult loop analysis)
To view/confirm that a loop has been vectorized, you can
generate an assembler file and look at the instructions used.
For example:
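A sketch of how this might be done (myloop.c is a hypothetical source
file):

    ##### Generate an assembler file (myloop.s) instead of an object file
    icc -O2 -S myloop.c
    ##### Look for packed vector instructions, e.g. mulpd/addpd for SSE2
    ##### or the "v"-prefixed forms (vmulpd, vaddpd) for AVX
    grep -E 'mulpd|addpd|vmulpd|vaddpd' myloop.s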
On Intel processors, hyper-threading enables 2 hardware threads per
core. Hyper-threading is Intel's terminology for a more general technique
known as simultaneous multi-threading (SMT), which enables more than one
hardware thread to run concurrently on one core.
LC's Linux clusters have Intel's hyper-threading feature enabled in the
hardware, but it is turned "off" by default. To use this feature, you
must explicitly turn hyper-threading on using the SLURM
--enable-hyperthreads option of the srun command.
Example: on a 16-core node, check to see if hyper-threading is
turned on - it isn't by default, so only 16 cores are shown. Then
execute the same command using srun's
--enable-hyperthreads flag. Now 32 "cores" appear.
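A sketch of what this check might look like (the exact commands of the
original example are not shown; pdebug partition assumed):

    ##### Count the logical CPUs visible to a job step - 16 are reported
    srun -N1 -ppdebug grep -c "^processor" /proc/cpuinfo
    ##### Repeat with hyper-threads enabled - now 32 are reported
    srun -N1 -ppdebug --enable-hyperthreads grep -c "^processor" /proc/cpuinfo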
Note: If using mxterm to acquire your job's nodes, you
will need to enable hyperthreads by using the
srun --enable-hyperthreads flag when you launch your job
in the mxterm window.
Hyper-threading benefits some codes more than others. Tests performed
on some LC codes (pF3D, IMC, Ares) showed improvements in the 10-30% range.
Your mileage may vary.
mpifind - reports on how processes and threads are bound to cores.
Documentation at https://lc.llnl.gov/confluence/display/TLCC2/mpifind
(requires authentication).
Example output (not reproduced here) for a 4-process job, each process
with 4 threads, compares the bindings with OMP_PROC_BIND unset versus
OMP_PROC_BIND=TRUE.
Debuggers
This section only touches on selected highlights. For more information,
users will need to consult the relevant documentation mentioned below.
Also, please consult the "Supported Software and Computing Tools" web
page located at
computing.llnl.gov/code/content/software_tools.php.
TotalView:
TotalView is probably the most widely used debugger for parallel programs.
It can be used with C/C++ and Fortran programs and supports all common
forms of parallelism, including pthreads, OpenMP, MPI, accelerators and
GPUs.
Starting TotalView for serial codes: simply issue the command:
totalview myprog
Starting TotalView for interactive parallel jobs:
Some special command line options are required to run a parallel job
through TotalView under SLURM. You need to run srun under
TotalView,
and then specify the -a flag followed by 1) srun options, 2) your
program, and 3) your program flags (in that order). The general
syntax is:
totalview srun -a -n #processes -p pdebug myprog [prog args]
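For instance, to debug a hypothetical 16-task job in the pdebug
partition (myprog is an illustrative program name):

    totalview srun -a -n16 -ppdebug myprog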
To debug an already running interactive parallel job, simply issue the
totalview command and then attach
to the srun process that started the job.
Debugging batch jobs is covered in LC's
TotalView tutorial and in the "Debugging in Batch"
section below.
DDT stands for "Distributed Debugging Tool", a product of Allinea Software
Ltd.
DDT is a comprehensive graphical debugger designed specifically for
debugging complex parallel codes. It is supported on a variety of
platforms for C/C++ and Fortran, and can be used to debug multi-process
MPI programs and multi-threaded programs, including OpenMP.
Currently, LC has a limited number of fixed and floating licenses for OCF and
SCF Linux machines.
Local copies of documentation under /usr/local/docs/ddt
STAT - Stack Trace Analysis Tool:
The Stack Trace Analysis Tool gathers and merges stack traces from a
parallel application's processes.
STAT is particularly useful for debugging hung programs.
It produces call graphs: 2D spatial and 3D spatial-temporal.
The 2D spatial call prefix tree represents a single snapshot of the
entire application.
The 3D spatial-temporal call prefix tree represents a series of
snapshots from the application taken over time.
In these graphs, the nodes are labeled by function names. The directed
edges, showing the calling sequence from caller to callee, are labeled by
the set of tasks that follow that call path. Nodes that are visited by the
same set of tasks are assigned the same color, giving a visual reference to
the various equivalence classes.
This tool should be in your default path as:
/usr/local/bin/stat-gui - GUI
/usr/local/bin/stat-cl - command line
/usr/local/bin/stat-view - viewer for DOT format
output files
Several other common debuggers are available on LC Linux clusters, though
they are not recommended for parallel programs when compared to TotalView
and DDT.
DDD: GNU DDD debugger is a graphical front-end for command-line debuggers
such as GDB, DBX, WDB, Ladebug, JDB, XDB, the Perl debugger, the bash
debugger, or the Python debugger. Documentation:
http://www.gnu.org/software/ddd
Debugging in Batch: mxterm:
Debugging batch parallel jobs on LC production clusters
is fairly straightforward. The main idea is to submit a batch
job that gets your partition allocated and running.
Once you have your partition, you can log in to any of the nodes within
it and then start running as though you're in the interactive pdebug
partition.
For convenience, LC has developed a utility called
mxterm which makes the process even easier.
How to use mxterm:
If you are on a Windows PC, start your X11 application (such as
X-Win32)
Make sure you enable X11 tunneling for your ssh session
ssh and login to your cluster
Issue the command as follows:
mxterm #nodes #tasks #minutes
Where:
#nodes = number of nodes your job requires
#tasks = number of tasks your job requires
#minutes = how long you need to keep your partition for debugging
This will submit a batch job for you that will open an xterm when it
starts to run.
After the xterm appears, cd to the directory with
your source code and begin your debug session.
This utility does not have a man page; however, you can view the usage
information by simply typing the name of the command.
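For example, to request 2 nodes and 32 tasks for 30 minutes (the values
are illustrative only):

    mxterm 2 32 30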
Core Files:
Your shell's core file size setting may limit the size of a core file
so that it is inadequate for debugging, especially with TotalView.
To check your shell's limit settings, use either the
limit (csh/tcsh) or
ulimit -a (sh/ksh/bash) command. For example:
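A sketch of the two commands (output varies by shell and system):

    ##### csh/tcsh: show the current core file size limit
    limit coredumpsize
    ##### sh/ksh/bash: show all limits, including core file size
    ulimit -a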
To override your default core file size setting, use one of the following
commands:
csh/tcsh:      unlimit   -or-   limit coredumpsize 64
sh/ksh/bash:   ulimit -c 64
Some users with many-process jobs actually don't want core files, or
only want small core files, because normal core files can fill up their
disk space. The limit (csh/tcsh) or ulimit -c (sh/ksh/bash) commands can
be used as shown above to set smaller or zero sizes.
A Few Additional Useful Debugging Hints:
Add the sinfo and
squeue commands to your
batch scripts to assist in diagnosing problems. In particular, these
commands will document which nodes your job is using.
Also add the -l option to your
srun command so that
output statements are prepended with the task number that created them.
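A sketch of how these hints might appear in a batch script (a.out and
the node/task counts are hypothetical):

    #!/bin/csh
    #MSUB -l nodes=2
    ##### Document the state of the nodes and queue for later diagnosis
    sinfo
    squeue
    ##### Prefix each output line with the task number that wrote it
    srun -l -n32 a.out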
Be sure to check the exit status of all I/O operations when reading
or writing files in Lustre. This will allow you to detect any I/O problems
with the underlying OST servers.
If you know/suspect that there are problems with particular nodes,
you can use the srun -x option to skip these nodes.
For example:
srun -N12 -x "cab40 cab41" -ppdebug myjob
Tools
We Need a Book!
The subject of "Tools" for Linux cluster applications is far too broad and
deep to cover here. Instead, a few pointers are being provided for those
who are interested in further research.
The first place to check is LC's "Supported Software and Computing Tools"
web pages at:
computing.llnl.gov/code/content/software_tools.php
for what may be available here. Some example tools are listed below.
Memory Correctness Tools:
Memcheck
Valgrind's Memcheck tool detects a comprehensive set of memory errors, including reads and writes of unallocated or freed memory and memory leaks.
TotalView
Allows you to stop execution when heap API problems occur, list memory leaks, paint allocated and deallocated blocks, identify dangling pointers, hold onto deallocated memory, graphically browse the heap, identify the source line and stack backtrace of an allocation or deallocation, summarize heap use by routine, filter and dump heap information, and review memory usage by process or by library.
memP
The memP tool provides heap profiling through the generation of two reports: a summary of the heap high-water-mark across all processes in a parallel job as well as a detailed task-specific report that provides a snapshot of the heap memory currently in use, including the amount allocated at specific call sites.
Intel Inspector
Primarily a thread correctness tool, but memory debugging features are included.
Profiling, Tracing and Performance Analysis:
Open|SpeedShop
Open|SpeedShop is a comprehensive performance tool set with a unified look and feel that covers most important performance analysis steps. It offers various different interfaces, including a flexible GUI, a scripting interface, and a Python class. Supported experiments include profiling using PC sampling, inclusive and exclusive execution time analysis, hardware counter support, as well as MPI, I/O, and floating point exception tracing. All analysis is applied on unmodified binaries and can be used on codes with MPI and/or thread parallelism.
TAU
TAU is a robust profiling and tracing tool from the University of Oregon that includes support for MPI and OpenMP. TAU provides an instrumentation API, but source code can also be automatically instrumented and there is support for dynamic instrumentation as well. TAU is generally viewed as having a steep learning curve, but experienced users have applied the tool with good results at LLNL. TAU can be configured with many feature combinations. If the features you are interested in are not available in the public installation, please request the appropriate configuration through the hotline. TAU developer response is excellent, so if you are encountering a problem with TAU, there is a good chance it can be quickly addressed.
HPCToolkit
HPCToolkit is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the largest supercomputers. It uses low overhead statistical sampling of timers and hardware performance counters to collect accurate measurements of a program's work, resource consumption, and inefficiency and attributes them to the full calling context in which they occur.
HPCToolkit works with C/C++/Fortran applications that are either statically or dynamically linked. It supports measurement and analysis of serial codes, threaded codes (pthreads, OpenMP), MPI, and hybrid (MPI + threads) parallel codes.
mpiP
A lightweight MPI profiling library that provides time spent in MPI functions by callsite and stacktrace. This tool is developed and maintained at LLNL, so support and modifications can be quickly addressed. New run-time functionality can be used to generate mpiP data without relinking through the srun-mpip and poe-mpip scripts on Linux and AIX systems.
gprof
Displays call graph profile data. The gprof command is useful in identifying how a program consumes CPU resources. Gprof does simple function profiling and requires that the code be built and linked with -pg. For parallel programs, in order to get a unique output file for each process, you will need to set the undocumented environment variable GMON_OUT_PREFIX to some non-null string. For example:
setenv GMON_OUT_PREFIX 'gmon.out.'`/bin/uname -n`
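A sketch of typical gprof usage for a parallel code (myprog.c, the MPI
compiler wrapper, and the task count are illustrative only):

    ##### Build and link with profiling enabled
    mpicc -pg -o myprog myprog.c
    ##### Give each process a unique profile file; glibc appends the PID
    setenv GMON_OUT_PREFIX 'gmon.out.'`/bin/uname -n`
    srun -n4 myprog
    ##### Analyze the resulting profile files (gprof sums multiple files)
    gprof myprog gmon.out.*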
pgprof
PGI profiler - pgprof is a tool which analyzes data generated during execution of specially compiled programs. This information can be used to identify which portions of a program will benefit the most from performance tuning.
PAPI
Portable hardware performance counter library.
PapiEx
A PAPI-based performance profiler that measures hardware performance events of an application without having to instrument the application.
VTune Amplifier
The Intel VTune Amplifier tool is a performance analysis tool for finding hotspots in serial and multithreaded codes. Note the installation on LC machines does not include the advanced hardware analysis capabilities.
Intel Profiler
The Intel Profiler tool is built into the Intel compiler along with a simple GUI to display the collected results.
Vampir / Vampirtrace
Full featured trace file visualizer and library for generating trace files for parallel programs.
Beyond LC:
Beyond LC, the web offers endless opportunities for discovering tools
that aren't available here.
In many cases, users can install tools in their own directories if LC doesn't
have/support them.
Linux Clusters Overview Exercise 2
Parallel Programs and More
Overview:
Login to an LC workshop cluster, if you are not already logged in
Build and run parallel MPI programs
Build and run parallel Pthreads programs
Build and run parallel OpenMP programs
Build and run parallel benchmarks
Build and run an MPI message passing bandwidth test
Photos/Graphics: Have been created by the author,
created by other LLNL employees, obtained from non-copyrighted sources,
or used with the permission of authors from other presentations and
web pages.