SHMEM refers to Cray's shared memory access library. First, let's clarify that the T3E is a distributed-memory, shared-nothing parallel computer, despite the reference to shared memory in the library's name. What the SHMEM library provides is access to the T3E hardware's capability to let one processor read and write the memory of another processor without that processor's cooperation. This is called active messaging. The use of this hardware via the SHMEM calls is the subject of this paper.
Active messaging via SHMEM contrasts with message passing via MPI or PVM, in which a message is sent in a two-step process: the source processor makes a call to send the data, and the remote processor makes a call to receive it. Cooperation is required between the two processors in that the destination processor must make a library call to accept the data before using it.
With SHMEM active messaging, the sending of data involves only one CPU in that the source processor simply puts that data into the memory of the destination processor. Likewise, a processor can read data from another processor's memory without interrupting the remote CPU. The remote processor is not made aware that its memory has been read or written, unless the programmer implements a mechanism to accomplish this.
On the T3E, support circuitry is provided to read, store, and write data during active messages. Subroutines in the SHMEM library use this hardware at a low level to provide an efficient, low-overhead communication mechanism.
In short, you will want to use the SHMEM library only if your program cannot be made to perform adequately using a portable communications library such as MPI or PVM. Once you have optimized single-processor performance and I/O performance to suit your needs, determine whether your message-passing overhead is small enough by doing a scaling analysis.
If your program does not scale to use a significant number of processors, you might be able to improve scaling by switching MPI or PVM calls to SHMEM calls, thereby reducing parallel overhead. Normally this is done only for the communications-intensive parts of your calculation. You don't have to convert all communications to SHMEM, just the components that need to perform more efficiently.
The most obvious benefit of using SHMEM, as compared to MPI or PVM, is that the SHMEM calls have significantly lower startup latencies and higher bandwidths. When used in a communication-intensive part of a program, SHMEM might result in better scaling of that portion of the calculation.
Note that MPI is implemented using SHMEM calls, and so the performance improvements come from incurring less overhead, such as that from message packetization. MPI and PVM can both be optimized somewhat by controlling the environment variables that specify the amount of memory buffer space available. Nevertheless, you may find significant performance improvements by using SHMEM if your program is communications-intensive.
Another benefit is that, with active messages, the remote processor is not interrupted during its calculation. Keep in mind that a modern RISC processor can easily perform thousands of floating-point operations in the time it takes to transmit even a small message! By letting the support circuitry handle the data transfer for the remote processor, you decrease parallel overhead.
Finally, the support circuitry provides 512 E-registers that can be used to store and retrieve data, potentially with a significant performance improvement over the secondary cache. There are SHMEM calls that use these registers, but the simplest way to use them is through the compiler directives
    !dir$ cache_bypass array-name       (Fortran)
    #pragma cache_bypass array-name     (C)
These directives instruct the compiler to generate code that accesses the array named in the directive through the user E-registers, bypassing the cache.
The main problem with using SHMEM is that the library exists only on Cray MPP systems. MPI-2 will have support for active messages, but it remains to be seen how easily SHMEM calls will map to the new MPI-2 calls, and how well the MPI-2 active messages will perform in comparison with SHMEM.
Another difficulty with active messages is that the remote processor is not notified when its memory has been read or written. It is up to the programmer to use a message or other type of indicator to alert a processor that new data is available or that old data has been consumed and can now be overwritten.
If you used SHMEM on a T3D system, you will no doubt find the use of SHMEM simpler on the T3E. The main reason is that programmers no longer have to provide for cache coherence when using the put operation. Another reason is that the put and get operations now have very similar performance, and it is no longer important to use only put instead of get operations.
To use SHMEM calls for data transfer, you must know the address of a variable on the remote processor. Several types of variables are guaranteed to have the same address on different processors. Such variables are called symmetric variables.
The following are guaranteed to be symmetric variables:

- Fortran variables in COMMON blocks and variables given the SAVE attribute
- C global variables and variables declared static
- memory allocated with the symmetric allocation routines, shmalloc (C) and shpalloc (Fortran)
Variables that are stack allocated, such as local variables in a subroutine, or dynamically allocated with malloc, are not guaranteed to occupy the same address on different processors. These variables, therefore, are not useful in SHMEM calls.
In the Fortran calling sequences below, source and target variables are declared as REAL variables, but they may be any 64-bit type. (There are also 32-bit versions of some of the SHMEM calls. See man shmem for details.)
Most of the SHMEM calls use a work array, called pSync or pWrk below, and the size of this array varies by call. Check the man pages for the sizes you need to allocate.
shmem_barrier_all implements a barrier. All processors must call shmem_barrier_all, and once all have made the call, all processors may proceed. (See also shmem_barrier, which implements a barrier on a subset of the processors.)
    shmem_barrier_all();        (C)
    CALL SHMEM_BARRIER_ALL      (Fortran)
shmem_put writes data into the memory of a remote processor. It does not require cooperation from the remote processor to complete the task. Cache memory coherence is provided by the system.
    int shmem_put(target, source, len, pe);        (C)
    long *target, *source;
    int len, pe;

    INTEGER SHMEM_PUT(target, source, len, pe)     (Fortran)
    REAL target, source
    INTEGER len, pe
shmem_get reads data from the memory of a remote processor. It does not require cooperation from the remote processor to complete the task.
    int shmem_get(target, source, len, pe);        (C)
    long *target, *source;
    int len, pe;

    INTEGER SHMEM_GET(target, source, len, pe)     (Fortran)
    REAL target, source
    INTEGER len, pe
shmem_broadcast broadcasts data from one processor to some or all of the other processors. The arguments to the call describe which processors participate, and each of these processors must make the call to shmem_broadcast or the program will hang. The root processor sends the data and all other participating processors receive it.
    void shmem_broadcast(target, source, len, PE_root, PE_start, logPE_stride, PE_size, pSync);     (C)
    long *target, *source, *pSync;
    int len, PE_root, PE_start, logPE_stride, PE_size;

    CALL SHMEM_BROADCAST(target, source, len, PE_root, PE_start, logPE_stride, PE_size, pSync)      (Fortran)
    REAL target, source, pSync
    INTEGER len, PE_root, PE_start, logPE_stride, PE_size
shmem_float_sum_to_all (C) and SHMEM_REAL8_SUM_TO_ALL (Fortran) perform global sums. Each participating processor contributes a value, and the values from all processors are summed to form the global sum. The contribution from each processor can be a scalar value (nreduce=1) or an array.
    void shmem_float_sum_to_all(target, source, nreduce, PE_start, logPE_stride, PE_size, pWrk, pSync);     (C)
    float *target, *source, *pWrk;
    long *pSync;
    int nreduce, PE_start, logPE_stride, PE_size;

    CALL SHMEM_REAL8_SUM_TO_ALL(target, source, nreduce, PE_start, logPE_stride, PE_size, pWrk, pSync)      (Fortran)
    REAL target, source, pWrk
    INTEGER nreduce, PE_start, logPE_stride, PE_size, pSync
Many other calls are provided in the SHMEM library. See the man page on shmem for more information.
A simple example program shmem.f is available on the T3E in /usr/local/examples.