Running Parallel Programs

An Introduction to Parallel Programming APIs

Programmers are generally familiar with serial, or sequential, programs. Simple programs — like "Hello World" and the basic suite of searching and sorting programs — are typical of sequential programs. They have a beginning, an execution sequence, and an end; at any time during the run, the program is executing only at a single point.

A thread is similar to a sequential program, in that it also has a beginning, an execution sequence, and an end. At any time while a thread is running, there is a single point of execution. A thread differs in that it isn't a stand-alone program; it runs within a program. The concept of threads becomes important when a program has multiple threads running at the same time and performing different tasks.

To run in parallel means that more than one thread of execution is running at the same time, often on different processors of one computer; in the case of a cluster, the threads are running on different computers. A few things are required to make parallelism work and be useful: The program must migrate to another computer or computers and get started; at some point, the data upon which the program is working must be exchanged between the processes.

The simplest case is when the same single-process program is run with different input parameters on all the nodes, and the results are gathered at the end of the run. Using a cluster to get faster results of the same non-parallel program with different inputs is called parametric execution.

A much more complicated example is a simulation, where each process represents some number of elements in the system. Every few time steps, all the elements need to exchange data across boundaries to synchronize the simulation. This situation requires a message passing interface or MPI.

To solve these two problems — program startup and message passing — you can develop your own code using POSIX interfaces. Alternatively, you could utilize an existing parallel application programming interface (API), such as the Message Passing Interface (MPI) or the Parallel Virtual Machine (PVM). These are discussed in the sections that follow.

MPI

The Message Passing Interface (MPI) application programming interface is currently the post popular choice for writing parallel programs. The MPI standard leaves implementation details to the system vendors (like Scyld). This is useful because they can make appropriate implementation choices without adversely affecting the output of the program.

A program that uses MPI is automatically started a number of times and is allowed to ask two questions: How many of us (size) are there, and which one am I (rank)? Then a number of conditionals are evaluated to determine the actions of each process. Messages may be sent and received between processes.

The advantages of MPI are that the programmer:

  • Doesn't have to worry about how the program gets started on all the machines

  • Has a simplified interface for inter-process messages

  • Doesn't have to worry about mapping processes to nodes

  • Abstracts the network details, resulting in more portable hardware-agnostic software

Also see the section on running MPI-aware programs later in this chapter. Scyld ClusterWare includes several implementations of MPI, including the following:

MPICH. Scyld ClusterWare includes MPICH, a freely-available implementation of the MPI standard. MPICH is a project managed by Argonne National Laboratory and Mississippi State University. For more information on MPICH, visit the MPICH web site at http://www-Unix.mcs.anl.gov/mpi/mpich. Scyld MPICH is modified to use BProc and Scyld job mapping support; see the section on job mapping later in this chapter.

MVAPICH. MVAPICH is an implementation of MPICH for Infiniband interconnects. As part of our Infiniband product, Scyld ClusterWare provides a version of MVAPICH, modified to use BProc and Scyld job mapping support; see the section on job mapping later in this chapter. For information on MVAPICH, visit the MVAPICH web site at http://nowlab.cse.ohio-state.edu/projects/mpi-iba

LAM. Local Area Multicomputer (LAM) is another open-source implementation of the Message Passing Interface (MPI) specification, intended for both production and research use. LAM includes a rich set of features for system administrators, parallel programmers, application users, and parallel computing researchers. For more information on LAM, visit the LAM-MPI web site at http://www.lam-mpi.org. Scyld ClusterWare provides a version of LAM, modified to use BProc and Scyld job mapping support; see the section on job mapping later in this chapter. Also see the section on running LAM-aware programs later in this chapter

Other MPI Implementations. Various commercial MPI implementations also run on Scyld ClusterWare. See the Scyld MasterLink site at http://www.penguincomputing.com/ScyldSupport. You can also workload and build your own version of MPI, and configure it to run on Scyld ClusterWare.

PVM

Parallel Virtual Machine (PVM) was an earlier parallel programming interface. Unlike MPI, it is not a specification but a single set of source code distributed on the Internet. PVM reveals much more about the details of starting your job on remote nodes. However, it fails to abstract implementation details as well as MPI does.

PVM is deprecated, but is still in use by legacy code. We generally advise against writing new programs in PVM, but some of the unique features of PVM may suggest its use.

Also see the section on running PVM-aware programs later in this chapter.

Custom APIs

As mentioned earlier, you can develop you own parallel API by using various Unix and TCP/IP standards. In terms of starting a remote program, there are programs written:

  • Using the rexec function call

  • To use the rexec or rsh program to invoke a sub-program

  • To use Remote Procedure Call (RPC)

  • To invoke another sub-program using the inetd super server

These solutions come with their own problems, particularly in the implementation details. What are the network addresses? What is the path to the program? What is the account name on each of the computers? How is one going to load-balance the cluster?

Scyld ClusterWare, which doesn't have binaries installed on the cluster nodes, may not lend itself to these techniques. We recommend you write your parallel code in MPI. That having been said, we can say that Scyld has some experience with getting rexec() calls to work, and that one can simply substitute calls to rsh with the more cluster-friendly bpsh.

Mapping Jobs to Compute Nodes

Running programs specifically designed to execute in parallel across a cluster requires at least the knowledge of the number of processes to be used. Scyld ClusterWare uses the NP environment variable to determine this. The following example will use 4 processes to run an MPI-aware program called a.out, which is located in the current directory.

[user@cluster username]$ NP=4 ./a.out

Note that each kind of shell has its own syntax for setting environment variables; the example above uses the syntax of the Bourne shell (/bin/sh or /bin/bash).

What the example above does not specify is which specific nodes will execute the processes; this is the job of the mapper. Mapping determines which node will execute each process. While this seems simple, it can get complex as various requirements are added. The mapper scans available resources at the time of job submission to decide which processors to use.

Scyld ClusterWare includes beomap, a mapping API (documented in the Programmer's Guide with details for writing your own mapper). The mapper's default behavior is controlled by the following environment variables:

You can use the beomap program to display the current mapping for the current user in the current environment with the current resources at the current time. See the Reference Guide for a detailed description of beomap and its options, as well as examples for using it.

Running MPI-Aware Programs

MPI-aware programs are those written to the MPI specification and linked with the Scyld MPICH library. This section discusses how to run MPI-aware programs and how to set mapping parameters from within such programs. Examples of running MPI-aware programs are also provided.

For information on building MPI programs, see the Programmer's Guide.

mpirun

Almost all implementations of MPI have an mpirun program, which shares the syntax of mpprun, but which boasts of additional features for MPI-aware programs.

In the Scyld implementation of mpirun, all of the options available via environment variables or flags through directed execution are available as flags to mpirun, and can be used with properly compiled MPI jobs. For example, the command for running a hypothetical program named my-mpi-prog with 16 processes:

[user@cluster username]$ mpirun -np 16 my-mpi-prog arg1 arg2

is equivalent to running the following commands in the Bourne shell:

[user@cluster username]$ export NP=16
my-mpi-prog arg1 arg2

Setting Mapping Parameters from Within a Program

A program can be designed to set all the required parameters itself. This makes it possible to create programs in which the parallel execution is completely transparent. However, it should be noted that this will work only with Scyld ClusterWare, while the rest of your MPI program should work on any MPI platform.

Use of this feature differs from the command line approach, in that all options that need to be set on the command line can be set from within the program. This feature may be used only with programs specifically designed to take advantage of it, rather than any arbitrary MPI program. However, this option makes it possible to produce turn-key application and parallel library functions in which the parallelism is completely hidden.

Following is a brief example of the necessary source code to invoke mpirun with the -np 16 option from within a program, to run the program with 16 processes:

/* Standard MPI include file */
# include <mpi.h>

main(int argc, char **argv) {
        setenv("NP","16",1); // set up mpirun env vars
        MPI_Init(&argc,&argv); 
        MPI_Finalize();
}

More details for setting mapping parameters within a program are provided in the Programmer's Guide.

Examples

The examples in this section illustrate certain aspects of running a hypothetical MPI-aware program named my-mpi-prog.

Example 7. Specifying the Number of Processes

This example shows a cluster execution of a hypothetical program named my-mpi-prog run with 4 processes:

[user@cluster username]$ NP=4 ./my-mpi-prog

An alternative syntax is as follows:

[user@cluster username]$ NP=4
export NP
./my-mpi-prog

Note that the user specified neither the nodes to be used nor a mechanism for migrating the program to the nodes. The mapper does these tasks, and jobs are run on the nodes with the lowest CPU utilization.

Example 8. Excluding Specific Resources

In addition to specifying the number of processes to create, you can also exclude specific nodes as computing resources. In this example, we run my-mpi-prog again, but this time we not only specify the number of processes to be used (NP=6), but we also exclude of the master node (NO_LOCAL=1) and some cluster nodes (EXCLUDE=2:4:5) as computing resources.

[user@cluster username]$ NP=6 NO_LOCAL=1 EXCLUDE=2:4:5 ./my-mpi-prog

Running LAM-Aware Programs

LAM-aware programs are those written to the LAM-MPI specification. This section provides information needed to use programs with LAM-MPI as implemented in Scyld ClusterWare.

Pre-Requisites to Running LAM

A number of commands, such as mpicc, are duplicated between MPICH and LAM. To be sure that you are accessing the LAM commands and documentation, rather than the MPICH commands and documentation, for these duplicates:

  • Set LD_LIBRARY_PATH to /usr/lib64/LAM/gnu:$LD_LIBRARY_PATH

    Change "gnu" to "pgi" if desired. It is important to get the correct compiler-specific directory, otherwise the libraries will not be found.

  • Set PATH to /usr/lam/bin/gnu:/usr/lam/bin:$PATH

    Change "gnu" to "pgi" if desired. It is important to get the correct compiler-specific directory, otherwise compilation will fail. It is important to set both of the above directories, and to put both of them before anything else in your PATH

  • Set MANPATH to /usr/lam/share/man:$MANPATH

    This gets you the correct LAM documentation (not the MPICH documentation) for duplicated commands.

Starting and Ending a LAM Session

To start LAM jobs, set the appropriate Scyld Beowulf environment variables, then run the lamboot command. The beomap is "cast in stone" at lamboot time, and will remain set throughout your LAM session, until you run lamhalt to end your LAM session.

To run LAM jobs during your LAM session, keep these two things in mind:

  • Because of a current LAM limitation, your job executable must exist on the compute nodes, as it will not automatically be copied over. We recommend running LAM executables from a shared NFS directory, such as /home.

  • Do not forget the "C" in the middle of the mpirun command, such as mpirun C ./mylamjob. This is a LAM-specific syntax required by LAM mpirun, and is very different from MPICH mpirun.

Read the LAM mpirun man page for more information about this syntax.

To end your LAM session, run the lamhalt command when you are finished running your jobs.

After cleanly stopping LAM with lamhalt, you may freely change your beomap if desired, before starting LAM again with lamboot.

Running PVM-Aware Programs

Parallel Virtual Machine (PVM) is an application programming interface for writing parallel applications, enabling a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource. Scyld has developed the Scyld PVM library, specifically tailored to allow PVM to take advantage of the technologies used in Scyld ClusterWare. A PVM-aware program is one that has been written to the PVM specification and linked against the Scyld PVM library.

A complete discussion of cluster configuration for PVM is beyond the scope of this document. However, a brief introduction is provided here, with the assumption that the reader has some background knowledge on using PVM.

You can start the master PVM daemon on the master node using the PVM console, pvm. To add a compute node to the virtual machine, issue an add .# command, where "#" is replaced by a node's assigned number in the cluster.

Tip

You can generate a list of node numbers using either of the beosetup or bpstat commands.

Alternately, you can start the PVM console with a hostfile filename on the command line. The hostfile should contain a ".#" for each compute node you want as part of the virtual machine. As with standard PVM, this method automatically spawns PVM slave daemons to the specified compute nodes in the cluster. From within the PVM console, use the conf command to list your virtual machine's configuration; the output will include a separate line for each node being used. Once your virtual machine has been configured, you can run your PVM applications as you normally would.

Porting Other Parallelized Programs

Programs written for use on other types of clusters may require various levels of change to function with Scyld ClusterWare. For instance:

.

For more information on porting applications, see the Programmer's Guide