This chapter describes how to run both serial and parallel jobs with Scyld ClusterWare, and how to monitor the status of the cluster once your applications are running. It begins with a brief discussion of program execution concepts, including some examples. The discussion then covers running programs that aren't parallelized, running parallel programs (including MPI-aware, LAM-aware, and PVM-aware programs), running serial programs in parallel, job batching, and file systems. Finally, the chapter covers the sample linpack and mpi-mandel programs included with Scyld ClusterWare.
This section compares program execution on a stand-alone computer and a Scyld cluster. It also discusses the differences between running programs on a traditional Beowulf cluster and a Scyld cluster. Finally, it provides some examples of program execution on a Scyld cluster.
On a stand-alone computer running Linux, Unix, or most other operating systems, executing a program is a very simple process. For example, to generate a list of the files in the current working directory, you open a terminal window, type the command ls, and press the [return] key. Pressing [return] causes the command shell (a program that listens to and interprets commands entered in the terminal window) to start the ls program (stored at /bin/ls). The program's output is directed to the standard output stream, which appears in the same window where you typed the command.
A Scyld cluster isn't simply a group of networked stand-alone computers. Only the master node resembles the computing system with which you are familiar. The compute nodes have only the minimal software components necessary to support an application initiated from the master node. For instance, running the ls command on the master node causes the same series of actions as described above for a stand-alone computer, and the output describes the master node only.
However, running ls on a compute node involves a very different series of actions. Remember that a Scyld cluster has no resident applications on the compute nodes; applications reside only on the master node. For example, to run the ls command on compute node 1, you enter the command bpsh 1 ls on the master node. This command sends ls to compute node 1 via Scyld's BProc software, and the output stream is directed back to the terminal window on the master node where you typed the command.
Some brief examples of program execution are provided in the last section of this chapter. Both BProc and bpsh are covered in more detail in the Administrator's Guide.
A job on a Beowulf cluster is actually a collection of processes running on the compute nodes. In traditional clusters of computers, and even on earlier Beowulf clusters, getting these processes started and running together was a complicated task. Typically, the cluster administrator would need to do all of the following:
Ensure that the user had an account on all the target nodes, either manually or via a script.
Ensure that the user could spawn jobs on all the target nodes. This typically entailed configuring a hosts.allow file on each machine, creating a specialized PAM module (a Linux authentication mechanism), or creating a server daemon on each node to spawn jobs on the user's behalf.
Copy the program binary to each node, either manually, with a script, or through a network file system.
Ensure that each node had available identical copies of all the dependencies (such as libraries) needed to run the program.
Provide knowledge of the state of the system to the application manually, through a configuration file, or through some add-on scheduling software.
With Scyld ClusterWare, most of these steps are removed. Jobs are started on the master node and are migrated out to the compute nodes via BProc. A cluster architecture where jobs may be initiated only from the master node via BProc provides the following advantages:
Users no longer need accounts on remote nodes.
Users no longer need authorization to spawn jobs on remote nodes.
Neither binaries nor libraries need to be available on the remote nodes.
The BProc system provides a consistent view of all jobs running on the system.
With all these complications removed, program execution on the compute nodes becomes a simple matter of letting BProc know about your job when you start it. The method for doing so depends on whether you are launching a parallel program (for example, an MPI job or PVM job) or any other kind of program. See the sections on running parallel programs and running non-parallelized programs later in this chapter.
This section provides a few examples of program execution with Scyld ClusterWare. Additional examples are provided in the sections on running parallel programs and running non-parallelized programs later in this chapter.
Example 1. Directed Execution with bpsh
In the directed execution mode, the user explicitly defines which node (or nodes) will run a particular job. This mode is invoked using the bpsh command, the ClusterWare shell command analogous in functionality to both the rsh (remote shell) and ssh (secure shell) commands. Following are two examples of using bpsh.
This example runs hostname on compute node 0 and writes the output back to the user's screen:
[user@cluster username]$ bpsh 0 /bin/hostname
.0
This example runs the uptime utility on node 0, assuming it is installed in /usr/bin:
[user@cluster username]$ bpsh 0 /usr/bin/uptime
12:56:44 up 4:57, 5 users, load average: 0.06, 0.09, 0.03
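Directed execution extends to several nodes by issuing one bpsh call per node. The following is a minimal sketch rather than a ClusterWare tool: run_on_nodes and DRY_RUN are hypothetical names introduced here for illustration. By default the script only prints the bpsh commands it would issue, so it can be tried on a machine where bpsh is not installed.

```shell
#!/bin/sh
# Sketch: run one command on several compute nodes, one bpsh call per node.
# run_on_nodes and DRY_RUN are illustrative names, not part of ClusterWare.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the bpsh commands; 0 = execute them

run_on_nodes() {
    # $1: space-separated node numbers; remaining arguments: command to run
    nodes=$1
    shift
    for node in $nodes; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "bpsh $node $*"
        else
            bpsh "$node" "$@"
        fi
    done
}

# Print the calls that would run /bin/hostname on nodes 0, 1, and 2.
run_on_nodes "0 1 2" /bin/hostname
```

Setting DRY_RUN=0 on the master node executes the commands instead of printing them. The nodes are visited one after another, so the output arrives in node order.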
Example 2. Dynamic Execution with beorun and mpprun
In the dynamic execution mode, Scyld decides which node is the most capable of executing the job at that moment in time. Scyld includes two parallel execution tools that dynamically select nodes: beorun and mpprun. They differ only in that beorun runs the job on the selected nodes concurrently, while mpprun runs the job sequentially on one node at a time.
The following example shows the difference in the elapsed time to run a command with beorun vs. mpprun:
[user@cluster username]$ date;beorun -np 8 sleep 1;date
Fri Aug 18 11:48:30 PDT 2006
Fri Aug 18 11:48:32 PDT 2006
[user@cluster username]$ date;mpprun -np 8 sleep 1;date
Fri Aug 18 11:48:46 PDT 2006
Fri Aug 18 11:48:54 PDT 2006
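In the listings above, the concurrent beorun run finishes in about 2 seconds, while the sequential mpprun run takes about 8 seconds (eight one-second sleeps, one after another). Instead of comparing two date stamps by eye, the elapsed time can be computed directly. The helper below is a sketch; elapsed_seconds is a hypothetical name, and it assumes a date command that supports the +%s format.

```shell
#!/bin/sh
# Sketch: print the whole wall-clock seconds a command takes to run.
# elapsed_seconds is an illustrative helper name.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null          # run the command, discarding its standard output
    end=$(date +%s)
    echo $((end - start))
}

# On a master node (assumes beorun and mpprun are on PATH):
#   elapsed_seconds beorun -np 8 sleep 1
#   elapsed_seconds mpprun -np 8 sleep 1

# Local demonstration that needs no cluster software.
elapsed_seconds sleep 1
```

Because the timestamps have one-second resolution, the reported value can differ from the true duration by up to a second.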
Example 3. Binary Pre-Staged on Compute Node
A needed binary can be "pre-staged" by copying it to a compute node prior to execution of a shell script. In the following example, the shell script is in a file called test.sh:
######
#! /bin/bash
hostname
#######
[user@cluster username]$ bpcp /bin/hostname 1:/bin/hostname
[user@cluster username]$ bpsh 1 ./test.sh
.1
This makes the hostname binary available on compute node 1 before the script is executed. The shell's $PATH contains /bin, so the compute node searches locally for hostname in $PATH, finds it, and executes it.
Note that copying files to the compute nodes generally puts the files in a RAM disk, thus consuming RAM that might otherwise be used for programs.
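When the same binary is needed on more than one node, the bpcp step can be repeated per node. The sketch below uses only the bpcp form shown above (source path, then node:destination path). The names prestage and DRY_RUN are hypothetical, and the script defaults to printing the commands rather than running them.

```shell
#!/bin/sh
# Sketch: pre-stage one file on several compute nodes with bpcp.
# prestage and DRY_RUN are illustrative names, not part of ClusterWare.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the bpcp commands; 0 = execute them

prestage() {
    # $1: absolute path of the file on the master node
    # remaining arguments: node numbers to copy the file to
    src=$1
    shift
    for node in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "bpcp $src $node:$src"
        else
            # Copy to the same path on the node so $PATH lookups find it.
            bpcp "$src" "$node:$src"
        fi
    done
}

prestage /bin/hostname 1 2
```

Each copy consumes RAM on the target node, as noted above, so it is worth pre-staging only the files a job actually needs.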
Example 4. Binary Migrated to Compute Node
If a binary is not "pre-staged" on a compute node, the full path to the binary must be included in the script in order to execute properly. In the following example, the master node starts the process (in this case, a shell) and moves it to node 1, then continues execution of the script. However, when it comes to the hostname command, the process fails:
######
#! /bin/bash
hostname
#######
[user@cluster username]$ bpsh 1 ./test.sh
/proc/self/fd/3: line 2: hostname: command not found
Since the compute node does not have hostname locally, the shell attempts to resolve the binary by requesting it from the master. Because the request carries only the bare name hostname rather than a path, the master cannot tell which binary to return to the node, and the command fails.
Because there is no way for BProc to know which binaries may be needed by the shell, hostname is not migrated along with the shell during the initial startup. Therefore, it is important to provide the compute node with a full path to the binary:
######
#! /bin/bash
/bin/hostname
#######
[user@cluster username]$ bpsh 1 ./test.sh
.1
With a full path to the binary, the compute node can construct a proper request for the master. The master knows which exact binary to return to the compute node, so the command works as expected.
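One way to find the full path to put in a script is to look the command up on the master node before writing the script. The following sketch uses the standard shell utility command -v for the lookup. The name full_path is hypothetical, and the helper deliberately rejects shell builtins, functions, and aliases, since those do not correspond to a binary the master could return.

```shell
#!/bin/sh
# Sketch: print the absolute path of a command as found via PATH on the
# master node, so a script can call it by full path.
# full_path is an illustrative helper name.
full_path() {
    # Fail if the name cannot be resolved at all.
    resolved=$(command -v "$1") || return 1
    case $resolved in
        /*) echo "$resolved" ;;   # an executable file: print its path
        *)  return 1 ;;           # builtin, function, or alias: reject
    esac
}

# Demonstration with a command that exists on any POSIX system.
full_path sh
```

The printed path can then be written into the script in place of the bare command name, as in the corrected test.sh above.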
Example 5. Process Data Files
Files opened by a process (whether an actual file, a socket, or a named pipe) are not automatically migrated to compute nodes. In the following example, the application BOB needs the data file 1.dat. Unless 1.dat already exists on the compute node, BOB will fail to execute properly.
[user@cluster username]$ bpsh 1 /usr/local/BOB/bin/BOB 1.dat
To keep the process from failing, the necessary data files must be pre-staged on the appropriate compute node(s), or the data files must exist on an NFS-mounted file system. To allow data files to be opened properly, /home is mounted on the compute nodes.
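The choice between the two options can be expressed as a simple path check. This is a sketch under stated assumptions: SHARED_DIRS and needs_prestage are hypothetical names, and the list of directories mounted on the compute nodes (only /home here, following the text above) must be adjusted to match the actual site configuration.

```shell
#!/bin/sh
# Sketch: decide whether a data file must be pre-staged before a bpsh run.
# SHARED_DIRS and needs_prestage are illustrative names; the directory list
# is an assumption and should reflect what is mounted on the compute nodes.
SHARED_DIRS="${SHARED_DIRS:-/home}"

needs_prestage() {
    # $1: absolute path of the data file on the master node.
    # Returns 0 (true) when the path lies outside every shared directory.
    for dir in $SHARED_DIRS; do
        case $1 in
            "$dir"/*) return 1 ;;
        esac
    done
    return 0
}

for f in /home/user/1.dat /tmp/1.dat; do
    if needs_prestage "$f"; then
        echo "$f: pre-stage with bpcp"
    else
        echo "$f: visible on compute nodes via shared mount"
    fi
done
```

The check is purely textual: it does not verify that the file exists or that the directory is really mounted on a given node.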
Example 6. Installing Commercial Applications
Through the course of its execution, the application BOB in the example above does some work with the data file 1.dat, and then later attempts to call /usr/local/BOB/bin/BOB.helper.bin and /usr/local/BOB/bin/BOB.cleanup.bin.
If these binaries are not in the memory space of the process during migration, the calls to these binaries will fail. Therefore, /usr/local/BOB should be NFS mounted to all of the compute nodes, or the binaries should be pre-staged using bpcp to copy them by hand to the compute nodes. The binaries will stay on each compute node until that node is rebooted.
For commercial applications, the administrator should generally have $APP_HOME NFS mounted on the compute nodes that will be involved in execution. A best practice is to mount a common directory such as /opt and install all of the applications into it.