File Systems

File System Options

Data files used by the applications running on the cluster may be stored in a variety of locations, including:

  • On the master node, shared with the compute nodes via NFS

  • On the local disks of individual compute nodes

  • On a parallel file system distributed across multiple nodes

The simplest approach is to store all files on the master node and share them using the standard Network File System (NFS). Any files in your /home directory are shared via NFS with all the nodes in your cluster. This makes file management very simple, but in larger clusters the performance of NFS on the master node can become a bottleneck for I/O-intensive applications. If you are planning a large cluster, you should include disk drives that are separate from the system disk to hold your shared files; for example, place /home on a separate pair of RAID1 disks in the master node. A more scalable solution is to use a dedicated NFS server with a properly configured storage system for all shared files and programs, or a high-performance NAS appliance.
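
For example, a shared directory can be exported from the master node (or a dedicated NFS server) and mounted on the clients. The sketch below is generic and illustrative; the network address, export options, and the mechanism ClusterWare uses to mount file systems on compute nodes vary by site.

# On the NFS server, export the shared directory (example /etc/exports entry):
#   /home   10.1.0.0/255.255.255.0(rw,sync)
exportfs -ra                  # re-read /etc/exports
showmount -e localhost        # verify that the export is visible

# On a client, mount the share (server name and paths are illustrative):
mount -t nfs master:/home /home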

Storing files on the local disk of each node removes the performance problem, but makes it difficult to share data between tasks on different nodes. Input files for programs must be distributed manually to each of the nodes, and output files from the nodes must be manually collected back on the master node. This mode of operation can still be useful for temporary files created by a process and then later reused on that same node.
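
As a sketch of this mode of operation, the ClusterWare bpcp command (which copies files to and from compute nodes using an rcp-style node:path syntax) can be used to stage and collect files; the node numbers, paths, and file names below are illustrative.

# Distribute the input file to the local disk of each node before the run:
for node in 0 1 2 3; do
    bpcp input.dat $node:/tmp/input.dat
done

# ...run the job...

# Collect the per-node output files back to the master node afterward:
for node in 0 1 2 3; do
    bpcp $node:/tmp/output.dat ./output.$node
done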

An alternate solution is to use a parallel file system, which provides an interface much like a network file system, but distributes files across disks on more than one node. Scyld provides a version of PVFS, the Parallel Virtual File System, but other commercial parallel file systems work with ClusterWare as well.

PVFS

The Parallel Virtual File System (PVFS) allows applications, both serial and parallel, to store and retrieve data that is distributed across a set of I/O servers. This is done through traditional file I/O semantics: you can open, close, read, write, and seek in PVFS files, just as you can in locally stored files.
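
Because PVFS presents ordinary file semantics, unmodified tools work on PVFS files. The sketch below assumes a PVFS file system mounted at /pvfs; the file names and sizes are illustrative.

# Write a 1 MB file into PVFS using ordinary write() calls:
dd if=/dev/zero of=/pvfs/scratch.dat bs=64k count=16

# Seek 512 KB into the same file and read back a single 64 KB chunk:
dd if=/pvfs/scratch.dat of=chunk.dat bs=64k skip=8 count=1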

The primary goal of PVFS is to provide a high performance "global scratch space" for Scyld clusters running parallel applications. PVFS will "stripe" files across the disks of the nodes in your cluster, resulting in file access faster than that of a single disk.

Within your cluster, any given node may take on one or more of the following roles:

  • Metadata server (management node) — maintains the information describing the files stored in PVFS

  • I/O server — stores PVFS file data on its local disk

  • Client — runs applications that access PVFS files

The following figure shows the PVFS system view, including a metadata server (indicated as the "Management Node" or "MGR"), a number of I/O servers ("IONi") each with a local disk, and a set of clients ("CNi").

Figure 1. PVFS System Diagram

The master node can be configured as the metadata server, and the other nodes as both clients and I/O servers. This allows you to run parallel jobs accessing PVFS files from any node, striping these files across all the cluster compute nodes.

A PVFS file system appears to a user much as any other file system. Once the system administrator mounts the PVFS file system on your local directory tree, you can cd into the directory; list the files in the directory with ls; and copy, move, or delete files with cp, mv, or rm.
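
For example, once the file system is mounted (assumed here at /pvfs), everyday operations look exactly as they would on a local file system; the file names are illustrative.

cd /pvfs
ls -l
cp /home/user/data.in .      # the copied file is striped with default parameters
mv data.in run1.in
rm run1.in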

Copying Files to PVFS

When you use a standard Unix command, such as cp, to copy files into a PVFS directory, PVFS provides a default striping of your data across the I/O servers. The u2p command, supplied with PVFS, copies an existing Unix file to a PVFS file system while letting you specify the physical distribution parameters, specifically the following:

  • base — the index of the starting I/O node, with 0 being the first file system node

  • pcount — partition count (a bit of a misnomer), the number of I/O servers on which data will be stored

  • ssize — stripe size, the size of the contiguous chunks stored on I/O servers

The u2p command is most useful for converting pre-existing data files to PVFS so that they can be used in parallel programs. The syntax for u2p is as follows:

u2p -s <stripe size> -b <base> -n <# of nodes> <srcfile> <destfile> 
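
For example, to copy an existing file into PVFS with 64 KB stripes spread across four I/O servers starting at server 0 (the values and file names are purely illustrative):

u2p -s 65536 -b 0 -n 4 /home/user/data.in /pvfs/data.in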

The following illustration shows a PVFS file system with a base node of 0 and a pcount of 4.

Figure 2. Striping Example
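
As a rough sketch of the round-robin layout pictured above (base 0, pcount 4, ssize 65536), the following shell arithmetic computes which I/O server in the stripe set holds a given byte offset; this is only an illustration of the striping rule, not a PVFS utility.

OFFSET=200000; SSIZE=65536; PCOUNT=4; BASE=0
# Byte 200000 lies in stripe 3, which falls on I/O server (3 % 4) + 0 = 3:
echo $(( (OFFSET / SSIZE) % PCOUNT + BASE ))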

Examining File Distributions

pvstat will show the physical distribution parameters for a PVFS file.

In the following example, for a file named foo in the PVFS file system mounted at /pvfs, pvstat tells us that foo has a stripe size of 64k and is currently striped among 8 I/O servers, beginning at server 0:

[user@cluster username]$ /usr/local/bin/pvstat /pvfs/foo
/pvfs/foo: base = 0, pcount = 8, ssize = 65536

Checking on Server Status

The iod-ping utility determines the state of a given I/O server. In the following example, an I/O server has been started on node 1 and is reported as responding; since no I/O server is running on the master node, the master is reported as down:

[user@cluster username]$ /usr/local/bin/iod-ping -h 1 -p 7000
1:7000 is responding.
[user@cluster username]$ /usr/local/bin/iod-ping -h head -p 7000
head:7000 is down.

Likewise, the mgr-ping utility is used to check the status of metadata servers. In the following example, the response shows a metadata server responding on the master node, but not on compute node 1:

[user@cluster username]$ /usr/local/bin/mgr-ping -h head -p 3000
head:3000 is responding.
[user@cluster username]$ /usr/local/bin/mgr-ping -h 1 -p 3000
1:3000 is down.

These two utilities also set their exit values appropriately for use with shell scripts; 0 is used for success (responding), and 1 is used on failure (down).
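
As a sketch, a shell script can test these exit values directly; the node numbers and port below are illustrative.

#!/bin/sh
# Build a list of compute nodes whose I/O servers answer iod-ping.
alive=""
for node in 0 1 2 3; do
    if /usr/local/bin/iod-ping -h $node -p 7000 > /dev/null; then
        alive="$alive $node"
    fi
done
echo "I/O servers responding on nodes:$alive"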

When no command-line parameters are passed, both programs check for a server on localhost (the local machine) at the default port: 7000 for an I/O server and 3000 for a metadata server. Use the -h option to check a different host and the -p option to check a different port.