Glossary of Parallel Computing Terms

Bandwidth. The rate at which a network delivers data. Typically expressed in megabits per second (Mbps) for the raw data rate on the physical communication media, or megabytes per second (MBps) for the throughput seen by the application.
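The distinction between Mbps and MBps is a factor of eight, which is a common source of confusion when comparing quoted link speeds to application throughput. A minimal sketch of the conversion (the 1000 Mbps figure below is just an illustrative gigabit link):

```python
def mbps_to_mbytes_per_sec(mbps):
    # 8 bits per byte; this is the theoretical peak only --
    # protocol overhead reduces what the application actually sees.
    return mbps / 8.0

print(mbps_to_mbytes_per_sec(1000.0))  # gigabit link: 125.0 MBps peak
```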

Backplane Bandwidth. The total amount of data that a switch can move through it in a given time. Typically much higher than the bandwidth delivered to a single node.

Bisection Bandwidth. The amount of data that can be delivered from one half of a network to the other half in a given time, through the least favorable halving of the network fabric.
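As a rough sketch, bisection bandwidth can be estimated by counting the links severed by the least favorable cut. The formulas below are illustrative assumptions, not measurements: they assume every link carries the same bandwidth (`link_bw`) and, for the switched case, a fully non-blocking switch:

```python
def ring_bisection(link_bw):
    # Cutting a ring in half always severs exactly 2 links.
    return 2 * link_bw

def switched_star_bisection(nnodes, link_bw):
    # With a non-blocking switch, half the nodes can talk to the
    # other half at full rate, so nnodes/2 links cross the cut.
    return (nnodes // 2) * link_bw

print(ring_bisection(125))             # 250
print(switched_star_bisection(16, 125))  # 1000
```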

Boot Image. The file system and kernel seen by the machine at boot time; contains enough drivers and information to get the system up and running on the network.

Cluster. A collection of nodes, usually dedicated to a single purpose.

Compute Node. Nodes attached to the master node through the interconnection network; used as dedicated attached processors. With Scyld, users never need to log in to compute nodes directly.

Data Parallel. A style of programming in which the same program runs on every node, executing the same instructions while operating on different data.
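A minimal sketch of the pattern, with the "ranks" simulated serially in one process; on a real cluster each rank would be a separate process on its own node (the function and variable names here are purely illustrative):

```python
def spmd_task(rank, nprocs, data):
    # Every rank executes these same instructions, but each one
    # takes a different strided slice of the global data.
    my_slice = data[rank::nprocs]
    return sum(my_slice)

data = list(range(100))
partials = [spmd_task(r, 4, data) for r in range(4)]
total = sum(partials)
print(total)  # 4950
```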

Efficiency. The ratio of a program's actual speedup to its theoretical maximum.
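Since the theoretical maximum speedup on n processors is n, efficiency works out to speedup divided by the processor count. A small sketch using hypothetical timings (100 s serial, 30 s on 4 nodes):

```python
def speedup(t_serial, t_parallel):
    # Ratio of serial to parallel execution time.
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, nprocs):
    # Fraction of the ideal nprocs-fold speedup actually achieved.
    return speedup(t_serial, t_parallel) / nprocs

print(efficiency(100.0, 30.0, 4))  # ~0.83, i.e. 83% of the ideal 4x
```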

FLOPS. Floating-point operations per second. A key measure of performance for many scientific and numerical applications.

Grain Size, Granularity. A measure of the amount of computation a node can perform in a given problem between communications with other nodes. Typically described as "coarse" (large amount of computation) or "fine" (small amount of computation). Granularity is a key factor in determining the performance of a particular problem on a particular cluster.

High Availability. Refers to a level of reliability. Usually implies some degree of fault tolerance (the ability to continue operating in the presence of a hardware failure).

Hub. A device for connecting the NICs in an interconnection network. Only one pair of ports can be active at any time (a shared bus). Modern interconnection networks use switches, not hubs.

Isoefficiency. The ability of a problem to maintain a constant efficiency if the size of the problem scales with the size of the machine.

Jobs. In traditional computing, a job is a single task. A parallel job can be a collection of tasks, all working on the same problem but running on different nodes.

Kernel. The core of the operating system, the kernel is responsible for processing all system calls and managing the system's physical resources.

LAM. The Local Area Multicomputer, a communication library available with MPI or PVM interfaces.

Latency. The length of time from when a bit is sent across the network until the same bit is received. Can be measured for just the network hardware (wire latency) or application-to-application (which includes software overhead).

Local Area Network (LAN). An interconnection scheme designed for short physical distances and high bandwidth. Usually self-contained behind a single router.

MAC Address. On an Ethernet NIC, the hardware address of the card. MAC addresses are unique to the specific NIC, and are useful for identifying specific nodes.

Master Node. The node responsible for interacting with users; connected to both the public network and the interconnection network; controls the compute nodes.

Message Passing. Exchanging information between processes, frequently on separate nodes.

Middleware. A layer of software between the user's application and the operating system.

MPI. The Message Passing Interface, the standard for producing message passing libraries.

MPICH. A commonly used MPI implementation, built on the Chameleon communications layer.

Network Interface Card (NIC). The device through which a node connects to the interconnection network. The performance of the NIC and the network it attaches to limit the amount of communication a parallel program can perform.

Node. Single computer system (motherboard, one or more processors, memory, possibly disk, network interface).

Parallel Programming. The art of writing programs that can execute on many processors simultaneously.

Process. An instance of a running program.

Process Migration. Moving a process from one computer to another after the process begins execution.

PVM. The Parallel Virtual Machine, a common message passing library that predates MPI.

Scalability. The ability of a problem to maintain efficiency as the number of processors in the parallel machine increases.

Single System Image. All nodes in the system see identical system files: the same kernel, libraries, header files, etc., guaranteeing that a program that runs on one node will run on all nodes.

Socket. A low-level construct for creating a connection between processes on remote systems.
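A minimal sketch of the idea using Python's standard-library `socket` module. For brevity this uses `socket.socketpair()`, which creates a connected pair of sockets within one process, standing in for a connection between processes on two different machines:

```python
import socket

# Two connected endpoints; on a cluster these would live in
# separate processes on separate nodes.
a, b = socket.socketpair()
a.sendall(b"ping")   # bytes go in one end...
msg = b.recv(4)      # ...and come out the other
print(msg)  # b'ping'
a.close()
b.close()
```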

Speedup. A measure of the improvement in the execution time of a program on a parallel computer versus its execution time on a serial computer.

Switch. A device for connecting the NICs in an interconnection network. All pairs of ports can communicate simultaneously.

Version Skew. The problem of having more than one version of software or files (kernel, tools, shared libraries, header files) on different nodes.