UCSC Supercomputer Pleiades QuickStart Guide

Contents

  1. System Overview
  2. User Environment
  3. Compiling codes
  4. Running codes
  5. Debugging codes
  6. Visualization & Data Analysis
  7. Using MVAPICH
  8. Using Intel MPI
  9. Frequently Asked Questions

System Overview

UCSC Supercomputer Pleiades Hardware Summary:

System Name:                    Pleiades
Architecture:                   Linux Cluster
Number of Compute Nodes:        207
Number of Processing Cores:     828
Total Memory:                   1.656 TB
Peak Performance:               5.9 TFLOPS
Number of Visualization Nodes:  2 (maia & alcyone)
Number of Storage Nodes:        4
Storage:                        10TB (work), 45TB (Arbeit)
Interconnects:                  Gigabit Ethernet & DDR InfiniBand

Frontend

The frontend to the UCSC Astrophysics Supercomputer Pleiades is pleiades.ucsc.edu (IP: 128.114.126.225). To access Pleiades, one needs to log into this login/management node using ssh with a valid user account.

$ ssh -l username pleiades.ucsc.edu

The frontend is a Dell PowerEdge 2950 server that contains two quad-core Intel Xeon E5345 processors at 2.33 GHz, 16 GB memory, and 440 GB local disks in a RAID-5 array.

Compute nodes

There are 207 compute nodes (Dell PowerEdge 1950) in the Pleiades cluster. Each compute node has two 2.33GHz dual-core Intel Xeon 5148LV processors, 8GB of memory, and an 80GB SATA disk drive. Peak performance for each compute core is 9.3 GFLOPS. Some of the key features of the Intel Xeon 5148LV processor (Woodcrest family of the Intel Core Microarchitecture) are: dual-core, 32-KB L1 instruction cache and 32-KB L1 data cache, 4MB Advanced Smart (shared) L2 cache, 14-stage pipeline, eight pre-fetch units, Macro-Ops Fusion, double-speed integer units, and 40 W max TDP. The memory system uses Fully Buffered DIMMs (FB-DIMMs) and a 1333 MHz (10.7 GB/sec) front side bus.
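The core count, total memory, and per-core peak quoted above can be cross-checked with a little shell arithmetic. This is only a sketch: the figure of 4 double-precision floating-point operations per cycle per core is our assumption (the guide does not state it), and the variable names are made up for illustration.

```shell
# Figures taken from this guide.
nodes=207             # compute nodes
sockets_per_node=2    # processors per node
cores_per_socket=2    # dual-core Xeon 5148LV
mem_per_node_gb=8     # memory per node, in GB
clock_mhz=2330        # 2.33 GHz core clock

# ASSUMPTION (not stated in the guide): 4 double-precision FLOPs
# per cycle per core.
flops_per_cycle=4

total_cores=$(( nodes * sockets_per_node * cores_per_socket ))
total_mem_gb=$(( nodes * mem_per_node_gb ))
core_mflops=$(( clock_mhz * flops_per_cycle ))

echo "cores: $total_cores"
echo "memory: $total_mem_gb GB"
echo "per-core peak: $core_mflops MFLOPS"
```

Under that assumption the per-core value comes out at 9320 MFLOPS, which rounds to the 9.3 GFLOPS quoted above.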

Visualization & Analysis nodes

There are two Visualization & Analysis nodes in the Pleiades cluster: maia & alcyone. Both nodes can be accessed via ssh:

$ ssh -l username maia.ucsc.edu
$ ssh -l username alcyone.ucsc.edu

Maia is a Dell PowerEdge 1950 server with two 2.33GHz dual-core Intel Xeon 5148LV processors and 32GB of memory. Alcyone is an HP xw8600 workstation with an nVidia Quadro FX 5600 graphics card, two 3.16GHz quad-core Intel Xeon X5460 processors and 64GB of memory.

Storage

The global storage subsystem consists of two parts: the fast work and the more spacious Arbeit. The fast work uses 146GB 15K RPM SAS disk drives, which are housed in six Dell PowerVault MD1000 enclosures and served from two Dell PowerEdge 2950 servers. The total raw capacity of work is 13TB with 10TB usable. The more spacious Arbeit uses 500GB 7200 RPM SATA disk drives, which are housed in eight Dell PowerVault MD1000 enclosures and served from two Dell PowerEdge 2950 servers. The total raw capacity of Arbeit is 60TB with 45TB usable. The global storage is accessible to all users on all nodes in the Pleiades cluster via the Ibrix file systems.

Interconnects

Each node is interconnected to two switch fabrics: gigabit Ethernet and non-blocking DDR InfiniBand. The core of the InfiniBand fabric is a Cisco SFS 7024D IB Server Switch, which supports 4X DDR (20 Gbps) IB ports, and offers fully nonblocking 11.5 Tbps of cross-sectional bandwidth with less than 200 nanoseconds port to port latency. The point-to-point bandwidth of the IB fabric is 20Gbps in theory and is measured to be in excess of 12 Gbps in real world applications (unidirectional speed).

User Environment

Each user has a home directory at /home/$USER, where $USER is the username. The home directory is NFS-mounted on all the nodes in the Pleiades cluster. It has a limited capacity and is only intended to store source codes and configuration files. In the home directory, there are two symbolic links: work and Arbeit, with work pointing to /ibrixfs/$USER (the fast global storage) and Arbeit to /ifs/$USER (the more spacious storage). You should run codes from work or Arbeit.

NOTE: We plan to impose quotas on /ibrixfs to keep it less than 20 percent occupied. We expect that our codes will run a lot faster this way. Please remove your old data from work as soon as possible.
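To see how full a file system is and how much space one of your directories takes, the standard df and du tools are enough. The two helper functions below are a sketch of our own (their names are hypothetical, not part of any Pleiades tool); they assume the POSIX output format of df -P, where the fifth column is the capacity.

```shell
# Percentage of the file system holding PATH that is in use.
fs_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Space used by a directory tree, in kilobytes.
dir_usage_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# On Pleiades one might run, for example:
#   fs_usage_percent /ibrixfs
#   dir_usage_kb "$HOME/work/"
```

The trailing slash in "$HOME/work/" makes du follow the symbolic link and measure the directory it points to.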

Module

We use the module tool to manage the software environment. The system initialization files have been modified such that the module files for Intel compilers 11.1 and OpenMPI over InfiniBand are loaded by default. You can verify this by running the following command:

$ module list
Currently Loaded Modulefiles:
  1) intel_compilers/11.1.064   2) openmpi_intel/1.2.8

You can find out what modules are available by running:

$ module avail

You can learn the usage of the module tool by running:

$ module --help

Compiling codes

Compiling Serial Programs

The following table summarizes how to compile C/C++ and Fortran 77/90 serial programs using the Intel compilers.

Compiler  Program Type  Suffix                Example
icc       C             .c                    icc [compiler_options] prog.c
icpc      C++           .C, .cc, .cpp, .cxx   icpc [compiler_options] prog.cpp
ifort     F77           .f, .for, .ftn        ifort [compiler_options] prog.f
ifort     F90           .f90, .fpp            ifort [compiler_options] prog.f90

Compiling MPI Programs

At login, the module files for Intel compilers 11.1 and OpenMPI over Infiniband are loaded to produce the default environment. OpenMPI is a full implementation of the MPI-2 standard over Infiniband. OpenMPI provides a set of "mpicmds" that support the compilation and execution of parallel MPI programs over Infiniband. The following table summarizes how to compile MPI programs in C/C++ and Fortran 77/90.

Compiler              Program Type  Suffix                Example
mpicc                 C             .c                    mpicc [compiler_options] prog.c
mpiCC/mpicxx/mpic++   C++           .C, .cc, .cpp, .cxx   mpiCC [compiler_options] prog.cpp
mpif77                F77           .f, .for, .ftn        mpif77 [compiler_options] prog.f
mpif90                F90           .f90, .fpp            mpif90 [compiler_options] prog.f90

The "mpicmds" in the table above are just wrappers of the Intel compilers. They automatically link startup and message passing libraries for OpenMPI into the executables. Here are a few examples:

### Sample MPI "hello world" application in C
$ mpicc -i-dynamic -o mpi_hello.x mpi_hello.c
### If you don't use the mpicc wrapper, the equivalent command will be much more tedious
$ icc -i-dynamic -o mpi_hello.x mpi_hello.c -I/usr/mpi/intel/openmpi-1.2.8/include -L/usr/mpi/intel/openmpi-1.2.8/lib64 -lmpi
### Sample MPI "hello world" application in C++
$ mpiCC -i-dynamic -o mpi_hello.x mpi_hello.cc
### Sample MPI "hello world" application in Fortran 77
$ mpif77 -i-dynamic -o mpi_hello.x mpi_hello.f
### Sample MPI "hello world" application in Fortran 90
$ mpif90 -i-dynamic -o mpi_hello.x mpi_hello.f90

If you omit the -i-dynamic option, you'll get a warning, which, however, is benign and can be safely ignored:

$ mpicc -o mpi_hello.x mpi_hello.c
/opt/intel/Compiler/11.1/064/lib/intel64/libimf.so: warning: warning: feupdateenv is not implemented and will always fail

Compiling OpenMP Programs

Since each of the PowerEdge 1950 nodes in the Pleiades cluster is a Xeon dual-core dual-processor system, applications can use the shared memory programming paradigm "on node". However, because of the limited number of processors in each node, there are rarely any significant performance benefits to using a shared-memory model on the node. Here are a few examples of how to compile OpenMP programs:

### Sample OpenMP "hello world" application in C
$ icc -openmp -o omp_hello.x omp_hello.c
### Sample OpenMP "hello world" application in Fortran 77
$ ifort -openmp -o omp_hello.x omp_hello.f

Note: For OpenMP programming, it is very important to know the cache line size of the CPU. Here is a tip on how to get the value (in bytes) on a Linux machine:

$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
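The same sysfs directory describes every cache of the CPU, not just index0. The function below is a small sketch (its name is ours) that prints the level, type, and line size of each cache. It assumes the usual Linux sysfs layout, in which each indexN directory contains the files level, type, and coherency_line_size; on a machine without that layout it simply prints nothing.

```shell
# Print level, type, and line size (bytes) for every cache of one CPU.
# The optional argument is the sysfs cache directory; cpu0 is the default.
list_cache_lines() {
    base=${1:-/sys/devices/system/cpu/cpu0/cache}
    for idx in "$base"/index*; do
        # Skip entries that do not expose a line size.
        [ -r "$idx/coherency_line_size" ] || continue
        printf 'L%s %s: %s bytes\n' \
            "$(cat "$idx/level")" \
            "$(cat "$idx/type")" \
            "$(cat "$idx/coherency_line_size")"
    done
}

list_cache_lines
```

The line size matters mostly because threads that repeatedly write to variables sharing one cache line slow each other down (false sharing), so per-thread data is commonly padded to a multiple of this value.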

Compiling hybrid Programs

Hybrid programs use multithreading within a node and message passing between nodes. This seems to be the natural programming model for attaining high performance on SMP clusters, of which Pleiades is an example; that performance, however, is sometimes hard to achieve.

Unfortunately, OpenMPI is currently neither thread-safe nor async-signal-safe; so we can't run hybrid programs with OpenMPI. However, both MVAPICH and Intel MPI are thread-safe and support the hybrid programming model. Please check the guides on MVAPICH and Intel MPI to learn how to compile hybrid programs.

Intel Compiler Options

Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for interprocedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.

The most basic level of optimization that the compiler can perform is controlled by the -On options, explained below.

Level     Description
n = 0     Fast compilation, full debugging support; implied by -g unless another -O level is given
n = 1,2   Low to moderate optimization, partial debugging support:
            • instruction rescheduling
            • copy propagation
            • software pipelining
            • common subexpression elimination
            • prefetching, loop transformations
n = 3+    Aggressive optimization: compile time/space intensive and/or marginal effectiveness; may change code semantics and results (sometimes even breaks code!):
            • enables -O2
            • more aggressive prefetching, loop transformations

The following table lists some of the more important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.

Option                  Description
-c                      For compilation of source file only.
-O3                     Aggressive optimization (-O2 is default).
-xSSE3                  Generates code with streaming SIMD extensions SSE3 for EM64T architecture.
-g                      Debugging information, generates symbol table.
-mp                     Maintain floating point precision (disables some optimizations).
-mp1                    Improve floating-point precision (speed impact is less than -mp).
-ip                     Enable single-file interprocedural (IP) optimizations (within files).
-ipo                    Enable multi-file IP optimizations (between files).
-prefetch               Enables data prefetching (requires -O3).
-openmp                 Enable the parallelizer to generate multi-threaded code based on the OpenMP directives.
-openmp_report[0|1|2]   Controls the OpenMP parallelizer diagnostic level.

For more compiler/linker options, check the ifort and icc man pages.
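Since no single set of options is best for every application, it helps to build the same source at several optimization levels and time the resulting binaries. The function below is a dry-run sketch: it only prints the compile commands, using options from the table above. The function name, the source file prog.c, and the output file names are placeholders of ours.

```shell
# Print (do not run) one compile command per optimization level so that
# the resulting binaries can be timed against each other.
build_matrix() {
    src=${1:-prog.c}
    for level in 1 2 3; do
        echo "icc -O$level -xSSE3 -ip -o ${src%.c}_O$level.x $src"
    done
}

build_matrix prog.c
```

Once the printed commands look right, they can be executed by piping the output to sh on a machine where icc is available.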

Running codes

On Pleiades we use Torque as the queuing system and Maui as the job scheduler. Torque is a derivative of OpenPBS. Commonly used Torque tools include qsub, for job submission; qstat, for monitoring the status of jobs; and qdel, for terminating jobs prior to completion. For more detailed information regarding these commands, check the man pages.
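A typical session with these three tools might look like the sketch below. The Torque commands are shown as comments because they only work on the frontend; the helper job_number is our own illustration, and it assumes that qsub prints the job identifier in the form 1234.pleiades.ucsc.edu.

```shell
# Extract the numeric part of a Torque job identifier such as
# "1234.pleiades.ucsc.edu".
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Typical session on the frontend (requires Torque, hence commented out):
#   jobid=$(qsub openmpi.pbs)   # qsub prints the job identifier
#   qstat "$jobid"              # check the status of that job
#   qstat -u "$USER"            # list all of your jobs
#   qdel "$jobid"               # remove the job from the queue

job_number "1234.pleiades.ucsc.edu"
```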

To run your codes on Pleiades, you need to create a Torque/PBS script, and then use the qsub command to submit the job to the queue. Below are examples of how to run various kinds of programs on Pleiades.

Running MPI programs

Now assume that you've successfully compiled your MPI program mpi_hello.c with OpenMPI, and you want to run the binary mpi_hello.x on 16 processors/cores. First, copy the binary mpi_hello.x to a subdirectory of work or Arbeit. NOTE: Do not run your code from your home directory, which is very slow and limited in capacity.

Then create a Torque/PBS script with any name of your choice. For illustration, let's assume the file name is openmpi.pbs. The content of openmpi.pbs is listed below with annotations:

### The shell that interprets the job script
#PBS -S /bin/bash
### Name of the job
#PBS -N hello
### Requesting 4 nodes, with 4 processors per node
### Note: ncpus=16 does NOT work!
#PBS -l nodes=4:ppn=4
### Requesting 4 hours of computing time
#PBS -l walltime=4:00:00
### Go to directory where your code is
cd $PBS_O_WORKDIR
### Do not omit the "-hostfile $PBS_NODEFILE" option
### Otherwise, all 16 processes will be running on 1 node!
mpirun -hostfile $PBS_NODEFILE -np 16 ./mpi_hello.x

You are now ready to submit the job:

$ qsub openmpi.pbs
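If you change the nodes or ppn request, the value passed to -np has to change with it. One way to avoid the mismatch is to derive the process count from the node file inside the job script. The sketch below assumes that Torque writes one line per granted processor slot to the file named by $PBS_NODEFILE; the function names are ours.

```shell
# Count the processor slots granted to the job (one line per slot).
count_slots() {
    echo $(( $(wc -l < "$1") ))
}

# Count the distinct nodes listed in the same file.
count_nodes() {
    echo $(( $(sort -u "$1" | wc -l) ))
}

# Inside a job script the launch line could then read:
#   mpirun -hostfile $PBS_NODEFILE -np "$(count_slots "$PBS_NODEFILE")" ./mpi_hello.x
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "slots: $(count_slots "$PBS_NODEFILE"), nodes: $(count_nodes "$PBS_NODEFILE")"
fi
```

For the nodes=4:ppn=4 request used in openmpi.pbs, this should report 16 slots on 4 nodes.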

Running Serial Programs

To run your serial binary hello.x on Pleiades, create a Torque/PBS script named serial.pbs, which contains the following lines:

#PBS -S /bin/bash
#PBS -N hello
#PBS -l ncpus=1
#PBS -l walltime=4:00:00
cd $PBS_O_WORKDIR
./hello.x

and then submit the job with:

$ qsub serial.pbs

Running OpenMP Programs

To run your OpenMP binary omp_hello.x on all 4 cores of a compute node, create a Torque/PBS script named omp.pbs, which contains the following lines:

#PBS -S /bin/bash
#PBS -N hello
#PBS -l ncpus=4
#PBS -l walltime=4:00:00
cd $PBS_O_WORKDIR
./omp_hello.x

and then submit the job with:

$ qsub omp.pbs
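The omp.pbs script does not set the number of OpenMP threads explicitly. If you want to be sure that the program uses exactly as many threads as the cores you requested, you can export OMP_NUM_THREADS, the standard OpenMP environment variable, before launching the binary. A minimal sketch, in which the variable name ncpus is ours:

```shell
# Keep the OpenMP thread count in step with the Torque request.
ncpus=4                          # must match "#PBS -l ncpus=4" in omp.pbs
export OMP_NUM_THREADS=$ncpus    # standard OpenMP environment variable
echo "OpenMP threads: $OMP_NUM_THREADS"

# In the job script, the next line would launch the program:
#   ./omp_hello.x
```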

Running Hybrid Programs

Unfortunately, OpenMPI is currently neither thread-safe nor async-signal-safe; so we can't run hybrid programs with OpenMPI. However, both MVAPICH and Intel MPI are thread-safe and support the hybrid programming model. Please check the guides on MVAPICH and Intel MPI to learn how to run hybrid programs.

Debugging codes

This section covers the use of the TotalView Debugger with OpenMPI programs. TotalView is among the most widely used HPC debuggers. While TotalView can be used for debugging and analyzing serial programs, it really shines when used to debug, analyze, and tune the performance of complex, multi-process and/or multi-threaded applications. Below we present a step-by-step guide on how to start a TotalView debugging session for the sample MPI program mpi_hello.c.

a). Since the TotalView GUI is an X-Window application, your SSH connection to Pleiades needs to allow X11 forwarding. This can be accomplished by adding the line "ForwardX11Trusted yes" to your SSH config file or by using the -Y option to SSH:

$ ssh -Y -l username pleiades.ucsc.edu

NOTE: Since the later versions of SSH use untrusted X11 cookies by default, the -X flag (or equivalently, "ForwardX11 yes" in your SSH config file) will not work! Use the -Y flag instead.

b). Append the following line to your .bashrc (or .cshrc if you use C Shell), if it is not already there:

module load totalview/8.7.0

c). Compile your code with the -g option:

$ mpicc -i-dynamic -g -o mpi_hello.x mpi_hello.c

d). Create a Torque/PBS script tv.pbs, which contains the following lines:

#PBS -S /bin/bash
#PBS -N tv
#PBS -l nodes=2:ppn=4
#PBS -l walltime=1:00:00

Here walltime should be the maximum time you'll use for your debugging session (you can always end it sooner), and nodes and ppn should be the number of nodes and processors per node you want to use for your debugging session.

At the Pleiades command prompt, submit the request for an interactive job:

$ qsub -I -X tv.pbs
qsub: waiting for job 1234.pleiades.ucsc.edu to start
### Or the same goal can be achieved without a PBS script
$ qsub -I -X -N tv -l nodes=2:ppn=4,walltime=1:00:00
qsub: waiting for job 1234.pleiades.ucsc.edu to start

This requests a set of compute nodes for you to run the debugging session on. The -I flag tells the scheduler that you want to use these nodes interactively, and the -X flag tells it that you want X11 forwarding enabled. If the nodes are available, you will see a message like "qsub: job 1234.pleiades.ucsc.edu ready", and you will be dropped to your home directory on one of the compute nodes.

e). Go to the directory where your code is:

$ cd $PBS_O_WORKDIR

f). Unfortunately, we can't use the method described in the Open MPI FAQ to start a TotalView debugging session. With this method, we always end up deep in the machine code of mpirun itself! Most likely, OpenMPI on Pleiades was not compiled with proper debugging support.

### DOES NOT WORK!
$ mpirun --debug -hostfile $PBS_NODEFILE -np 8 ./mpi_hello.x

Let's use the indirect method instead. At the command prompt, run

$ totalview &

Two windows will pop up. In the "New Program" window, go to the "Program" tab, click the "Browse..." button to choose your executable (or just type the name of your executable in the field for "Program:").

[Figure: Program tab of the New Program window]

Then switch to the "Arguments" tab and enter any command line arguments or environment variables you would like to use.

Next switch to the "Parallel" tab. Select "Open MPI" from the list of "Parallel system:", and enter the desired values into "Tasks:" (this value should match the number you requested earlier with qsub). You can leave the value of "Nodes:" as the default 0. Then in the field "Additional starter arguments:", enter "-hostfile $PBS_NODEFILE". Then click "OK".

[Figure: Parallel tab of the New Program window]

g). TotalView should now load your program and you can begin debugging.

[Figure: TotalView process window]

h). When you are done with debugging, don't forget to log out of the compute node.

References

To learn more about the TotalView Debugger, please consult the TotalView documentation.

Although not covered in this section, two other venerable debuggers are available on Pleiades: GDB & IDB, both of which are very capable debuggers for serial and OpenMP programs. For more information, please consult their online documentation.

Visualization & Data Analysis

The following visualization & data analysis packages are available on the visualization nodes maia & alcyone: IDL, ParaView, VAPOR, and VisIt.

IDL is in the default path, and the other three tools are available as modules:

$ module avail
paraview/3.6.1  vapor/1.5.2  visit/1.12.1

To run, for example, VisIt on alcyone, first start an SSH connection to alcyone with X11 forwarding enabled:

### If the "-X" flag does not work, use "-Y" instead.
$ ssh -X -l username alcyone.ucsc.edu

At the command prompt of alcyone, load the module for VisIt if it has not already been loaded, and then start VisIt:

$ module load visit/1.12.1
### Or you could append the following line to your .bashrc:
[ `hostname` == alcyone.local ] && module load visit/1.12.1
$ visit &

This works. However, a far better approach is to run VisIt in distributed mode, in which the GUI and viewer run locally on your workstation, while the database server and compute engine run remotely on alcyone. Below we show you how to configure VisIt to run in distributed mode.

Configuring VisIt to run in distributed mode

a). VisIt must be in your default search path. On alcyone, append the following line to your .bashrc:

[ `hostname` == alcyone.local ] && module load visit/1.12.1

b). Download and install a pre-compiled VisIt executable onto your workstation.

c). Start VisIt on your workstation. Now you need to create a host profile for alcyone. Open the Host profiles window by choosing Host profiles from the Options menu. Click the New profile button. Under the Selected profile tab, type "alcyone" into the Profile name field, "alcyone.ucsc.edu" into the Remote host name field, "alcyone" into the Host name aliases field, and "alcyone" into the Host nickname field. Click the Apply button.

[Figure: Host profiles window of VisIt, Selected profile tab]

Switch to the Advanced options tab. Check the Tunnel data connections through SSH option.

[Figure: Host profiles window of VisIt, Advanced options tab]

Click the Apply button, then the Dismiss button. Once the Host profiles window is gone, don't forget to click Save Settings from the Options menu; otherwise, the host profile for alcyone will be lost when you exit VisIt!

NOTE: If you are lazy, you can skip this step and just download this example config file to the $HOME/.visit/ directory on your workstation.

NOTE: If your workstation runs Microsoft Windows, one extra step is required to get this to work! On alcyone, create a $HOME/.ssh/environment file that contains the following line:

BASH_ENV=~/.bashrc

Running VisIt in distributed mode

The procedure for running VisIt in distributed mode is no different than it is for running in single-computer mode. You begin by opening the File Selection window and typing the name of the computer where the files are stored into the Host text field. Type or choose "alcyone.ucsc.edu" and select the data files you want to visualize. Have fun!

[Figure: File selection window of VisIt]