Nektar++ on ARCHER2

The ARCHER2 national supercomputer is a world-class advanced computing resource and the successor to ARCHER. This guide provides basic instructions for compiling the Nektar++ stable release or master branch on ARCHER2.

Compilation Instructions

ARCHER2 uses a module-based system to load system software. To compile Nektar++ on ARCHER2, select the GNU compiler suite and load the required modules. On login, the system automatically loads cmake (default version 3.18.4 at the time of writing), and git is also available.
Basic module commands are briefly explained here.

export CRAY_ADD_RPATH=yes
module restore PrgEnv-gnu
module load cray-fftw

These commands can be placed in your shell startup file so that they do not have to be typed in every session. After running module restore PrgEnv-gnu, the system prints several warnings and informational messages about environment variables and modules being unloaded and loaded. These can be ignored: type q to skip to the end of the messages, then load cray-fftw.

To clone the repository, first create a public/private SSH key pair and add the public key to GitLab. Instructions for creating an SSH key can be found at Generating a new SSH key pair. If SSH keys have already been set up, this step can be skipped.

The code must be compiled and run from the work directory, which is at /work/project_code/project_code/user_name. For example, for the project code e01 and username mlahooti, the work directory is /work/e01/e01/mlahooti. You can also run echo $HOME, which in this example prints /home/e01/e01/mlahooti, and change the /home/ part to /work/ to obtain your work directory.
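The substitution of /home/ by /work/ can also be scripted. The following is a minimal sketch, assuming your home directory follows the /home/project_code/project_code/user_name layout described above; work_dir_from_home is a hypothetical helper name, not an ARCHER2 command.

```shell
#!/bin/bash
# Derive the work directory from a home directory path by swapping the
# leading /home/ for /work/.
# work_dir_from_home is a hypothetical helper, not part of ARCHER2.
work_dir_from_home() {
    local home_dir="$1"
    # ${var#pattern} removes the matching prefix from the value.
    echo "/work/${home_dir#/home/}"
}

# Print the work directory that corresponds to the current $HOME.
work_dir_from_home "$HOME"
```

With HOME set to /home/e01/e01/mlahooti, this prints /work/e01/e01/mlahooti, which can then be passed to cd.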

Enter the work directory and clone the Nektar++ code into a folder, e.g. nektarpp

cd /work/e01/e01/mlahooti
git clone nektarpp

After the code is cloned, enter the nektarpp folder, make a build directory and enter it

cd nektarpp
mkdir build
cd build

The above three steps can be done with a single line command too
cd nektarpp && mkdir build && cd build

From within the build directory, run the cmake configure command, setting the CC and CXX environment variables to select the ARCHER2 compiler wrappers. Note the following:

  • cc and CC are the Cray compiler wrappers for C and C++; the underlying compiler is determined by the loaded PrgEnv module.
  • SYSTEM_BLAS_LAPACK is disabled since, by default, the wrappers link against the libsci package, which contains an optimized version of BLAS and LAPACK and does not require any additional arguments to cc.
  • HDF5 is the preferred output format on ARCHER2, since other formats can easily exceed the quota on the number of files. Setting this option from within ccmake has led to problems, so make sure to specify it on the cmake command line. Further, the HDF5 version installed on ARCHER2 is not supported at the moment, so it is built as a third-party library.
  • The system boost is currently not used, since it does not appear to be built with C++11 and so causes compilation errors.
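The notes above can be collected into a single configure line. The sketch below only prints such a line so that it can be inspected before use. The option names are assumptions based on the usual Nektar++ CMake option naming (NEKTAR_USE_... and THIRDPARTY_BUILD_...), and the MPI and FFTW switches are not stated on this page, so check everything against ccmake before relying on it.

```shell
#!/bin/bash
# Print (not run) a cmake configure line that reflects the notes above.
# All option names are assumptions; verify them with ccmake.
print_configure_cmd() {
    # CC=cc and CXX=CC select the Cray compiler wrappers.
    printf '%s' "CC=cc CXX=CC cmake"
    for opt in \
        NEKTAR_USE_SYSTEM_BLAS_LAPACK=OFF \
        NEKTAR_USE_MPI=ON \
        NEKTAR_USE_FFTW=ON \
        NEKTAR_USE_HDF5=ON \
        THIRDPARTY_BUILD_HDF5=ON \
        THIRDPARTY_BUILD_BOOST=ON
    do
        # Each entry becomes one -D definition on the command line.
        printf ' -D%s' "$opt"
    done
    # The trailing .. points cmake at the source tree above build/.
    printf ' ..\n'
}

print_configure_cmd
```

Printing the command first makes it easy to review the options, then paste the line into the terminal from within the build directory.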

At this point you can run ccmake .. to, for example, disable unnecessary solvers. Then run make as usual to compile and install the code:

make -j 4 install

NOTE: Do not try to run regression tests – the binaries at this point are cross-compiled for the compute nodes and will not execute properly on the login nodes.

Running jobs on ARCHER2

ARCHER2 uses Slurm for job submission, which is different from the PBS scheduler used on Imperial College CX1 and CX2. Nektar++ must be built in the work directory, and jobs must also be submitted from the work directory.
ARCHER2 supports three Quality of Service (QoS) levels, which determine the type of job that can be run: standard, short and long. All of these QoS levels are on the standard partition. A brief overview is provided below; a detailed description can be found in the ARCHER2 documentation on running jobs.

  • Standard: allows a maximum of 940 nodes, where each node supports 128 tasks (processes). The maximum wall time is 24 hours. This is the most commonly used QoS.
  • Short: allows a maximum of 8 nodes with a maximum wall time of 20 minutes. Jobs with the short QoS can only be submitted Monday to Friday.
  • Long: allows a maximum of 64 nodes with a maximum wall time of 48 hours. The wall time requested for long QoS jobs must be at least 24 hours.
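The limits listed above can be checked before a job script is written. The following is a minimal sketch, assuming the node and wall-time limits quoted in this guide (they may change, so consult the ARCHER2 documentation); check_qos is a hypothetical helper, not an ARCHER2 or Slurm command.

```shell
#!/bin/bash
# check_qos QOS NODES MINUTES
# Returns 0 if the request fits the limits quoted in this guide.
# check_qos is a hypothetical helper; the limits are taken from the list above.
check_qos() {
    local qos="$1" nodes="$2" minutes="$3"
    local max_nodes max_minutes min_minutes=0
    case "$qos" in
        standard) max_nodes=940; max_minutes=$((24 * 60)) ;;
        short)    max_nodes=8;   max_minutes=20 ;;
        long)     max_nodes=64;  max_minutes=$((48 * 60)); min_minutes=$((24 * 60)) ;;
        *) echo "unknown QoS: $qos"; return 2 ;;
    esac
    if [ "$nodes" -lt 1 ] || [ "$nodes" -gt "$max_nodes" ]; then
        echo "$qos: $nodes nodes is outside 1..$max_nodes"
        return 1
    fi
    if [ "$minutes" -lt "$min_minutes" ] || [ "$minutes" -gt "$max_minutes" ]; then
        echo "$qos: $minutes minutes is outside $min_minutes..$max_minutes"
        return 1
    fi
    echo "$qos: ok"
}

# 2 nodes for 14 hours 20 minutes (860 minutes) on the standard QoS.
check_qos standard 2 860
```

The sketch does not model the Monday to Friday restriction on the short QoS.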

A Slurm job script must specify the number of nodes, the number of tasks per node, the number of CPUs per task, the wall time, the budget ID, the partition, the quality of service (QoS), the number of OpenMP threads, the job environment and the execution command. It can also optionally include a user-supplied job name for easier identification of the job.

The job script can be generated using the bolt module as follows, where [arguments...] should be replaced with the program executable and its arguments. For more help, run bolt -h in the terminal.

module load bolt
bolt -n [parallel tasks] -N [parallel tasks per node] -d [number of threads per task] -t [wallclock time (h:m:s)] -o [script name] -j [job name] -A [project code]  [arguments...]

As an example, suppose Nektar++ is installed in /work/e01/e01/mlahooti/nektarpp and the simulation is a 3D homogeneous 1D (2.5D) simulation with HomModesZ=8. We want to run it on 256 processes, i.e. 2 nodes with 128 processes each, for 14 hours and 20 minutes, using the Hdf5 output format. We also want to give the job a name, e.g. firstTest, and charge it to the budget with project code project_id.
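The node count in this example follows from dividing the number of processes by the 128 tasks available per node and rounding up. As a minimal sketch (nodes_for_tasks is a hypothetical helper name):

```shell
#!/bin/bash
# Number of nodes needed for a given number of MPI tasks, rounding up.
# nodes_for_tasks is a hypothetical helper; 128 tasks per node is the
# figure quoted in this guide for the standard QoS.
nodes_for_tasks() {
    local tasks="$1" per_node="${2:-128}"
    # Integer ceiling division: (a + b - 1) / b
    echo $(( (tasks + per_node - 1) / per_node ))
}

nodes_for_tasks 256   # prints 2
```

The result is the value to use for --nodes, with --tasks-per-node set to 128.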
Here is an example Slurm script for a standard job:


#!/bin/bash

# Slurm job options (job-name, compute nodes, job time)
#SBATCH --job-name=firstTest
#SBATCH --time=14:20:0
#SBATCH --nodes=2
#SBATCH --tasks-per-node=128
#SBATCH --cpus-per-task=1

# Replace project_id below with your budget code (e.g. t01)
#SBATCH --account=project_id
#SBATCH --partition=standard
#SBATCH --qos=standard

# Setup the job environment (this module needs to be loaded before any other modules)
module load epcc-job-env

# Set the number of threads to 1
#   This prevents any threaded system libraries from automatically
#   using threading.
export OMP_NUM_THREADS=1

export NEK_DIR=/work/e01/e01/mlahooti/nektarpp/build
export NEK_BUILD=$NEK_DIR/dist/bin
export LD_LIBRARY_PATH=/opt/gcc/10.1.0/snos/lib64:$NEK_DIR/ThirdParty/dist/lib:$NEK_DIR/dist/lib64:$LD_LIBRARY_PATH

# Launch the parallel job

srun --distribution=block:block --hint=nomultithread $NEK_BUILD/IncNavierStokesSolver naca0012.xml session.xml --npz 4 -i Hdf5 &> runlog

In the above script, note the module load epcc-job-env line, which sets up the job environment and must be present in the script.
For convenience, the script also contains two export commands that define the NEK_DIR and NEK_BUILD environment variables: the former is the path to the Nektar++ build directory and the latter is the location of the solver executables. The third export adds the library locations to LD_LIBRARY_PATH, where the individual library paths are separated by a colon (:). The gfortran library path is included as well because, without it, the run terminated with an error that gfortran could not be found.

To submit the job, assuming the above script is saved in a file named myjob.slurm, run the following command:
sbatch myjob.slurm

The job status can be monitored using squeue -u $USER
Running this command prints information such as the following, where ST is the status of the job. Here PD means the job is waiting for resource allocation; other common statuses are R, F, CG, CD and CA, which mean running, failed, in the process of completing, completed and cancelled respectively.

121062    standard     myJob-1   mlahooti     PD       0:00        4 (Priority)
121064    standard     myJob-2   mlahooti     PD       0:00        4 (Priority)
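The status codes can be translated into words with a small lookup. The sketch below assumes the column layout of the sample output above, where the status is the fifth field; describe_state is a hypothetical helper name.

```shell
#!/bin/bash
# Translate a squeue status code (ST column) into a description.
# describe_state is a hypothetical helper covering the codes listed above.
describe_state() {
    case "$1" in
        PD) echo "waiting for resource allocation" ;;
        R)  echo "running" ;;
        F)  echo "failed" ;;
        CG) echo "completing" ;;
        CD) echo "completed" ;;
        CA) echo "cancelled" ;;
        *)  echo "unknown state: $1" ;;
    esac
}

# Sample line copied from the output above; the status is field 5.
line="121062    standard     myJob-1   mlahooti     PD       0:00        4 (Priority)"
state=$(echo "$line" | awk '{print $5}')
echo "$state: $(describe_state "$state")"
```

For the sample line this prints PD: waiting for resource allocation.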

A job can be cancelled using the scancel job-ID command, where job-ID is the ID of the job. For example, the job ID of the first job above is 121062.
Detailed information about a particular job, including the estimated start time, can be obtained via
scontrol show job -dd job-ID

NOTE: It is highly recommended to check that the job script is error free before submitting it to the system. The checkScript command checks the integrity of the job script, shows any errors and estimates the budget the job will consume. Run the following command in the directory from which you want to submit the job:

checkScript myjob.slurm