How to run on the AWS DP HPC cluster using SLURM

This page describes how to run the rapthor prefect prototype on one or more nodes on the AWS DP HPC cluster using SLURM with the Prefect server running on the headnode.

If you want to run inside a container on your local machine instead, see the quickstart guide.
If you want to run on an interactive compute node instead using a local prefect server on the compute node (i.e. without having to set up the Prefect server on the headnode) see the instructions for running on an interactive node.

There are three main steps to running the rapthor prefect prototype on the AWS DP HPC cluster using SLURM (details for each step are provided below):

Set up a Prefect server on the headnode using the aws-prefect-start.sh script.
Submit a SLURM job to run the code:
- For a single node run you can modify the example (single node SLURM script).
- For a multi-node run you can modify the example (multi-node SLURM script).
(Optional) Monitor the progress of the flow in the Prefect UI and Dask dashboard on your local machine (requires setting up an SSH tunnel).

The SLURM script will run the code in poc.py. Logs will be output to filepaths specified in the slurm script. These include:

slurm-<job_name>-<job_id>.out: standard output and error from the job (including output from DP3 and WSClean which are run by the poc.py script)
versions-<job_name>-<job_id>.txt: a list of versions of key software used in the job, including spack modules and python packages
tunnel-<job_name>-<job_id>.txt: instructions for setting up SSH tunneling to access the Prefect UI and Dask dashboard on your local machine
monitor-<job_name>-<job_id>.log: benchmarking traces and plots for each node when running on multiple nodes.

The scripts have been tested on the AWS DP HPC cluster.

Prerequisites

An account on the AWS DP HPC cluster
This repository cloned to a directory on the AWS DP HPC cluster

Steps

#. Set up a Prefect server on the headnode

If there is already a Prefect server running on the headnode, you can skip this step.

Log into the DP HPC headnode.

Start a tmux session (using e.g. tmux new -s prefect).

Change to the root of project directory.
Run the aws-prefect-start.sh script:
./aws-prefect-start.sh
This will start a Prefect server on the headnode which can be accessed from your local machine using an SSH tunnel (see instructions below).

If something else is using the default port (46200), you can specify a different port by exporting the PREFECT_PORT environment variable before running the script, e.g.:
export PREFECT_PORT=12345
./aws-prefect-start.sh
This will also run aws-setup-env.sh to set up the virtual python environment needed to run prefect, if this does not already exist.

A log file will be created prefect-server-%Y%m%d-%H%M%S.log for outputs and includes instructions for setting up SSH tunneling.
Once the Prefect server is running, you can detach from the tmux session (using CTRL-B D) and leave the server running in the background. If you kill the tmux session, the Prefect server will stop (repeat steps 1-4 to restart it).

You can now access the Prefect UI on your local machine by setting up an SSH tunnel to the headnode (see instructions in the log file mentioned above).

#. Submit a SLURM job to run the proof of concept pipeline

Make sure the Prefect server is running on the headnode before submitting the SLURM job (step 1 above).

Log into the DP HPC headnode.
Change directory to the repository root folder, OR set the REPO_DIR environmental variable to point to the project root, e.g.:
```
export REPO_DIR=~/path/to/repo/ska-sdp-rapthor-prefect-prototype
```
If you changed the default port used by the Prefect server on the headnode, export the PREFECT_PORT environment variable to the same value, e.g.:
```
export PREFECT_PORT=12345
```
Decide how many nodes you want to use for the run. We have provided an example script for a single-node run and a multi-node run.
Edit the SLURM script sbatch file you want to use. You can change the paths or names of the log files and the SLURM job parameters if needed. Check that the SPACK_TAG is set to the version you want to use. For the multi-node run you may also want to change the number of nodes.

Note

The multi-node SLURM script will start a Dask scheduler on the first node using one CPU core and Dask workers on all nodes with (cpus-per-task - 1) cores each. If you use the single-node SLURM script a local cluster will be started by the python script using all available cores on the node.

Note

The multi-node SLURM script will run WSClean with MPI support by default. This will allow WSClean to perform parallel gridding using multiple MPI processes. Since we have not tuned the WSClean parameters for running with MPI, the performance may not be optimal. You can experiment with different WSClean parameters in the poc.py script if you want to try to improve the performance. See the WSClean MPI documentation for more information. If you want to run WSClean on a single node you can edit the line that runs the poc.py script by removing the --use-mpi option. DP3 and Dask will still use all nodes.

Submit the SLURM job using sbatch, e.g.:

sbatch aws-run-poc.sbatch

or

sbatch aws-run-poc-multi-node.sbatch

You can check the status of the job using squeue or sacct.
Optional: monitor the job using the Prefect UI and Dask dashboard from your local machine. This will also show previous runs.
1. Once the job has started, copy the ssh tunnelling command from the tunnel-<job_name>-<job_id>.txt file created by the SLURM script.
2. Open a terminal on your local machine
3. Start an aws session
4. Paste the command you copied from the log file, changing the username and hostname to whatever you normally use to access the headnode.
5. Open a web browser and go to http://127.0.0.1:14200 to access the Prefect UI and http://127.0.0.1:18787 to access the Dask dashboard.
  
  Note
  
  The Dask dashboard will only be available once the Dask cluster has been started by the flow (after the SLURM job has started running) and may have a different IP between runs, depending on which compute node is running the job.