How to run on the AWS DP HPC cluster using SLURM
================================================

This page describes how to run the rapthor prefect prototype on one or more nodes of the AWS DP HPC cluster using SLURM, with the Prefect server running on the headnode.

- If you want to run inside a container on your local machine instead, see the `quickstart guide <../quickstart.html>`_.
- If you want to run on an interactive compute node instead, using a local Prefect server on the compute node (i.e. without having to set up the Prefect server on the headnode), see the `instructions for running on an interactive node <./aws_interactive.html>`_.

.. image:: ../_static/images/pipeline-overview.png
   :alt: Pipeline flow diagram
   :align: center
   :width: 80%

There are three main steps to running the rapthor prefect prototype on the AWS DP HPC cluster using SLURM (details for each step are provided below):

#. **Set up a Prefect server on the headnode** using the ``aws-prefect-start.sh`` script.
#. **Submit a SLURM job** to run the code:

   - For a single-node run you can modify the example single-node SLURM script.
   - For a multi-node run you can modify the example multi-node SLURM script.

#. (Optional) **Monitor the progress** of the flow in the Prefect UI and Dask dashboard on your local machine (requires setting up an SSH tunnel).

The SLURM script will run the code in ``poc.py``. Logs are written to the file paths specified in the SLURM script. These include:

- ``slurm--.out``: standard output and error from the job (including output from DP3 and WSClean, which are run by the ``poc.py`` script)
- ``versions--.txt``: a list of versions of key software used in the job, including spack modules and Python packages
- ``tunnel--.txt``: instructions for setting up SSH tunneling to access the Prefect UI and Dask dashboard on your local machine
- ``monitor--.log``: benchmarking traces and plots for each node when running on multiple nodes
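
When several jobs have been run from the same directory, it can be handy to pick out the newest log of each kind listed above. The following is a minimal sketch, not part of the repository: the helper name ``latest_log`` is hypothetical, and it assumes the log files are written to a single directory and keep the prefixes shown above (``slurm``, ``versions``, ``tunnel``, ``monitor``).

```shell
# Sketch only: print the newest file in a directory whose name starts with
# "<prefix>-". The helper name latest_log is hypothetical; the prefixes
# (slurm, versions, tunnel, monitor) are taken from the log list on this page.
latest_log() {
    dir="$1"      # directory containing the log files
    prefix="$2"   # log kind, e.g. slurm or tunnel
    # ls -t sorts by modification time, newest first. Errors are silenced so
    # that a directory with no matching files simply prints nothing.
    ls -t "$dir"/"$prefix"-* 2>/dev/null | head -n 1
}

# Example: newest SLURM output and tunnel instructions in the current directory.
latest_log . slurm
latest_log . tunnel
```

For example, ``tail -f "$(latest_log . slurm)"`` would follow the output of the most recent job, assuming at least one matching file exists.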
The scripts have been tested on the AWS DP HPC cluster.

Prerequisites
-------------

- An account on the AWS DP HPC cluster
- This repository cloned to a directory on the AWS DP HPC cluster

Steps
-----

1. Set up a Prefect server on the headnode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If there is already a Prefect server running on the headnode, you can skip this step.

#. Log into the DP HPC headnode.
#. Start a tmux session (using e.g. ``tmux new -s prefect``).
#. Change to the project root directory.
#. Run the ``aws-prefect-start.sh`` script:

   .. code-block:: bash

      ./aws-prefect-start.sh

   This will start a Prefect server on the headnode, which can be accessed from your local machine using an SSH tunnel (see instructions below). If something else is using the default port (46200), you can specify a different port by exporting the ``PREFECT_PORT`` environment variable before running the script, e.g.:

   .. code-block:: bash

      export PREFECT_PORT=12345
      ./aws-prefect-start.sh

   This will also run ``aws-setup-env.sh`` to set up the Python virtual environment needed to run Prefect, if it does not already exist. A log file named ``prefect-server-%Y%m%d-%H%M%S.log`` is created for the server output; it also includes instructions for setting up SSH tunneling.

#. Once the Prefect server is running, you can detach from the tmux session (using ``CTRL-B D``) and leave the server running in the background. If you kill the tmux session, the Prefect server will stop (repeat steps 1-4 to restart it).
#. You can now access the Prefect UI on your local machine by setting up an SSH tunnel to the headnode (see instructions in the log file mentioned above).

2. Submit a SLURM job to run the proof of concept pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Make sure the Prefect server is running on the headnode before submitting the SLURM job (step 1 above).

#. Log into the DP HPC headnode.
#. Change directory to the repository root folder, or set the ``REPO_DIR`` environment variable to point to the project root, e.g.:

   .. code-block:: bash

      export REPO_DIR=~/path/to/repo/ska-sdp-rapthor-prefect-prototype

#. If you changed the default port used by the Prefect server on the headnode, export the ``PREFECT_PORT`` environment variable with the same value, e.g.:

   .. code-block:: bash

      export PREFECT_PORT=12345

#. Decide how many nodes you want to use for the run. We have provided an example script for a single-node run and one for a multi-node run.
#. Edit the SLURM sbatch script you want to use. You can change the paths or names of the log files and the SLURM job parameters if needed. Check that ``SPACK_TAG`` is set to the version you want to use. For the multi-node run you may also want to change the number of nodes.

   .. note::

      The multi-node SLURM script will start a Dask scheduler on the first node using one CPU core, and Dask workers on all nodes with ``(cpus-per-task - 1)`` cores each. If you use the single-node SLURM script, a local cluster will be started by the Python script using all available cores on the node.

   .. note::

      The multi-node SLURM script will run WSClean with MPI support by default. This allows WSClean to perform parallel gridding using multiple MPI processes. Since we have not tuned the WSClean parameters for running with MPI, the performance may not be optimal. You can experiment with different WSClean parameters in the ``poc.py`` script if you want to try to improve the performance. See the WSClean MPI documentation for more information. If you want to run WSClean on a single node, edit the line that runs the ``poc.py`` script and remove the ``--use-mpi`` option. DP3 and Dask will still use all nodes.

#. Submit the SLURM job using ``sbatch``, e.g.:

   .. code-block:: bash

      sbatch aws-run-poc.sbatch

   or

   .. code-block:: bash

      sbatch aws-run-poc-multi-node.sbatch

#. You can check the status of the job using ``squeue`` or ``sacct``.
#. Optional: monitor the job using the Prefect UI and Dask dashboard from your local machine. This will also show previous runs.

   #. Once the job has started, copy the SSH tunneling command from the ``tunnel--.txt`` file created by the SLURM script.
   #. Open a terminal on your local machine.
   #. Start an AWS session.
   #. Paste the command you copied from the log file, changing the username and hostname to whatever you normally use to access the headnode.
   #. Open a web browser and go to ``http://127.0.0.1:14200`` to access the Prefect UI and ``http://127.0.0.1:18787`` to access the Dask dashboard.

   .. note::

      The Dask dashboard will only be available once the Dask cluster has been started by the flow (after the SLURM job has started running) and may have a different IP between runs, depending on which compute node is running the job.
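
To illustrate what the tunneling step does, here is a rough sketch of how such a port-forwarding command could be assembled. It is not the command written to the ``tunnel--.txt`` file, which remains the authoritative source. The helper name ``build_tunnel_cmd``, the user name and both host names are hypothetical. The local ports (14200, 18787) and the default Prefect port (46200) are taken from this page; the remote Dask port 8787 is Dask's default dashboard port and is an assumption about this setup.

```shell
# Sketch only: print an SSH port-forwarding command without opening a
# connection. The function name, user name and host names are hypothetical.
build_tunnel_cmd() {
    user="$1"           # your login name on the headnode
    headnode="$2"       # headnode hostname as you normally reach it
    compute_node="$3"   # compute node running the Dask scheduler
    # Use the same port the Prefect server was started with (default 46200).
    prefect_port="${PREFECT_PORT:-46200}"
    # -N: forward ports only, run no remote command.
    # -L local_port:target_host:target_port forwards a local port via the headnode.
    # Remote port 8787 is Dask's default dashboard port (an assumption here).
    printf 'ssh -N -L 14200:localhost:%s -L 18787:%s:8787 %s@%s\n' \
        "$prefect_port" "$compute_node" "$user" "$headnode"
}

# Example with placeholder names; prints the command rather than running it.
build_tunnel_cmd alice headnode.example.org compute-node-1
```

Printing the command instead of executing it makes it easy to compare against the contents of the tunnel file before connecting.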