Working with Clusters for Deep Learning

Created by Jona Ruthardt (01/2025)


Before you start …

This tutorial will help you start using SLURM-based GPU clusters for your deep learning projects. Its goal is to guide you through the process of accessing and effectively utilizing a cluster, particularly for people who may be new to cluster computing.

By the end of the tutorial, you will understand:

  1. What a GPU cluster is and how its components interact
  2. How to connect to a cluster and set up your environment
  3. How to schedule and manage jobs to access GPUs for your experiment
  4. Best practices for working with a cluster to maximize your productivity and the efficiency of your experiments

While this tutorial was created for courses at UTN and therefore provides some instructions specific to the Alex cluster by FAU, most aspects translate to other compute and GPU clusters as well. Even if you work on a different system, it can still serve as a helpful introduction to the topic.

You will find additional resources and practical examples that help you get started in this GitHub repository.


The Alex Cluster

To work effectively with a compute cluster, it is essential to be familiar with its high-level architecture and the components it consists of. The following sections provide a basic introduction to the Alex cluster by FAU. For more detailed information, see the official documentation.

What is a Cluster?

A (GPU) cluster is a collection of interconnected computers (nodes) that are equipped with Graphics Processing Units (GPUs). These GPUs are specialized hardware accelerators designed for highly parallel computations, making them particularly effective for tasks like deep learning, machine learning, and other computationally intensive applications.


Why Use GPU Clusters?

When scaling up model architectures and the volume of training/evaluation data, computational demands can quickly overwhelm personal computers or local GPUs. In contrast, GPU clusters offer the following advantages:

  • Scalability (more available resources → train larger models; run many tasks at once)
  • Efficiency (higher performance resources → train models faster)
  • Advanced hardware (access cutting-edge hardware)

An additional benefit for students: you familiarize yourself with tools and infrastructure used in industry and gain practical experience.


Differences to Local Computing

|                  | Local Compute                                | GPU Cluster                                      |
|------------------|----------------------------------------------|--------------------------------------------------|
| Hardware         | Limited to resources on your personal device | Access to powerful GPUs                          |
| Performance      | Suitable for small-scale tasks               | Optimized for large-scale computations           |
| Accessibility    | Direct interaction with your system          | Accessed remotely via network connection         |
| Resource Sharing | Dedicated to one user                        | Shared across multiple users with job scheduling |
| Scalability      | Restricted to a single device                | Utilize multiple GPUs (and nodes)                |

Compute Architecture

At a high level, a cluster can be conceptualized as a set of individual computers (nodes) that are connected to a central entity and can exchange information. Nonetheless, each node has its own CPU, memory, and storage. A cluster is typically made up of at least two types of nodes:

Login Nodes

The login nodes serve as your way to access the cluster. You are automatically connected to one of these nodes when you connect to the cluster. They provide you with a basic interface to interact with the cluster (e.g., file management, coding, submitting jobs). However, they are not intended for serious computational workloads or running jobs. Since they are shared across multiple users, be mindful and avoid running resource-intensive processes on login nodes.

Compute Nodes

The compute/GPU nodes are where computationally demanding tasks are executed. Connecting directly to a compute node is usually not possible; you first have to reserve resources by scheduling a job from a login node.

The scheduler (e.g., SLURM) manages and allocates the cluster’s resources. User requests are queued and prioritized based on factors like scheduled runtime, resource requirements, time in queue, etc. Once sufficient resources become available, the appropriate node(s) are allocated and usable to run user-specified code.

The GPU nodes on Alex you have access to have the following specs:

  • 2x AMD EPYC CPU (2x 64 cores @2.0 GHz)
  • 8x Nvidia A40 GPU (48GB VRAM)
  • 512 GB memory
  • 7 TB node-local storage

Typically, compute nodes are divided further into smaller compute units. These units represent a subset of the available resources (such as a specific number of CPU cores or GPUs) and are allocated based on user requests. For example, our compute node with 8 GPUs can be partitioned into 8 smaller units (→ 16 CPU cores, 64GB RAM, ~1TB SSD), depending on the task at hand. This partitioning allows multiple users to share the same node without interfering with each other’s work, optimizing resource utilization.


graph TD
    %% Define User
    U[User] -->|SSH/Command Access| A[Login Node]
    %% Define Login Node and Compute Nodes
    A -->|Job Submission| B1[Compute Node 1]
    A --> B2[Compute Node 2]
    A --> B3[Compute Node 3]
    %% Define GPUs for each compute node
    subgraph .
        B1 --> G1[GPU 1]
        B1 --> G2[GPU 2]
        B1 --> G3[GPU 3]
        B1 --> G4[GPU 4]
    end
    subgraph .
        B2 --> G5[GPU 1]
        B2 --> G6[GPU 2]
        B2 --> G7[GPU 3]
        B2 --> G8[GPU 4]
    end
    subgraph .
        B3 --> G9[GPU 1]
        B3 --> G10[GPU 2]
        B3 --> G11[GPU 3]
        B3 --> G12[GPU 4]
    end

Filesystems and Data Storage

In addition to the storage on the node itself, the cluster has a centralized filesystem that all nodes can access. This allows users to store and manage their data and models. This section will guide you through the cluster’s available filesystems, their intended use, and best practices for managing your data effectively. Understanding these will help you optimize storage, maintain data integrity, and avoid issues such as quota limits or accidental data loss.


Overview

The cluster provides several filesystems, each with specific purposes, storage capacities, and backup policies. Here’s a summary of the most essential ones:

| Environment Variable | Purpose                                                    | Backup and Snapshots | IO Speed  | Quota   |
|----------------------|------------------------------------------------------------|----------------------|-----------|---------|
| $HOME                | Important and unrecoverable files (e.g., scripts, results) | Yes                  | Normal    | 50 GB   |
| $HPCVAULT            | Long-term storage of data                                  | Yes (infrequent)     | Slow      | 500 GB  |
| $WORK                | General-purpose files (e.g., logs, intermediate results)   | No                   | Normal    | 1000 GB |
| $TMPDIR              | Job-specific (!) scratch space                             | No                   | Very fast | None    |

Accessing Storage

The above file systems (except for the node-local $TMPDIR) are mounted across all cluster nodes. You can access the storage directories from anywhere using their predefined environment variables. Here are two examples:

Bash

  • Navigate to your $WORK directory: cd $WORK
  • List all files in your $HPCVAULT directory: ls $HPCVAULT

Python

import os

# Get the path to your $HOME directory
home_path = os.environ.get("HOME")
print(f"Home directory: {home_path}")

Best Practices

  • Use $HOME for important files due to regular backups and snapshots
  • Store temporary or large files in $WORK or $TMPDIR (the latter will be deleted once the job finishes!)
  • Stage your data to the node-local $TMPDIR when fast read/write operations are essential (cf. the Data Staging section below)
  • Regularly monitor quotas (run shownicerquota.pl) and clean up unnecessary files

Recovering Overwritten/Deleted Files

The snapshots taken of $HOME and $HPCVAULT at regular intervals allow you to restore files that may have been accidentally overwritten or deleted:

  1. List the available snapshots in the .snapshots directory of the target folder:

    $ ls -l $HOME/your-folder/.snapshots/

  2. Copy the desired version back:

    $ cp '/path/to/.snapshots/@GMT-TIMESTAMP/file' '/path/to/restore/file'

More information about filesystems


Interaction Between Components

Now that you know what kind of basic building blocks make up a cluster, let’s look at a typical workflow to illustrate how these components work together:

  1. Connect to the Login Node: Use ssh to remotely access the login node
  2. Implement and Set Up Your Experiments: Upload and edit datasets, scripts, etc. on the cluster's shared storage system
  3. Submit a Job: Specify the resource requirements and the code to be executed
  4. Scheduler Allocates Resources: A compute node is allocated and your code starts executing
  5. Access Results: Log back into the login node and find your results at the respective location in the filesystem

flowchart TD
    A[User] -->|SSH/scp/rsync| B[Login Node]
    B -->|Submit Job| D[Job Scheduler]
    D -->|Allocate Resources| E[Compute Node]
    C[Shared Storage System]
    B <-->|Edit Datasets, Scripts, etc.| C
    C -->|Fetch Code| E
    E -->|Store Results| C

Connecting to the Cluster

Now that you are familiar with compute clusters, let’s try using one.

Connecting to Alex via SSH

We will use the Secure Shell Protocol (SSH) to connect our local machine to the remote cluster. For that, make sure you have a terminal (or PowerShell) or an SSH client like PuTTY handy. However, before you can connect, a few setup steps are still necessary.


Accepting Cluster Invitation

By now, you should have received an invitation to the project “UTN AI&Robotics MSc Studierende” on the HPC-Portal. Follow these steps to accept the invitation:

  1. Open the HPC-Portal and log in via your UTN account
  2. Go to “User” > “Your Invitations” and accept the invitation

Configuring SSH Keys

To be able to access the cluster, authentication of your user is necessary. For this, we will be using SSH key pairs. Follow these steps for generating and uploading the keys:

  1. Generate an SSH key pair (private + public key) on your local machine with ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_nhr_fau. You will be asked to (optionally) enter a passphrase.
  2. Print the public key using cat ~/.ssh/id_ed25519_nhr_fau.pub and copy the output.
  3. Open the HPC-Portal, navigate to the “User” tab in your account settings, and click on the “Add new SSH key” button. Enter an alias name of your choice and paste the copied output of the key file.

Editing SSH Config

The Alex cluster is part of the larger FAU high-performance computing center, which uses a central dialog server from which a so-called proxy jump to the GPU cluster is performed. For this to work, we need to add some configuration to the ~/.ssh/config file on your machine. This ensures the cluster address, authentication method, etc. are all set up correctly when you try to connect.

Replace <HPC account> with your user account (see HPC-Portal under Your accounts > Active accounts > HPC-Account) in the following template and paste it to your ~/.ssh/config file:

Host csnhr.nhr.fau.de csnhr
    HostName csnhr.nhr.fau.de
    User <HPC account>
    IdentityFile ~/.ssh/id_ed25519_nhr_fau
    IdentitiesOnly yes
    PasswordAuthentication no
    PreferredAuthentications publickey
    ForwardX11 no
    ForwardX11Trusted no

Host alex.nhr.fau.de alex
    HostName alex.nhr.fau.de
    User <HPC account>
    ProxyJump csnhr.nhr.fau.de
    IdentityFile ~/.ssh/id_ed25519_nhr_fau
    IdentitiesOnly yes
    PasswordAuthentication no
    PreferredAuthentications publickey
    ForwardX11 no
    ForwardX11Trusted no

Notes about connecting via SSH …

  1. The user name you insert into the config file template will differ from the one used to log into the HPC-Portal. Be sure to use the one specified under Your accounts > Active accounts in the HPC-Portal.
  2. The file name and location must be ~/.ssh/config (no file extension like .txt). You can edit the file using nano or your text editor of choice.
  3. After submitting the SSH key in the HPC-Portal, it might take up to two hours for the key to be distributed to all systems. During that time, you may not be able to log in yet.

Testing Your Connection

If everything worked correctly, you should now be able to connect to the cluster. For that, first connect to the dialog server using the following command:

ssh csnhr.nhr.fau.de

If this is your first time connecting to this remote address, you will be asked whether you’d like to continue. Confirm by entering yes. By typing in the exit command, you can close the connection and return to your local shell.

Now, connect to the Alex cluster directly using ssh alex. Again, you might have to confirm your intent with yes the first time you try to connect.


Transferring Files (rsync/scp)

In case you want to manually transfer files from your local system to the cluster or vice versa, you can use the scp command:

# Local file -> cluster
scp -r path/to/file <USERNAME>@<CLUSTER>:/path/to/destination
# Remote file -> local machine
scp -r <USERNAME>@<CLUSTER>:/path/to/file path/to/destination

Be sure to correctly replace the <USERNAME> and <CLUSTER> tags with your username and the cluster address (e.g., alex), respectively. Also, specify the file/folder you want to copy.

For larger transfers, it is recommended to use rsync:

rsync -avz path/to/file <USERNAME>@<CLUSTER>:/path/to/destination

Hint: by adding the exclude flag (e.g., --exclude '*.pt'), it is possible to ignore certain files or file types during the transfer. This can be handy when you only want to transfer your code without the large model checkpoints.
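
As a small sketch (the folder name and destination path are placeholders), a transfer that skips PyTorch checkpoints could look like this:

# Copy the local my_project folder to the cluster, skipping all .pt checkpoint files
rsync -avz --exclude '*.pt' ./my_project <USERNAME>@<CLUSTER>:/path/to/destination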

Local vs. Remote Development

When working with a cluster, you have two main options for developing and testing your code: local or remote development. Each has its strengths and weaknesses, and your workflow can depend on your project requirements or personal preferences. Here is an overview of both:


Local Development - Implement Locally, Run on Cluster

Workflow:

  • Implement and test your code locally on your personal computer
  • Once the code runs successfully locally, commit and push the changes to the cluster (e.g., via git) for running large-scale experiments

Pros:

  • Quick prototyping and debugging (no waiting for resource allocations)
  • Resource efficiency (doesn’t take up GPUs on the cluster while implementing and testing)
  • Obligatory version control (regular commits are necessary)

Cons:

  • Limited resources (infeasible for models requiring more resources to be loaded/run)
  • Different environment (the local environment may differ from that of the cluster (e.g., CPU vs. GPU, CUDA versions, etc.))

Remote Development - Develop and Run on Cluster

Workflow:

  • Connect to the cluster directly and implement everything there using an IDE/editor like VS Code or PyCharm (or command-line tools like vim or nano)
  • Execute and debug all experiments on the cluster itself

Pros:

  • Consistent environment (development and execution happen in the same software and hardware environment)
  • Less management overhead (no need to maintain and manage two separate code and data instances)

Cons:

  • Connectivity dependency (requires an internet connection at all times)
  • Resource availability (you might need to wait for resources to get allocated)
  • Inefficient use of resources (testing and debugging might drastically reduce resource utilization)

Setting up the Environment

Once you are authenticated and logged into the cluster, the next step is to set up your environment further to ensure you can use the necessary software and GPU resources.


Modules

Environment modules provide different versions of compilers, applications, and environments and are used by many clusters to dynamically manage software environments. They allow you to load and unload software packages, which ensures compatibility and avoids conflicts between software versions while being easy to use.

Here are the basic commands to work with the modules system:

  • module avail: shows all software modules available on the cluster
  • module load/unload <module_name>: load or unload the module you want to use (e.g., python)
  • module list: shows all software modules currently loaded in the environment
  • module purge: reset to a clean environment

Use the modules package to make Python available by running module load python.
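
For example, a typical sequence on a login node looks like this (the exact modules and versions available may differ on other clusters):

module avail python    # list the Python modules available on the cluster
module load python     # load the default Python module (on Alex, it includes conda)
module list            # confirm which modules are currently loaded
which python           # verify that the module's interpreter is now on your PATH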


Conda

Once you have loaded the appropriate Python module, creating an isolated Python environment for each project using Conda or Python’s venvs is good practice. This ensures your dependencies are self-contained and avoids interference with other users or projects.

A conda installation is already available through the python module you just loaded. Before you use conda for the first time, run the following commands to correctly set up where environments and packages will be stored:

if [ ! -f ~/.bash_profile ]; then
  echo "if [ -f ~/.bashrc ]; then . ~/.bashrc; fi" > ~/.bash_profile
fi
module add python
conda config --add pkgs_dirs $WORK/software/private/conda/pkgs
conda config --add envs_dirs $WORK/software/private/conda/envs

This only has to be done once.


You can now create a new environment with the following command:

conda create --name <ENV_NAME> python=<PYTHON_VERSION>

Alternatively, you can of course also use an existing environment configuration as a basis (i.e., conda env create -f environment.yml or conda create --name <ENV_NAME> --file requirements.txt). Once the environment is activated (conda activate <ENV_NAME>), you can install libraries using pip or conda. Install packages from within an (interactive) job to ensure the hardware is correctly supported. As the compute nodes are not configured with internet access by default, it is important to configure the proxy so that additional software can be downloaded:

export http_proxy=http://proxy:80
export https_proxy=http://proxy:80

For convenience, you may add these statements to your .bashrc file (which is sourced every time a new bash shell is started).
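
Putting these pieces together, a minimal first-time setup could look like this (the environment name, Python version, and packages are purely illustrative):

module load python
conda create --name dl-project python=3.11   # "dl-project" is a placeholder name
conda activate dl-project
export http_proxy=http://proxy:80            # needed on compute nodes without internet access
export https_proxy=http://proxy:80
pip install torch torchvision                # install whatever your project requires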


Scheduling Jobs

SLURM (Simple Linux Utility for Resource Management) is a workload manager for scheduling jobs and managing cluster resources. It allows users to request specific resources (like GPUs, CPUs, or memory) and submit jobs to the cluster’s job queue.


SLURM Basics

Before submitting jobs, it is crucial to understand the following SLURM commands:

  • sinfo: Provides information about the cluster’s nodes and their availability. The output may look something like this and tells you which nodes are available and their current status (e.g., maintenance, reserved, allocated, idle, etc.):
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
a40*         up 1-00:00:00      1  maint a1722
a40*         up 1-00:00:00      1   resv a1721
a40*         up 1-00:00:00      8    mix a[0121,0126,0128-0129,0225,0229,0329,0423]
a40*         up 1-00:00:00     26  alloc a[0122-0125,0127,0221-0224,0226-0228,0321-0328,0421-0422,1621-1624]
a40*         up 1-00:00:00      8   idle a[0424-0429,0521-0522]
  • squeue: Shows scheduled jobs and their current status (queued/running); use squeue -u $USER to list only your own jobs. In the output, ST is the job status (R = Running, PD = Pending), TIME is the elapsed time since resource allocation, and NODELIST(REASON) shows the allocated node or the reason the job is pending.
    JOBID PARTITION NAME                 USER     ST       TIME  TIME_LIMIT  NODES CPUS NODELIST(REASON)
    1234567       a40 run_experiment       <USERID> PD       0:00    23:59:00      1   16 (Priority)
    1234568       a40 run_experiment       <USERID>  R    2:39:34    23:59:00      1   16 a0603
  • sbatch: Used to submit batch jobs to the SLURM scheduler (e.g., sbatch my_job_file.job). Once submitted, the job will be added to the queue and get assigned a unique job ID.
  • scancel: Cancels a specific job given its job ID (e.g., scancel 1234568).

Job Files

To run an experiment or script, you must create a job submission script specifying the resources required and the commands to execute. Here’s a simple job file that prints whether CUDA is available in PyTorch:

#!/bin/bash -l
#SBATCH --job-name=run_experiment   # Name of Job
#SBATCH --output=results_%A.out     # File where outputs/errors will be saved
#SBATCH --time=00:59:00             # Time limit (hh:mm:ss)
#SBATCH --ntasks=1                  # Number of tasks
#SBATCH --gres=gpu:a40:1            # Request 1 GPU
#SBATCH --nodes=1                   # Request 1 node

module purge
module load python # load preinstalled python module (includes conda) 
conda activate <ENV_NAME> # activate environment

python -c "import torch; print('cuda available:', torch.cuda.is_available())"

You can specify higher resource requests like this:

# Specifying the number of GPUs
#SBATCH --gres=gpu:a40:1 # Request 1 GPU
#SBATCH --gres=gpu:a40:4 # Request 4 GPUs

# Specifying the number of nodes
#SBATCH --nodes=1  # Request 1 node
#SBATCH --nodes=2  # Request 2 nodes

In the job file, specify the experiments/code you want to execute. Afterwards, you can submit the job via the sbatch command.
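
For example, assuming the job file above was saved as run_experiment.job (the file name is arbitrary):

sbatch run_experiment.job     # prints the assigned job ID, e.g., "Submitted batch job 1234567"
squeue -u $USER               # check whether the job is pending or running
cat results_1234567.out       # inspect the output file once the job has produced output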


Job Arrays

Job arrays enable you to run multiple similar jobs in one go. This can be useful when iterating over different model configurations (e.g., hyperparameter tuning) and is more convenient than manually managing and scheduling the various jobs yourself.

With job arrays, SLURM enables you to run the same program with different parameters. The following example executes the experiment.py file with different learning rate parameters:

#!/bin/bash -l
#SBATCH --job-name=run_experiment
#SBATCH --output=results_%A_%a.out
#SBATCH --time=00:59:00
#SBATCH --array=0-4
#SBATCH --ntasks=1
#SBATCH --gres=gpu:a40:1
#SBATCH --nodes=1

# Setup environment
module purge
module load python
conda activate <ENV_NAME>

# Define parameters for each task
LEARNING_RATES=("0.01" "0.001" "0.0001" "0.00001" "0.000001")
LR=${LEARNING_RATES[$SLURM_ARRAY_TASK_ID]}

# Run the program with the specified parameter
python experiment.py --lr=$LR

Here, $SLURM_ARRAY_TASK_ID is automatically set to the corresponding task ID. The total number of tasks is stored in $SLURM_ARRAY_TASK_COUNT. You can additionally specify the maximum number of concurrent jobs using the % operator (e.g., --array=0-7%4 creates 8 tasks but runs at most 4 of them at any time). The output is saved to a text file named results_<job_id>_<task_id>, as indicated by %A and %a, respectively.
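
Building on this, the same mechanism can also index a small grid over several hyperparameters. The following job file fragment is only a sketch (the learning rates and batch sizes are illustrative):

#SBATCH --array=0-5%3   # 6 tasks (2 learning rates x 3 batch sizes), at most 3 running at once

LEARNING_RATES=("0.01" "0.001")
BATCH_SIZES=("32" "64" "128")
LR=${LEARNING_RATES[$((SLURM_ARRAY_TASK_ID / ${#BATCH_SIZES[@]}))]}   # indices 0,0,0,1,1,1
BS=${BATCH_SIZES[$((SLURM_ARRAY_TASK_ID % ${#BATCH_SIZES[@]}))]}      # indices 0,1,2,0,1,2
python experiment.py --lr=$LR --batch_size=$BS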

For a more detailed introduction and advanced use cases, see the official SLURM documentation. Another handy way of organizing and executing experiments with job arrays for hyperparameter sweeps is suggested by Phillip Lippe. Instead of specifying the hyperparameters in the job file itself, it can be advisable to use a configuration management framework like Hydra.


Interactive Sessions

Interactive sessions allow you to access a compute node directly for debugging or experimentation. Unlike normal jobs, interactive sessions make it possible to continuously execute code without terminating the resource allocation after the initial script concludes. This can be useful for verifying the correctness of commands or for debugging, where waiting for resource allocation after every small change is infeasible.

You can request an interactive session using the following command (same parameters as regular job file):

salloc --gres=gpu:a40:1 --time=3:59:00

As soon as the resources are allocated, you are connected to the compute node and can directly interact with it via the command line (e.g., by running module load python when planning to use Python).
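
For instance, a debugging session could look like this (the environment name and script are placeholders):

salloc --gres=gpu:a40:1 --time=0:59:00   # request an interactive allocation and wait for it
module load python                       # set up the environment on the compute node
conda activate <ENV_NAME>
python my_debug_script.py                # run and iterate on your code interactively
exit                                     # leave the node and release the allocation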

However, keep in mind two important points when using interactive sessions:

  1. When you disconnect from Alex (e.g., unstable internet connection, closing your laptop), the session will be canceled, and whatever you are currently running will be interrupted
  2. The session does not automatically end once your script is finished running, and it may be idle for extended periods. Therefore, interactive sessions are less efficient and should only be used when necessary.

Attaching to Running Jobs

If you want to monitor the resource utilization of your experiment, it is necessary to connect to the node allocated to your job. The following command allows you to attach to a running job from a login node:

srun --jobid=<JOB_ID> --overlap --pty /bin/bash -l

After replacing <JOB_ID> with your specific job ID (can be obtained via squeue), you can use tools like nvidia-smi or htop to check how the available resources are utilized.


Best Practices


Test Your Code

Make sure your code runs before scheduling large-scale experiments (e.g., test locally or with a small subset of the data). It is frustrating to have jobs fail midway, especially when the cluster is busy and allocation might take a while (also, we don’t want to waste our scarce GPU resources 😢).


Monitor Your Resource Utilization

Even when your code runs without errors and produces the expected results, it is still possible that it doesn’t make use of the hardware resources effectively. To diagnose such issues, you can use ClusterCockpit, which shows you the resource usage over the runtime of your jobs (e.g., CPU/GPU load and memory usage). That way, you can identify potential bottlenecks and implement your approach more efficiently. To launch it for the first time, go to User > Your Accounts > External Tools in the HPC-Portal.


Only Request Resources You Need

Make sure to only request as many resources as you actually need. Besides not taking up GPUs that others could use, there is also something in it for you: the shorter the job and the fewer resources required, the faster resources will be allocated. That said, requesting too little (especially runtime) may cause your experiments to be terminated too early or fail due to out-of-memory (OOM) errors. It is therefore important to know the requirements of your setup and to factor in a little leeway.


Use Version Control

Although some of the filesystems are backed up and snapshotted regularly, a risk of losing data remains. Therefore, be sure to regularly commit and push changes made to your codebase (e.g., via GitHub).


Data Staging

Staging your data can be helpful when working with models or data that require frequent IO operations (e.g., regularly loading many files). At the beginning of your job, copy/extract your data to the node-local and very fast $TMPDIR directory before executing your experiments. Compared to directly loading from the original directory (e.g., $WORK), this additional step can significantly speed up some workflows and improve resource utilization.
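
A minimal sketch of this pattern inside a job script, assuming a (hypothetical) dataset archive stored under $WORK:

# Stage the data to the node-local scratch space at the beginning of the job
cp $WORK/datasets/my_dataset.tar $TMPDIR/
tar -xf $TMPDIR/my_dataset.tar -C $TMPDIR/
# Point the training script at the staged copy instead of $WORK
python train.py --data_dir $TMPDIR/my_dataset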


Troubleshooting and FAQs

Let me know if you encounter issues that other people might have, and I can add them here.


Acknowledgements

This tutorial and the contents provided are partially built upon the following resources:

Thanks to Lukas Knobel for helpful comments and discussions.


Now that we ended …

Your feedback is important! If you encounter any issues while following the tutorial or have suggestions for improvements, please don’t hesitate to reach out. Additionally, we’d love to hear your ideas on topics you’d like to see covered in future tutorials or documentation. You can find my contact details here. Thank you for using this resource, and I hope it helped you get started using clusters for your own projects.


Slides available here
Practical resources available here