Compute Architecture
At a high level, a cluster can be thought of as a set of individual computers (nodes) that are connected through a network and can exchange information. Each node has its own CPU, memory, and storage. A cluster is typically made up of at least two types of nodes:
Login Nodes
The login nodes serve as your entry point to the cluster; you are automatically connected to one of them when you connect to the cluster. They provide a basic interface for interacting with the cluster (e.g., file management, coding, submitting jobs). However, they are not intended for serious computational workloads or for running jobs. Since they are shared among many users, be mindful and avoid running resource-intensive processes on them.
Compute Nodes
The compute/GPU nodes are where computationally demanding tasks are executed. Connecting directly to a compute node is usually not possible; you must first reserve the resources by scheduling a job from a login node.
The scheduler (e.g., SLURM) manages and allocates the cluster's resources. User requests are queued and prioritized based on factors such as requested runtime, resource requirements, and time already spent in the queue. Once sufficient resources become available, the appropriate node(s) are allocated and run the user-specified code.
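Once a job starts, SLURM communicates the granted allocation to your code through environment variables. A minimal sketch of inspecting them from inside a job; the variable names are standard SLURM ones, but which of them are actually set depends on what your job requested:

```python
import os

# SLURM exports metadata about the current allocation as environment
# variables; which ones are set depends on how the job was requested.
for var in (
    "SLURM_JOB_ID",          # unique ID of this job
    "SLURM_JOB_NODELIST",    # node(s) allocated to the job
    "SLURM_CPUS_PER_TASK",   # CPU cores granted per task
    "SLURM_GPUS_ON_NODE",    # number of GPUs granted on this node
    "CUDA_VISIBLE_DEVICES",  # GPU indices visible to CUDA applications
):
    print(f"{var} = {os.environ.get(var, '<not set>')}")
```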
The GPU nodes you have access to on Alex have the following specs:
- 2x AMD EPYC CPU (64 cores each @ 2.0 GHz)
- 8x NVIDIA A40 GPU (48 GB VRAM each)
- 512 GB RAM
- 7 TB of node-local storage
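Once a job is running on one of these nodes, you can verify what was actually granted to you from within your program. A minimal sketch, assuming PyTorch is available in your environment:

```python
import torch

# A job only sees the GPUs it was allocated, not all eight on the node.
print(f"visible GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB VRAM")
```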
Typically, compute nodes are further divided into smaller compute units. These units represent a subset of the available resources (such as a specific number of CPU cores or GPUs) and are allocated based on user requests. For example, depending on the task at hand, our compute node with 8 GPUs can be partitioned into up to 8 smaller units (1 GPU, 16 CPU cores, 64 GB RAM, and ~1 TB of SSD storage each). This partitioning allows multiple users to share the same node without interfering with each other's work, optimizing resource utilization.
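The per-unit numbers above follow directly from splitting the node's totals evenly across its 8 GPUs; a short sketch of the arithmetic:

```python
# Per-unit share when an 8-GPU node is split into one unit per GPU.
# Node totals are taken from the spec list above.
NODE_CORES = 2 * 64   # 2x AMD EPYC, 64 cores each
NODE_RAM_GB = 512
NODE_SSD_TB = 7
UNITS = 8             # one unit per GPU

print(f"cores per unit: {NODE_CORES // UNITS}")         # 16
print(f"RAM per unit:   {NODE_RAM_GB // UNITS} GB")     # 64
print(f"SSD per unit:   {NODE_SSD_TB / UNITS:.2f} TB")  # 0.88, i.e. roughly 1 TB
```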