Compute Architecture
At a high level, a cluster can be thought of as a set of individual computers (nodes) that are connected through a network and can exchange information. Each node has its own CPU, memory, and storage. A cluster is typically made up of at least two types of nodes:
Login Nodes
The login nodes serve as your entry point to the cluster; you are automatically connected to one of them when you connect to the cluster. They provide a basic interface for interacting with the cluster (e.g., file management, coding, submitting jobs). However, they are not intended for serious computational workloads or for running jobs. Since they are shared among many users, be mindful and avoid running resource-intensive processes on them.
Compute Nodes
The compute/GPU nodes are where computationally demanding tasks are executed. Connecting directly to a compute node is usually not possible; you must first reserve the resources by scheduling a job from a login node.
The scheduler (e.g., SLURM) manages and allocates the cluster's resources. User requests are queued and prioritized based on factors such as requested runtime, resource requirements, and time already spent in the queue. Once sufficient resources become available, the appropriate node(s) are allocated and run the user-specified code.
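Once a job starts, SLURM communicates the granted allocation to your code through environment variables. A minimal sketch of inspecting them from inside a job; the variable names are standard SLURM ones, but which of them are actually set depends on what your job requested:

```python
import os

# SLURM exports metadata about the current allocation as environment
# variables; which ones are set depends on how the job was requested.
for var in (
    "SLURM_JOB_ID",          # unique ID of this job
    "SLURM_JOB_NODELIST",    # node(s) allocated to the job
    "SLURM_CPUS_PER_TASK",   # CPU cores granted per task
    "SLURM_GPUS_ON_NODE",    # number of GPUs granted on this node
    "CUDA_VISIBLE_DEVICES",  # GPU indices visible to CUDA applications
):
    print(f"{var} = {os.environ.get(var, '<not set>')}")
```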
The GPU nodes you have access to on Alex have the following specs:
- 2x AMD EPYC CPU (64 cores each @ 2.0 GHz)
- 8x NVIDIA A40 GPU (48 GB VRAM each)
- 512 GB RAM
- 7 TB of node-local storage
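Once a job is running on one of these nodes, you can verify what was actually granted to you from within your program. A minimal sketch, assuming PyTorch is available in your environment:

```python
import torch

# A job only sees the GPUs it was allocated, not all eight on the node.
print(f"visible GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB VRAM")
```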
Typically, compute nodes are further divided into smaller compute units. These units represent a subset of the available resources (such as a specific number of CPU cores or GPUs) and are allocated based on user requests. For example, depending on the task at hand, our compute node with 8 GPUs can be partitioned into up to 8 smaller units (1 GPU, 16 CPU cores, 64 GB RAM, and ~1 TB of SSD storage each). This partitioning allows multiple users to share the same node without interfering with each other's work, optimizing resource utilization.
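The per-unit numbers above follow directly from splitting the node's totals evenly across its 8 GPUs; a short sketch of the arithmetic:

```python
# Per-unit share when an 8-GPU node is split into one unit per GPU.
# Node totals are taken from the spec list above.
NODE_CORES = 2 * 64   # 2x AMD EPYC, 64 cores each
NODE_RAM_GB = 512
NODE_SSD_TB = 7
UNITS = 8             # one unit per GPU

print(f"cores per unit: {NODE_CORES // UNITS}")         # 16
print(f"RAM per unit:   {NODE_RAM_GB // UNITS} GB")     # 64
print(f"SSD per unit:   {NODE_SSD_TB / UNITS:.2f} TB")  # 0.88, i.e. roughly 1 TB
```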