SLURM (Simple Linux Utility for Resource Management) is a workload manager used in the TU Dublin HPC cluster. It helps allocate resources, schedule jobs, and manage the execution of tasks across the cluster.
Important: All experiments should be launched via the sbatch command. NEVER run your own scripts directly on the command line of the login node – they will bypass the scheduler, compete for the login node's limited resources, and lead to poor resource utilization.
SLURM offers several benefits: fair sharing of the cluster among users, automatic queueing and scheduling of jobs, and tools for monitoring and controlling running work.
Before running your actual experiments, it’s good practice to run a simple test to ensure SLURM is working properly.
Create a file called launch_test.sh with the following content:
```sh
#!/bin/sh
#SBATCH --job-name=test
#SBATCH --mem=100
#SBATCH --cpus-per-task=1

srun sleep 60
```
This script simply asks SLURM to run the Linux sleep command for 60 seconds. The SBATCH lines at the top provide instructions to SLURM:
- `--job-name`: A name to identify your job
- `--mem`: Amount of memory needed in MB
- `--cpus-per-task`: Number of CPU cores needed

The `srun` command triggers the actual job execution.
Submit the job to SLURM using:
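```shell
sbatch launch_test.sh
```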
If successful, SLURM will return a job ID number.
Monitor your job using:
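```shell
squeue
```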
This shows all jobs in the queue, including yours. You should see your job with status “R” (running) or “PD” (pending).
To see the status of the compute nodes:
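```shell
sinfo
```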
This shows the available partitions and node states (up, down, allocated, etc.).
For real-world jobs, you’ll typically create more complex SBATCH scripts.
Here’s a template for running Python experiments:
```sh
#!/bin/sh
#SBATCH --job-name=my_experiment
#SBATCH --gres=gpu:1
#SBATCH --mem=8000
#SBATCH --cpus-per-task=4
#SBATCH --partition=medium-g2

# Activate your virtual environment
. /path/to/your/venv/bin/activate

# Run your Python script
python -u your_script.py --arg1=value --arg2=value
```
Key parameters explained:
- `--job-name`: Name for your job
- `--gres=gpu:1`: Request 1 GPU (remove if not needed)
- `--mem=8000`: Request 8GB of RAM
- `--cpus-per-task=4`: Request 4 CPU cores
- `--partition=medium-g2`: Target a specific partition

The `-u` flag for Python makes the output unbuffered, so you can see print statements in real time in the output file.
| Parameter | Description | Example |
|---|---|---|
| `--job-name` | Name to identify your job | `--job-name=resnet_training` |
| `--gres` | Generic resource request (for GPUs) | `--gres=gpu:1` |
| `--mem` | Memory requirement in MB | `--mem=16000` |
| `--cpus-per-task` | Number of CPU cores | `--cpus-per-task=4` |
| `--partition` | Specific node group to target | `--partition=medium-g2` |
| `--output` | Output file path | `--output=results/%j.out` |
| `--error` | Error file path | `--error=results/%j.err` |
| `--time` | Time limit (HH:MM:SS) | `--time=24:00:00` |
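Putting several of these together, a job header might look like the following (the values are illustrative; `%j` expands to the job ID, so each run writes to its own files):

```sh
#!/bin/sh
#SBATCH --job-name=resnet_training
#SBATCH --gres=gpu:1
#SBATCH --mem=16000
#SBATCH --cpus-per-task=4
#SBATCH --partition=medium-g2
#SBATCH --output=results/%j.out
#SBATCH --error=results/%j.err
#SBATCH --time=24:00:00
```

Note that the `results/` directory must exist before the job starts, or SLURM will fail to write the output files.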
Since the cluster is shared among many users with different library needs, use virtual environments to manage your Python dependencies.
Create a new virtual environment using:
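```shell
python3 -m venv ~/my_environment
```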
Activate the environment:
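```shell
source ~/my_environment/bin/activate
```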
Update pip and install your packages:
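```shell
pip install --upgrade pip
pip install numpy pandas   # example packages; install whatever your project needs
```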
In your SLURM script, activate the environment using . ~/my_environment/bin/activate. Note the dot instead of "source": the scripts use #!/bin/sh, and . is the portable POSIX equivalent of Bash's source.
Modern deep learning frameworks like TensorFlow and PyTorch include GPU support by default. Simply installing them with pip will enable GPU functionality:
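```shell
pip install tensorflow   # Linux wheels include GPU support
pip install torch        # default Linux wheels include CUDA
```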
To verify your code is using the GPU, look for CUDA-related messages in the output or check explicitly:
```python
# TensorFlow: list the GPUs visible to the framework
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
```

```python
# PyTorch: check CUDA availability and count GPUs
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
```
The cluster is divided into partitions based on hardware capabilities:
| Partition | Description | Use Case |
|---|---|---|
| small-g1 | Older GPUs with less memory | Testing, small models |
| medium-g1 | Older GPUs with medium memory | Medium-sized models, testing |
| small-g2 | Modern GPUs with less memory | Production, smaller models |
| medium-g2 | Modern GPUs with 8-12GB memory | Production, medium models |
| large-g2 | Modern GPUs with 16+ GB memory | Large models, high memory tasks |
| DEV | Development partition | Code testing and debugging |
For initial testing, use --partition=DEV or --partition=small-g1. For production runs, choose a suitable g2 partition based on your memory requirements.
Key takeaways:

- Use `python -u` for real-time (unbuffered) output.
- Monitor your jobs with `squeue`.
- Request sufficient memory with `--mem`.
- Use `scontrol show job JOBID` for detailed job information.

| Command | Description |
|---|---|
| `sbatch script.sh` | Submit a job |
| `squeue` | View all jobs in queue |
| `squeue -u username` | View your jobs only |
| `scancel JOBID` | Cancel a specific job |
| `scancel -u username` | Cancel all your jobs |
| `sinfo` | View partition and node information |
| `scontrol show job JOBID` | View detailed information about a job |
| `scontrol show node nodename` | View detailed information about a node |