{"id":1432,"date":"2025-09-14T17:47:38","date_gmt":"2025-09-14T17:47:38","guid":{"rendered":"https:\/\/ascnet.ie\/HPC_Research\/?page_id=1432"},"modified":"2025-09-15T15:41:03","modified_gmt":"2025-09-15T15:41:03","slug":"slurm-guide","status":"publish","type":"page","link":"https:\/\/ascnet.ie\/HPC_Research\/slurm-guide\/","title":{"rendered":"HPC Services SLURM Guide"},"content":{"rendered":"\n<body class=\"content-hidden\">\n    <header class=\"tu-header\">\n        <h1>HPC Services SLURM Guide<\/h1>\n        <div class=\"last-update\">Last Update: <span id=\"last-update\">Loading&#8230;<\/span><\/div>\n    <\/header>\n    \n    <main class=\"dashboard-container\">\n        <div class=\"internal-nav\">\n            <a href=\"\/HPC_Research\/overview\" class=\"nav-item\">Overview<\/a>\n            <a href=\"\/HPC_Research\/access-guide\" class=\"nav-item\">Access Guide<\/a>\n            <a href=\"\/HPC_Research\/slurm-guide\" class=\"nav-item active\">SLURM Guide<\/a>\n            <a href=\"\/HPC_Research\/dashboard\" class=\"nav-item\">Cluster Dashboard<\/a>\n            <a href=\"\/HPC_Research\/leased-resources\" class=\"nav-item\">Leased Resources<\/a>\n        <\/div>\n\n        <div id=\"content-container\">\n            <div class=\"access-message\">\n                <h2>Checking access to HPC resources&#8230;<\/h2>\n                <div class=\"loading-spinner\"><\/div>\n                <p>Connecting to HPC monitoring system&#8230;<\/p>\n            <\/div>\n        <\/div>\n    <\/main>\n\n    <footer class=\"tu-footer\">\n        <p>TU Dublin HPC Monitor &#8211; Internal Use Only<\/p>\n    <\/footer>\n\n    <template id=\"hpc-content-template\">\n        <section class=\"card\">\n            <h3>Introduction to SLURM<\/h3>\n            <p>SLURM (Simple Linux Utility for Resource Management) is a workload manager used in the TU Dublin HPC cluster. 
It helps allocate resources, schedule jobs, and manage the execution of tasks across the cluster.<\/p>\n            \n            <div class=\"warning-box\">\n                <p><strong>Important:<\/strong> All experiments must be launched via the <code>sbatch<\/code> command. NEVER run your own scripts directly on the command line &#8211; doing so bypasses the scheduler, competes with other users for resources, and leads to poor resource utilization.<\/p>\n            <\/div>\n            \n            <p>SLURM offers several benefits:<\/p>\n            <ul>\n                <li>Efficient resource allocation across multiple users<\/li>\n                <li>Job scheduling based on resource availability<\/li>\n                <li>Fair usage policies and prioritization<\/li>\n                <li>Detailed monitoring and statistics<\/li>\n            <\/ul>\n        <\/section>\n        \n        <section class=\"card\">\n            <h3>Getting Started with SLURM<\/h3>\n            <p>Before running your actual experiments, it&#8217;s good practice to run a simple test to ensure SLURM is working properly.<\/p>\n            \n            <div class=\"guide-section\">\n                <h4><span class=\"step-number\">1<\/span> Create a Test Script<\/h4>\n                <p>Create a file called <code>launch_test.sh<\/code> with the following content:<\/p>\n                <pre>#!\/bin\/sh\n#SBATCH --job-name=test\n#SBATCH --mem=100\n#SBATCH --cpus-per-task=1\nsrun sleep 60<\/pre>\n                <p>This script simply asks SLURM to run the Linux <code>sleep<\/code> command for 60 seconds. 
The SBATCH lines at the top provide instructions to SLURM:<\/p>\n                <ul>\n                    <li><code>--job-name<\/code>: A name to identify your job<\/li>\n                    <li><code>--mem<\/code>: Amount of memory needed in MB<\/li>\n                    <li><code>--cpus-per-task<\/code>: Number of CPU cores needed<\/li>\n                <\/ul>\n                <p>The <code>srun<\/code> command triggers the actual job execution.<\/p>\n            <\/div>\n            \n            <div class=\"guide-section\">\n                <h4><span class=\"step-number\">2<\/span> Submit the Test Job<\/h4>\n                <p>Submit the job to SLURM using:<\/p>\n                <div class=\"command-example\">\n                    sbatch launch_test.sh\n                <\/div>\n                <p>If successful, SLURM will return a job ID number.<\/p>\n            <\/div>\n            \n            <div class=\"guide-section\">\n                <h4><span class=\"step-number\">3<\/span> Check Job Status<\/h4>\n                <p>Monitor your job using:<\/p>\n                <div class=\"command-example\">\n                    squeue\n                <\/div>\n                <p>This shows all jobs in the queue, including yours. 
You should see your job with status &#8220;R&#8221; (running) or &#8220;PD&#8221; (pending).<\/p>\n            <\/div>\n            \n            <div class=\"guide-section\">\n                <h4><span class=\"step-number\">4<\/span> Check Cluster Status<\/h4>\n                <p>To see the status of the compute nodes:<\/p>\n                <div class=\"command-example\">\n                    sinfo\n                <\/div>\n                <p>This shows the available partitions and node states (up, down, allocated, etc.).<\/p>\n            <\/div>\n        <\/section>\n        \n        <section class=\"card\">\n            <h3>Running Jobs with SLURM<\/h3>\n            <p>For real-world jobs, you&#8217;ll typically create more complex SBATCH scripts.<\/p>\n            \n            <div class=\"guide-section\">\n                <h4>Python Experiments Template<\/h4>\n                <p>Here&#8217;s a template for running Python experiments:<\/p>\n                <pre>#!\/bin\/sh\n#SBATCH --job-name=my_experiment\n#SBATCH --gres=gpu:1\n#SBATCH --mem=8000\n#SBATCH --cpus-per-task=4\n#SBATCH --partition=medium-g2\n\n# Activate your virtual environment\n. 
\/path\/to\/your\/venv\/bin\/activate\n\n# Run your Python script\npython -u your_script.py --arg1=value --arg2=value<\/pre>\n                <p>Key parameters explained:<\/p>\n                <ul>\n                    <li><code>--job-name<\/code>: Name for your job<\/li>\n                    <li><code>--gres=gpu:1<\/code>: Request 1 GPU (remove if not needed)<\/li>\n                    <li><code>--mem=8000<\/code>: Request 8GB of RAM<\/li>\n                    <li><code>--cpus-per-task=4<\/code>: Request 4 CPU cores<\/li>\n                    <li><code>--partition=medium-g2<\/code>: Target specific partition<\/li>\n                <\/ul>\n                <div class=\"note-box\">\n                    <p>The <code>-u<\/code> flag for Python makes the output unbuffered, so you can see print statements in real-time in the output file.<\/p>\n                <\/div>\n            <\/div>\n            \n            <div class=\"guide-section\">\n                <h4>Key SLURM Parameters<\/h4>\n                <div class=\"resource-table\">\n                    <table>\n                        <thead>\n                            <tr>\n                                <th>Parameter<\/th>\n                                <th>Description<\/th>\n                                <th>Example<\/th>\n                            <\/tr>\n                        <\/thead>\n                        <tbody>\n                            <tr>\n                                <td><code>--job-name<\/code><\/td>\n                                <td>Name to identify your job<\/td>\n                                <td><code>--job-name=resnet_training<\/code><\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>--gres<\/code><\/td>\n                                <td>Generic resource request (for GPUs)<\/td>\n                                <td><code>--gres=gpu:1<\/code><\/td>\n                            <\/tr>\n                            <tr>\n                    
            <td><code>--mem<\/code><\/td>\n                                <td>Memory requirement in MB<\/td>\n                                <td><code>--mem=16000<\/code><\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>--cpus-per-task<\/code><\/td>\n                                <td>Number of CPU cores<\/td>\n                                <td><code>--cpus-per-task=4<\/code><\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>--partition<\/code><\/td>\n                                <td>Specific node group to target<\/td>\n                                <td><code>--partition=medium-g2<\/code><\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>--output<\/code><\/td>\n                                <td>Output file path (<code>%j<\/code> expands to the job ID)<\/td>\n                                <td><code>--output=results\/%j.out<\/code><\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>--error<\/code><\/td>\n                                <td>Error file path<\/td>\n                                <td><code>--error=results\/%j.err<\/code><\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>--time<\/code><\/td>\n                                <td>Time limit (HH:MM:SS)<\/td>\n                                <td><code>--time=24:00:00<\/code><\/td>\n                            <\/tr>\n                        <\/tbody>\n                    <\/table>\n                <\/div>\n            <\/div>\n        <\/section>\n        \n        <section class=\"card\">\n            <h3>Virtual Environments and Python<\/h3>\n            <p>Since the cluster is shared among many users with different library needs, use virtual environments to manage your Python dependencies.<\/p>\n            \n            <div class=\"guide-section\">\n                <h4>Creating a Virtual 
Environment<\/h4>\n                <p>Create a new virtual environment in your home directory using:<\/p>\n                <div class=\"command-example\">\n                    python3.10 -m venv ~\/my_environment\n                <\/div>\n                <p>Activate the environment:<\/p>\n                <div class=\"command-example\">\n                    source ~\/my_environment\/bin\/activate\n                <\/div>\n                <p>Update pip and install your packages:<\/p>\n                <div class=\"command-example\">\n                    python -m pip install --upgrade pip\n                    python -m pip install tensorflow torch numpy pandas\n                <\/div>\n                <div class=\"note-box\">\n                    <p>In your SLURM script, activate the environment using <code>. ~\/my_environment\/bin\/activate<\/code> (note the dot instead of &#8220;source&#8221;).<\/p>\n                <\/div>\n            <\/div>\n            \n            <div class=\"guide-section\">\n                <h4>GPU Support<\/h4>\n                <p>Modern deep learning frameworks like TensorFlow and PyTorch include GPU support by default. 
On Linux, installing the standard pip packages is normally enough to enable GPU support:<\/p>\n                <div class=\"command-example\">\n                    python -m pip install tensorflow\n                    python -m pip install torch\n                <\/div>\n                <p>To verify your code is using the GPU, look for CUDA-related messages in the output or check explicitly:<\/p>\n                <pre>import tensorflow as tf\nprint(tf.config.list_physical_devices('GPU'))\n\nimport torch\nprint(torch.cuda.is_available())\nprint(torch.cuda.device_count())<\/pre>\n            <\/div>\n        <\/section>\n        \n        <section class=\"card\">\n            <h3>SLURM Partitions<\/h3>\n            <p>The cluster is divided into partitions based on hardware capabilities:<\/p>\n            \n            <div class=\"resource-table\">\n                <table>\n                    <thead>\n                        <tr>\n                            <th>Partition<\/th>\n                            <th>Description<\/th>\n                            <th>Use Case<\/th>\n                        <\/tr>\n                    <\/thead>\n                    <tbody>\n                        <tr>\n                            <td>small-g1<\/td>\n                            <td>Older GPUs with less memory<\/td>\n                            <td>Testing, small models<\/td>\n                        <\/tr>\n                        <tr>\n                            <td>medium-g1<\/td>\n                            <td>Older GPUs with medium memory<\/td>\n                            <td>Medium-sized models, testing<\/td>\n                        <\/tr>\n                        <tr>\n                            <td>small-g2<\/td>\n                            <td>Modern GPUs with less memory<\/td>\n                            <td>Production, smaller models<\/td>\n                        <\/tr>\n                        <tr>\n                            <td>medium-g2<\/td>\n                            
<td>Modern GPUs with 8-12GB memory<\/td>\n                            <td>Production, medium models<\/td>\n                        <\/tr>\n                        <tr>\n                            <td>large-g2<\/td>\n                            <td>Modern GPUs with 16+ GB memory<\/td>\n                            <td>Large models, high memory tasks<\/td>\n                        <\/tr>\n                        <tr>\n                            <td>DEV<\/td>\n                            <td>Development partition<\/td>\n                            <td>Code testing and debugging<\/td>\n                        <\/tr>\n                    <\/tbody>\n                <\/table>\n            <\/div>\n            <p>For initial testing, use <code>--partition=DEV<\/code> or <code>--partition=small-g1<\/code>. For production runs, choose a suitable g2 partition based on your memory requirements.<\/p>\n        <\/section>\n        \n        <section class=\"card\">\n            <h3>Best Practices<\/h3>\n            <ul>\n                <li><strong>Request only what you need<\/strong>: Any resources you request will be exclusive to your job and unavailable to others<\/li>\n                <li><strong>Check before requesting GPUs<\/strong>: Not all code benefits from GPUs &#8211; only request them if your code is designed to use them<\/li>\n                <li><strong>Use virtualenv\/venv<\/strong>: Always use virtual environments for Python dependencies<\/li>\n                <li><strong>Limit to 2 jobs<\/strong>: Current fair usage policy allows up to 2 running jobs and 2 queued jobs per user<\/li>\n                <li><strong>Output unbuffered<\/strong>: Use <code>python -u<\/code> for real-time output<\/li>\n                <li><strong>Save results frequently<\/strong>: Jobs may be terminated if they exceed time limits<\/li>\n                <li><strong>Test locally first<\/strong>: Debug on your local machine before using cluster resources<\/li>\n            <\/ul>\n        
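    <div class=\"guide-section\">\n                <h4>Putting It Together<\/h4>\n                <p>The practices above can be combined into a single submission script. This is a minimal sketch &#8211; the partition, resource figures, environment path, and <code>train.py<\/code> are placeholders to adapt to your own job:<\/p>\n                <pre>#!\/bin\/sh\n#SBATCH --job-name=example_run\n#SBATCH --gres=gpu:1\n#SBATCH --mem=8000\n#SBATCH --cpus-per-task=4\n#SBATCH --partition=medium-g2\n#SBATCH --time=24:00:00\n#SBATCH --output=results\/%j.out\n#SBATCH --error=results\/%j.err\n\n# Activate the virtual environment (note the leading dot)\n. ~\/my_environment\/bin\/activate\n\n# Unbuffered output so prints appear in results\/%j.out as they happen\npython -u train.py<\/pre>\n                <p>Submit it with <code>sbatch<\/code> as usual; <code>%j<\/code> expands to the job ID, so each run gets its own output files. Create the <code>results<\/code> directory before submitting, as SLURM will not create it for you.<\/p>\n            <\/div>\n        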
<\/section>\n        \n        <section class=\"card\">\n            <h3>Troubleshooting<\/h3>\n            \n            <div class=\"guide-section\">\n                <h4>Common Issues<\/h4>\n                <ul>\n                    <li><strong>Job stays in pending state<\/strong>: Resources may not be available &#8211; check the reason <code>squeue<\/code> reports in the NODELIST(REASON) column (e.g. Resources, Priority)<\/li>\n                    <li><strong>Out of memory errors<\/strong>: Request more memory with <code>--mem<\/code><\/li>\n                    <li><strong>Job not using GPU<\/strong>: Verify your code is properly configured for GPU usage<\/li>\n                    <li><strong>Python errors not showing<\/strong>: Use <code>python -u<\/code> for unbuffered output<\/li>\n                    <li><strong>Job terminated unexpectedly<\/strong>: Check <code>scontrol show job JOBID<\/code> for details<\/li>\n                <\/ul>\n            <\/div>\n            \n            <div class=\"guide-section\">\n                <h4>Useful Commands<\/h4>\n                <div class=\"resource-table\">\n                    <table>\n                        <thead>\n                            <tr>\n                                <th>Command<\/th>\n                                <th>Description<\/th>\n                            <\/tr>\n                        <\/thead>\n                        <tbody>\n                            <tr>\n                                <td><code>sbatch script.sh<\/code><\/td>\n                                <td>Submit a job<\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>squeue<\/code><\/td>\n                                <td>View all jobs in queue<\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>squeue -u username<\/code><\/td>\n                                <td>View your jobs only<\/td>\n                            <\/tr>\n                            
<tr>\n                                <td><code>scancel JOBID<\/code><\/td>\n                                <td>Cancel a specific job<\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>scancel -u username<\/code><\/td>\n                                <td>Cancel all your jobs<\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>sinfo<\/code><\/td>\n                                <td>View partition and node information<\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>scontrol show job JOBID<\/code><\/td>\n                                <td>View detailed information about a job<\/td>\n                            <\/tr>\n                            <tr>\n                                <td><code>scontrol show node nodename<\/code><\/td>\n                                <td>View detailed information about a node<\/td>\n                            <\/tr>\n                        <\/tbody>\n                    <\/table>\n                <\/div>\n            <\/div>\n        <\/section>\n    <\/template>\n<\/body>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>HPC Services SLURM Guide Last Update: Loading&#8230; Overview Access Guide SLURM Guide Cluster Dashboard Leased Resources Checking access to HPC resources&#8230; Connecting to HPC monitoring system&#8230; TU Dublin HPC Monitor &#8211; Internal Use Only Introduction to SLURM SLURM (Simple Linux Utility for Resource Management) is a workload manager used in the TU Dublin HPC cluster. 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"templates\/hpc-fullwidth-template.php","meta":{"zakra_sidebar_layout":"customizer","zakra_remove_content_margin":false,"zakra_sidebar":"customizer","zakra_transparent_header":"customizer","zakra_logo":0,"zakra_main_header_style":"default","zakra_menu_item_color":"","zakra_menu_item_hover_color":"","zakra_menu_item_active_color":"","zakra_menu_active_style":"","zakra_page_header":true,"footnotes":""},"class_list":["post-1432","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/ascnet.ie\/HPC_Research\/wp-json\/wp\/v2\/pages\/1432","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ascnet.ie\/HPC_Research\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/ascnet.ie\/HPC_Research\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/ascnet.ie\/HPC_Research\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/ascnet.ie\/HPC_Research\/wp-json\/wp\/v2\/comments?post=1432"}],"version-history":[{"count":4,"href":"https:\/\/ascnet.ie\/HPC_Research\/wp-json\/wp\/v2\/pages\/1432\/revisions"}],"predecessor-version":[{"id":1475,"href":"https:\/\/ascnet.ie\/HPC_Research\/wp-json\/wp\/v2\/pages\/1432\/revisions\/1475"}],"wp:attachment":[{"href":"https:\/\/ascnet.ie\/HPC_Research\/wp-json\/wp\/v2\/media?parent=1432"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}