# Slurm Workload Manager
  
[Slurm Workload Manager](https://slurm.schedmd.com/overview.html) is an open source [job scheduler](https://en.wikipedia.org/wiki/Job_scheduler) that controls programs executed in the background. These background-executed programs are called **jobs**. The user defines a job with various parameters, including the run time, the number of tasks, the number of required CPU cores and the amount of required memory (RAM), and specifies which program(s) to execute. Such jobs are called batch jobs. Batch jobs are submitted to a common job queue (partition) that is shared with other users, and Slurm executes the submitted jobs automatically in turn. After a job completes (or an error occurs), Slurm can optionally notify the user by email. In addition to batch jobs, the user can reserve a compute node for an interactive job: you wait for your turn in the queue, and on your turn you are placed on your reserved node, where you can execute commands. After the reserved time is over, your session is terminated.
  
{{:guides:slurm:slurm.png}}
  
## Slurm Partitions on sampo.uef.fi
  
- **serial**. 4 out of 6 nodes. Maximum run time 3 days.
- **longrun**. 2 out of 6 nodes. Maximum run time 14 days.
- **parallel**. 2 out of 6 nodes. Maximum run time 3 days.
- **gpu**. 2 nodes. Maximum run time 3 days.
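
The configured limits can also be checked on the cluster itself with **[sinfo](https://slurm.schedmd.com/sinfo.html)** (standard Slurm; the format string merely selects the partition, time limit and node count columns):

```
# list each partition with its time limit and node count
sinfo -o "%P %l %D"
```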
  
## Explanation of the partitions
  
Compute nodes are grouped into multiple partitions, and each partition can be considered a job queue. Partitions can have multiple constraints and restrictions. For example, access to certain partitions can be limited to specific users/groups, or the maximum running time can be restricted.

**Serial** partition is the default partition for all jobs that the user submits. The user can reserve a maximum of 1 node per job. The default run time is 5 minutes and the maximum is 3 days.
  
**Longrun** partition is for long-running jobs; only one node is available for this usage. The default run time is 5 minutes and the maximum is 14 days.
  
**Parallel** partition is for parallel jobs that can span multiple nodes (MPI jobs, for example). The user can reserve 2 nodes (both the minimum and the maximum). The default run time is 5 minutes and the maximum is 3 days.

**GPU** partition is for GPU jobs (CUDA jobs). The user can reserve 2 nodes with 8x NVIDIA A100/40 GB GPUs. The default run time is 5 minutes and the maximum is 3 days.
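
A GPU job normally has to request the GPU devices in addition to the partition. Below is a minimal sketch of a GPU batch script; the generic-resource name in `--gres=gpu:1` is an assumption about the cluster's gres configuration, not something stated on this page:

```
#!/bin/bash
#SBATCH --partition gpu   # Submit to the gpu partition
#SBATCH --gres=gpu:1      # Request one GPU (the gres name "gpu" is an assumption)
#SBATCH --time 60         # Runtime in minutes

nvidia-smi                # Show the GPU(s) allocated to the job
```
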
## Using R with Slurm

Example script (**hello.R**):

```
# Define a function that prints a greeting, then call it
sayHello <- function(){
  print("hello")
}
sayHello()
```

The user can execute R scripts from the command line with either of the following commands:

1. R CMD BATCH script.R
2. Rscript script.R

Note: with the **R CMD BATCH** command, the output of the R script is redirected to a file instead of the screen.
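
For example, with the **hello.R** script above (the `.Rout` file name is the default used by **R CMD BATCH**):

```
# prints "hello" to the terminal
Rscript hello.R

# prints nothing; the output is written to hello.Rout
R CMD BATCH hello.R
```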

Next, the user must embed the script in the **Slurm batch job file/control file** (**submit.sbatch**):

```
#!/bin/bash
#SBATCH --job-name helloworld        # Name for your job
#SBATCH --ntasks 1                   # Number of tasks
#SBATCH --time 5                     # Runtime in minutes
#SBATCH --mem=2000                   # Reserve 2000 MB (2 GB) of RAM for the job
#SBATCH --partition serial           # Partition to submit to
#SBATCH --output hello.out           # Standard output goes to this file
#SBATCH --error hello.err            # Standard error goes to this file
#SBATCH --mail-user username@uef.fi  # The email address to be notified at
#SBATCH --mail-type ALL              # ALL alerts you of job beginning, completion, failure etc.

module load r    # Load modules

Rscript hello.R  # Execute the script
```

The user can submit the job to the compute queue with the **[sbatch](https://slurm.schedmd.com/sbatch.html)** command. Note that the batch file (as well as the R script and data) must be located on the /home/ disk.

```
sbatch submit.sbatch
```
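
If you want to capture the job ID in a shell variable (for example, to pass it to the monitoring commands below), the `--parsable` option of sbatch prints only the ID:

```
# --parsable makes sbatch print just the job ID
JOBID=$(sbatch --parsable submit.sbatch)
echo "Submitted job $JOBID"
```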

The user can monitor the progress of the job with the **[squeue](https://slurm.schedmd.com/squeue.html)** command. The JOBID is printed by the sbatch command when the job is submitted.

```
squeue -j JOBID
```
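
To list all of your own jobs instead of a single one, squeue can filter by user:

```
# show all queued and running jobs of the current user
squeue -u $USER
```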

While the job is running, the user can also log in to the executing compute node with the ssh command. When the job is over, the ssh session is terminated.

```
ssh sampo1
```
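
The node name to use with ssh can be read from squeue's NODELIST column, for example:

```
# print the node(s) the job is running on
squeue -j JOBID -o %N
```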

## Interactive session

The user can request an interactive session for whatever purpose. For this to be effective, a free node is more or less required. The following command will open a bash session on any free node in the serial partition for the next 5 minutes.

```
srun -p serial --pty -t 0-00:05 /bin/bash
```
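
The same resource options as in batch scripts also apply to srun. For example, a one-hour interactive session with one task and 4 GB of RAM could be requested as follows (the values are only illustrative):

```
# 1 task, 4000 MB of RAM, 1 hour, interactive bash shell
srun -p serial --ntasks 1 --mem=4000 -t 0-01:00 --pty /bin/bash
```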

## Slurm job efficiency report (seff) and accounting

Slurm can provide the user with various job statistics, such as memory usage and CPU time. For example, with **seff** (Slurm job efficiency report) it is possible to check how efficiently the job ran.

```
seff JOBID
```

It is particularly useful to add the following line to the end of the sbatch script:

```
seff $SLURM_JOBID
```
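
With the **hello.R** example above, the end of **submit.sbatch** would then look like this (the report ends up in the job's output file, hello.out):

```
module load r      # load modules

Rscript hello.R    # Execute the script

seff $SLURM_JOBID  # Print the efficiency report for this job
```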
  
If you wish to have more detailed information, use the **[sacct](https://slurm.schedmd.com/sacct.html)** command:

```
# show all own jobs contained in the accounting database
sacct
# show a specific job
sacct -j JOBID
# specify the fields to display
sacct -j JOBID -o JobName,MaxRSS,MaxVMSize,CPUTime,ConsumedEnergy
# show all fields
sacct -j JOBID -o ALL
```