== User environment ==
Zeroth, set up your own system (on Linux) by adding a host definition for circe to your $HOME/.ssh/config file. Then you don't have to keep typing circe's full hostname and your username when running ssh from Linux.
Host rc
    User <username on circe>
    Hostname rcslurm.rc.usf.edu
    ServerAliveInterval 30
    ServerAliveCountMax 120
    ForwardX11 yes
You may need:
mkdir -p $HOME/.ssh && chmod 700 $HOME/.ssh
vi $HOME/.ssh/config   # ':q' to exit
chmod 600 $HOME/.ssh/config
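With this in place, you can reach the cluster using the short host alias. A minimal usage sketch (the rc alias comes from the config above; the file name is a placeholder):
<source lang="bash">
# Connect to circe using the "rc" host alias from ~/.ssh/config
ssh rc

# Copy a job script to your home directory on the cluster (placeholder file name)
scp job.sh rc:
</source>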
The SLURM job status command is squeue. A helpful alias to monitor your own jobs is:
alias myq="squeue -u $USER"
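To have the alias available in every session, you can append it to your shell startup file; a small sketch assuming bash and ~/.bashrc:
<source lang="bash">
# Make the myq alias permanent (assumes bash; adjust for your shell)
echo 'alias myq="squeue -u $USER"' >> ~/.bashrc
source ~/.bashrc
</source>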
== Queue ==
Here's a basic template for queuing a job using MPI (here 2 whole nodes and 6h max run-time).
<source lang="bash">
#!/bin/bash
#SBATCH -J test
#SBATCH -N 2 -t 6:00:00

module load mpi/openmpi/1.4.5 compilers/intel/11.1.064

start=`date +%s`
mpirun parallel-executable
end=`date +%s`
echo "Job completed in $((end-start)) seconds."
</source>
By default, SLURM jobs start in the same directory from which sbatch was invoked.
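For jobs that don't use MPI, the same pattern works on a single core. A minimal sketch of a serial variant (serial-executable is a placeholder for your own program, and you may not need any module lines):
<source lang="bash">
#!/bin/bash
#SBATCH -J serial_test        # job name
#SBATCH -n 1                  # one task (single core)
#SBATCH -t 1:00:00            # 1 hour max run-time

# Run a single-process program (placeholder name)
./serial-executable
</source>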
A few more useful options are:
#SBATCH -o output_log_name.log
#SBATCH --mem=2000
The -o option specifies the name of the output log file instead of the default (names auto-generated from the job number, e.g. slurm-22425.out), and --mem requests memory for the job (in MB). By default, the log files include both standard output and standard error from the job.
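Putting these together, a job header might look like the sketch below (the values are illustrative; %j is SLURM's filename pattern for the job ID):
<source lang="bash">
#!/bin/bash
#SBATCH -J test
#SBATCH -N 2 -t 6:00:00
#SBATCH -o test_%j.log     # %j expands to the job ID, e.g. test_22425.log
#SBATCH --mem=2000         # request 2000 MB of memory
</source>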
Submit with
<source lang="bash">
sbatch -p saturn job.sh
</source>
The -p saturn option selects the default queue and can be left out. Other possible values select different node partitions:
- "jupiter": 444 cores; preemptable by "deadline" QOS (currently inactive).
- "saturn": 280 cores; default; preemptable by "deadline" QOS (currently inactive).
- "neptune": 168 cores; preemptable by "deadline" QOS (currently inactive).
- "hii_broad": 80 cores; testing "contributor" hardware pool; preemptable by "hii_broad" QOS (active).
- "titan": 16 cores; no preemption; 128 GB RAM for large memory jobs.
- "pluto": 8 cores; no preemption.
To check job execution and status, use squeue, or the myq command defined above as an alias (e.g. in your .bashrc).
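A couple of common invocations (the job ID shown is just the example number from earlier):
<source lang="bash">
myq                 # your own jobs (alias for: squeue -u $USER)
squeue -j 22425     # status of one specific job by ID
squeue -l           # long output format with time limits and job state details
</source>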
For more info, see LLNL's Slurm Quickstart Guide.
Here's a run-down of some of the environment variables available while the job is running (useful for scripting); see sbatch's manpage for more:
- SLURM_JOB_NAME - Name of the job.
- SLURM_JOB_ID - The ID of the job allocation.
- SLURM_CPUS_ON_NODE - Number of CPUS on the allocated node.
- SLURM_JOB_NODELIST - List of nodes allocated to the job in a compressed format.
- SLURM_JOB_NUM_NODES - Total number of nodes in the job’s resource allocation.
- SLURM_JOB_CPUS_PER_NODE - Count of processors available to the job on this node.
- SLURM_SUBMIT_DIR - The directory from which sbatch was invoked.
- SLURM_JOB_PARTITION - Name of the partition in which the job is running.
- SLURM_LOCALID - Node local task ID for the process within a job.
- SLURM_GTIDS - Global task IDs running on this node. Zero origin and comma separated.
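A short sketch of how these might be used inside a job script, e.g. to label log output and return to the submission directory:
<source lang="bash">
#!/bin/bash
#SBATCH -J env_demo

# Record where and what is running, using variables SLURM sets for the job
echo "Job $SLURM_JOB_NAME ($SLURM_JOB_ID) on partition $SLURM_JOB_PARTITION"
echo "Running on $SLURM_JOB_NUM_NODES node(s): $SLURM_JOB_NODELIST"

# Jobs already start in the submission directory by default, but being
# explicit does no harm if the script changes directories along the way
cd "$SLURM_SUBMIT_DIR"
</source>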