Code:SLURM

User environment

Zeroth, set up your own system (for Linux) by adding a host definition for circe to your $HOME/.ssh/config file. Then you don't have to keep typing circe's full hostname and your username every time you run ssh from Linux.

Host rc
  User <username on circe>
  Hostname rcslurm.rc.usf.edu
  ServerAliveInterval 30
  ServerAliveCountMax 120
  ForwardX11 yes

You may need:

 mkdir -p $HOME/.ssh && chmod 700 $HOME/.ssh
 vi $HOME/.ssh/config # paste the host block above, then ':wq' to save and exit
 chmod 600 $HOME/.ssh/config
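
With the host definition in place, connecting (with X11 forwarding already turned on by the config) is just:

 ssh rc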

The slurm job status command is squeue. A helpful alias for monitoring your own jobs is:

 alias myq="squeue -u $USER"
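
To keep an eye on the queue while jobs run, the standard watch utility is handy (nothing circe-specific here; the 60-second interval is just a suggestion):

 watch -n 60 "squeue -u $USER"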

Queue

Here's a basic template for queuing a job using MPI (here 2 whole nodes and 6h max run-time).

<source lang="bash">
#!/bin/bash
#SBATCH -J test
#SBATCH -N 2 -t 6:00:00

module load mpi/openmpi/1.4.5 compilers/intel/11.1.064

start=`date +%s`
mpirun parallel-executable
end=`date +%s`
echo "Job completed in $((end-start)) seconds."
</source>

By default, slurm jobs start in the same directory from which sbatch was invoked.

A few more useful options are:

 #SBATCH -o output_log_name.log
 #SBATCH --mem=2000

The -o option specifies the name of the output log file instead of the default (a name auto-generated from the job number, e.g. slurm-22425.out), and --mem requests memory for the job (here 2000 MB per node). By default, the log file includes both standard output and standard error from the job.
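
If you want standard error kept in its own file, sbatch also takes a -e/--error option, and %j in either filename expands to the job number (a minimal sketch; the filenames are just examples):

 #SBATCH -o job-%j.log
 #SBATCH -e job-%j.err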

Submit with

<source lang="bash">
sbatch -p saturn job.sh
</source>

The -p saturn selects the default partition and can be left out. Other possible values select different node partitions:

  • "jupiter": 444 cores; preemptable by "deadline" QOS (currently inactive).
  • "saturn": 280 cores; default; preemptable by "deadline" QOS (currently inactive).
  • "neptune": 168 cores; preemptable by "deadline" QOS (currently inactive).
  • "hii_broad": 80 cores; testing "contributor" hardware pool; preemptable by "hii_broad" QOS (active).
  • "titan": 16 cores; no preemption; 128 GB RAM for large memory jobs.
  • "pluto": 8 cores; no preemption.

To check execution and status, use squeue, or the myq alias defined above (add it to your .bashrc to make it permanent).
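
A few common variants (the job ID here is just an example):

 myq              # your own jobs, via the alias above
 squeue -p titan  # all jobs in a given partition
 scancel 22425    # cancel a job by its job ID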

For more info, see LLNL's Slurm Quickstart Guide.

Here's a run-down of some of the environment variables available while the job is running (useful for scripting); see sbatch's manpage for more, and the example script after this list:

  • SLURM_JOB_NAME - Name of the job.
  • SLURM_JOB_ID - The ID of the job allocation.
  • SLURM_CPUS_ON_NODE - Number of CPUS on the allocated node.
  • SLURM_JOB_NODELIST - List of nodes allocated to the job in a compressed format.
  • SLURM_JOB_NUM_NODES - Total number of nodes in the job’s resource allocation.
  • SLURM_JOB_CPUS_PER_NODE - Count of processors available to the job on this node.
  • SLURM_SUBMIT_DIR - The directory from which sbatch was invoked.
  • SLURM_JOB_PARTITION - Name of the partition in which the job is running.
  • SLURM_LOCALID - Node local task ID for the process within a job.
  • SLURM_GTIDS - Global task IDs running on this node. Zero origin and comma separated.
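
As a small illustration, a job script can use these variables to record where and how it ran (a sketch only; the job name and time limit are arbitrary):

<source lang="bash">
#!/bin/bash
#SBATCH -J env-demo
#SBATCH -N 1 -t 0:05:00

# Log the basic facts of this allocation using SLURM's environment variables.
echo "Job $SLURM_JOB_NAME ($SLURM_JOB_ID) in partition $SLURM_JOB_PARTITION"
echo "Nodes: $SLURM_JOB_NODELIST ($SLURM_JOB_NUM_NODES node(s), $SLURM_CPUS_ON_NODE CPUs on this one)"
cd "$SLURM_SUBMIT_DIR"   # where sbatch was invoked; already the default working directory
</source>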