Quick Introduction

JURECA Usage Model

JURECA is accessed through a dedicated set of login nodes used to write and compile applications as well as to perform pre- and post-processing of simulation data. Access to the compute and visualization nodes in the system is controlled by the workload manager.

Data Management

The available parallel filesystems on JURECA are mounted from JUST. JUQUEEN users should be aware that they work in the same $HOME and $WORK directories as on that production machine. The same filesystems are available on the JUDAC data gateway, which is the recommended system for initiating transfers of large files.

Please note that the bandwidth of the connection between the JURECA login nodes and external networks has not yet reached the final stage of expansion. For the transfer of large files to or from the GPFS filesystems we currently recommend using the JUDAC data gateway.

JURECA Hardware Overview

Type                   | Quantity | Description
Standard / Slim        | 1605     | 24 cores, 128 GiB
Fat (type 1)           | 128      | 24 cores, 256 GiB
Fat (type 2)           | 64       | 24 cores, 512 GiB
Accelerated            | 75       | 24 cores, 128 GiB, 2x K80
Login                  | 12       | 24 cores, 256 GiB
Visualization (type 1) | 10       | 24 cores, 512 GiB, 2x K40
Visualization (type 2) | 2        | 24 cores, 1 TiB, 2x K40
Booster (KNL)          | 1640     | 68 cores, 96 GiB

Access to JURECA

See here for details on how to log on to the system.

Software on JURECA

For the usage of the module command, compilers and pre-installed applications, please see the Software page.

Compile with the Intel Compilers

See here for details on how to compile Fortran, C or C++ programs with the Intel compilers.

Batch System on JURECA

The batch system on JURECA is the Slurm (Simple Linux Utility for Resource Management) Workload Manager, a free open-source batch system. The resource management on JURECA is not performed by Slurm but by the proven Parastation process management daemon already successfully employed on JUROPA. The workload management system resulting from the integration of Slurm and the Parastation management toolbox, developed by the project partner ParTec together with JSC, has reached production quality but has not been finalized yet. The final goal of this work is the development of a Slurm/Parastation package that combines the advantages of Slurm and of the Parastation cluster management while providing a Slurm-conforming look and feel to the batch system user.

Available Partitions

Compute nodes are used exclusively by jobs of a single user; nodes are not shared between jobs. The smallest allocation unit is one node (24 cores). Users are charged for the number of allocated compute nodes multiplied by the wall-clock time used.

On each node, a share of the available memory is reserved and not available for application usage. Batch jobs are guaranteed a minimum of 4 GiB, 8 GiB and 16 GiB per core (96 GiB, 192 GiB and 384 GiB in total) on nodes with 128 GiB, 256 GiB and 512 GiB main memory, respectively.

In contrast to JUROPA, batch and interactive jobs on JURECA can be used interchangeably and all limits apply to both use cases. The default batch partition is intended for production jobs and encompasses nodes with 128 GiB and 256 GiB main memory. The mem512 partition contains 64 nodes with 512 GiB main memory each. The gpus and vis partitions provide access to GPU-equipped compute and visualization nodes, respectively. The GPU-equipped large-memory nodes (1 TiB main memory) are accessible through the mem1024 as well as the vis partition. To support development and optimization of single-node performance, an additional devel partition is available. For software development and optimization efforts targeted at GPU-equipped compute nodes, the develgpus partition can be used.

The booster partition contains 1520 KNL nodes configured in Quadrant NUMA mode with Hybrid50 memory mode. The 24 KNL nodes in the develbooster partition are configured in the same way but are intended for software development, small and short tests, and compilation of applications targeting the KNL architecture. The modetestbooster partition is intended for testing different KNL configurations and comprises 96 KNL nodes divided into three groups of 32 nodes each.

Partition       | Resource                              | Value
devel           | max. wallclock time                   | 2 h
                | default wallclock time                | 30 min
                | min. / max. number of nodes           | 1 / 8
                | max. no. of running jobs              | 4
                | node types                            | mem128 (128 GiB)
develgpus       | max. wallclock time                   | 2 h
                | default wallclock time                | 30 min
                | min. / max. number of nodes           | 1 / 2
                | max. no. of running jobs              | 2
                | node types                            | mem128,gpu:[1-4] (128 GiB, 2x K80 per node)
batch           | max. wallclock time (normal / nocont) | 1 d / 6 h
                | default wallclock time                | 1 h
                | min. / max. number of nodes           | 1 / 256
                | node types                            | mem128 (128 GiB) and mem256 (256 GiB)
mem256          | max. wallclock time (normal / nocont) | 1 d / 6 h
                | default wallclock time                | 1 h
                | min. / max. number of nodes           | 1 / 128
                | node types                            | mem256 (256 GiB)
mem512          | max. wallclock time (normal / nocont) | 1 d / 6 h
                | default wallclock time                | 1 h
                | min. / max. number of nodes           | 1 / 32
                | node types                            | mem512 (512 GiB)
mem1024         | max. wallclock time (normal / nocont) | 1 d / 6 h
                | default wallclock time                | 1 h
                | node types                            | mem1024 (1 TiB)
gpus            | max. wallclock time (normal / nocont) | 1 d / 6 h
                | default wallclock time                | 1 h
                | min. / max. number of nodes           | 1 / 32
                | node types                            | mem128,gpu:[1-4] (128 GiB, 2x K80 per node)
vis             | max. wallclock time (normal / nocont) | 1 d / 6 h
                | default wallclock time                | 1 h
                | min. / max. number of nodes           | 4
                | node types                            | mem512,gpu:[1-2] (512 GiB, 2x K40 per node) and mem1024,gpu:[1-2] (1 TiB, 2x K40 per node)
booster         | max. wallclock time (normal / nocont) | 1 d / 6 h
                | default wallclock time                | 1 h
                | min. / max. number of nodes           | 1 / 64
                | node types                            | mem96 with feature quadhybrid
develbooster    | max. wallclock time                   | 6 h
                | default wallclock time                | 30 min
                | min. / max. number of nodes           | 1 / 8
                | node types                            | mem96 with feature quadhybrid
modetestbooster | max. wallclock time (normal / nocont) | 1 d / 6 h
                | default wallclock time                | 1 h
                | min. / max. number of nodes           | 1 / 32
                | node types                            | mem96 with feature snc4flat, snc4cache or quadcache

A limit regarding the maximum number of running jobs per user is enforced. The precise values are adjusted to optimize system utilization. In general, the limit for the number of running jobs is lower for nocont projects.

In addition to the above-mentioned partitions, the large partition is available for large and full-system jobs. The partition is open for submission, but jobs will only run in selected timeslots. The maximum wallclock time is not limited, but jobs with a wallclock time above 30 minutes need to be coordinated with the user support.

In order to request nodes with particular resources (mem256, mem512, mem1024, gpu), generic resources need to be requested at job submission. Please note that the mem256, mem512 and mem1024 partitions may only be used for applications requiring large memory, and the gpus and develgpus partitions only for applications using GPU accelerators.

The modetestbooster partition is divided into 3 different groups of 32 nodes each. Currently the KNL configurations are: a) SNC4 + Flat, b) SNC4 + Cache and c) Quadrant + Cache. In Slurm those groups have been configured with different "Features"; the current list of Features is: a) snc4flat, b) snc4cache and c) quadcache. In order to use a certain configuration, users must also add the constraint submission option "-C, --constraint=<feature>", for example: "sbatch ... -p modetestbooster -C snc4flat ...". Submissions to the modetestbooster partition are denied if no Feature is specified.
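For illustration, a minimal job script sketch requesting the SNC4 + Flat configuration via a constraint could look as follows (the executable name knl-prog is a placeholder):

#!/bin/bash -x
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --time=00:30:00
#SBATCH --partition=modetestbooster
#SBATCH --constraint=snc4flat
# --constraint=snc4flat is equivalent to -C snc4flat and selects the SNC4 + Flat nodes

srun ./knl-prog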

Writing a Batch Script

Users submit batch applications (usually bash scripts) using the sbatch command. The script is executed on the first compute node in the allocation. To execute parallel MPI tasks users call srun within their script.
Please note that mpiexec is not supported on JURECA and has to be replaced by srun.
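As a rough illustration (assuming an MPI binary named mpi-prog), an mpiexec invocation translates to srun as follows:

# instead of:  mpiexec -np 96 ./mpi-prog
srun --ntasks=96 ./mpi-prog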


The minimal template to be filled is:

#!/bin/bash -x
#SBATCH --nodes=<no of nodes>
#SBATCH --ntasks=<no of tasks (MPI processes)>
# can be omitted if --nodes and --ntasks-per-node
# are given
#SBATCH --ntasks-per-node=<no of tasks per node>
# if keyword omitted: Max. 48 tasks per node
# (SMT enabled, see comment below)
#SBATCH --cpus-per-task=<no of threads per task>
# for OpenMP/hybrid jobs only
#SBATCH --output=<path of output file>
# if keyword omitted: Default is slurm-%j.out in
# the submission directory (%j is replaced by
# the job ID).
#SBATCH --error=<path of error file>
# if keyword omitted: Default is slurm-%j.out in
# the submission directory.
#SBATCH --time=<walltime>
#SBATCH --partition=<batch, mem512, ...>

# *** start of job script ***
# Note: The current working directory at this point is
# the directory where sbatch was executed.

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun <executable>

Multiple srun calls can be placed in a single batch script. Options such as --nodes, --ntasks and --ntasks-per-node are by default taken from the sbatch arguments but can be overwritten for each srun invocation. The default partition on JURECA, which is used if --partition is omitted, is the batch partition.
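As a sketch (the binary names are placeholders), a single batch script can, for instance, run two steps with different task counts:

#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:30:00
#SBATCH --partition=batch

# first step uses the full allocation (4 nodes x 24 tasks = 96 tasks)
srun ./pre-prog
# second step overrides the sbatch defaults and uses only 2 nodes with 12 tasks each
srun --nodes=2 --ntasks-per-node=12 ./main-prog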

Note: If --ntasks-per-node is omitted or set to a value higher than 24, SMT (simultaneous multithreading) will be enabled. Each compute node has 24 physical cores and 48 logical cores.

Job Script Examples

Example 1: MPI application starting 96 tasks on 4 nodes using 24 CPUs per node (no SMT) running for max. 15 minutes.

#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks=96
#SBATCH --ntasks-per-node=24
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=batch

srun ./mpi-prog

Example 2: MPI application starting 1536 tasks on 32 nodes using 48 logical CPUs (hardware threads) per node (SMT enabled) running for max. 20 minutes.

#!/bin/bash -x
#SBATCH --nodes=32
#SBATCH --ntasks=1536
#SBATCH --ntasks-per-node=48
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

srun ./mpi-prog

Example 3: Hybrid application starting 3 tasks per node on 4 allocated nodes and starting 7 threads per task (no SMT).

#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks=12
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=7
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./hybrid-prog

Example 4: Hybrid application starting 4 tasks per node on 3 allocated nodes and starting 12 threads per task (SMT enabled).

#!/bin/bash -x
#SBATCH --nodes=3
#SBATCH --ntasks=12
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./hybrid-prog

Example 5: MPI application starting 96 tasks on 4 nodes using 24 CPUs per node (no SMT) running for max. 15 minutes on nodes with 256 GiB main memory. This example is identical to Example 1 except for the requested node type.

#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks=96
#SBATCH --ntasks-per-node=24
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=batch
#SBATCH --gres=mem256

srun ./mpi-prog

The job script is submitted using

sbatch <jobscript>

On success, sbatch writes the job ID to standard out.

Note: One can also define sbatch options on the command line, e.g.,

sbatch --nodes=4 --time=01:00:00 <jobscript>

Requesting Generic Resources

In order to request resources with special features (additional main memory, GPU devices) the --gres option to sbatch can be used. For mem256 and mem512 nodes, which are accessible via specific partitions, the --gres option can be omitted. Since the GPU and visualization nodes feature multiple user-visible GPU devices (4 on GPU compute nodes and 2 on visualization nodes) an additional quantity needs to be specified as shown in the following examples.

Option                               | Requested hardware features
--partition=mem256                   | 256 GiB main memory
--gres=mem512 --partition=mem512     | 512 GiB main memory
--gres=gpu:2 --partition=gpus        | 2 GPUs per node
--gres=gpu:4 --partition=gpus        | 4 GPUs per node
--gres=mem1024 --partition=mem1024   | 1 TiB main memory
--gres=mem1024,gpu:2 --partition=vis | 1 TiB main memory and 2 GPUs per node

If no specific memory size is requested the default --gres=mem128 is automatically added to the submission. Please note that jobs requesting 128 GiB may also run on nodes with 256 GiB if no other free resources are available.
The nodes equipped with 1 TiB main memory are accessible through the mem1024 and vis partition depending on the intended use case.

The vis, gpus and develgpus partitions will reject submissions if the corresponding resources are not requested. Please note that GPU applications will only be able to use as many GPUs per node as requested via --gres=gpu:n where n can be 1, 2, 3 or 4 on GPU compute nodes and 1 or 2 on visualization nodes. Please refer to the GPU computing page for examples.
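As an illustration only (gpu-prog is a placeholder; the GPU computing page remains the authoritative reference), a job requesting two nodes with all four GPUs visible on each node might look like:

#!/bin/bash -x
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00
#SBATCH --partition=gpus
#SBATCH --gres=gpu:4
# --gres=gpu:4 makes all 4 GPU devices per node visible to the job

srun ./gpu-prog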

Interactive Sessions

Interactive sessions can be allocated using the salloc command

salloc --partition=devel --nodes=2 --time=00:30:00

Once an allocation has been made salloc will start a bash on the login node (submission host). One can then execute srun from within the bash, e.g.,

srun --ntasks=4 --ntasks-per-node=2 --cpus-per-task=7 ./hybrid-prog

The interactive session is terminated by exiting the shell. In order to obtain a shell on the first allocated compute node, one can start a remote shell from within the salloc session and connect it to a pseudo terminal using

srun --cpu_bind=none --nodes=2 --pty /bin/bash -i

The option --cpu_bind=none is used to disable CPU binding for the spawned shell. In order to execute MPI applications, one uses srun again from within the remote shell. To support X11 forwarding, the --forward-x option to srun is available. X11 forwarding is required for users who want to use applications or tools that provide a GUI.
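A hedged sketch combining the two options mentioned above, started from within an existing salloc session:

# remote shell on the first allocated node with X11 forwarding enabled
srun --forward-x --cpu_bind=none --nodes=1 --pty /bin/bash -i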

Note: Your account will be charged per allocation whether the compute nodes are used or not. Batch submission is the preferred way to execute jobs.

Other Useful sbatch and srun Options

  • To receive e-mail notifications, users have to specify --mail-user=<e-mail address> and set --mail-type=<type> with one of the valid types BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50 or ARRAY_TASKS. Multiple type values may be specified in a comma-separated list.
  • Stdout and stderr can be combined by specifying the same file for the --output and --error option.
  • A job name can be specified using the --job-name option.
  • If --ntasks is omitted, the number of nodes can be specified as a range --nodes=<min no. of nodes>-<max no. of nodes>, allowing the scheduler to start the job with fewer nodes than the maximum
    requested if this reduces the wait time. A sketch combining these options follows below.
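A hedged sketch of a job header combining the options above (the e-mail address and job name are placeholders):

#!/bin/bash -x
#SBATCH --job-name=myjob
#SBATCH --nodes=2-4
# range: start with 2 to 4 nodes, whichever becomes available first
#SBATCH --ntasks-per-node=24
#SBATCH --time=01:00:00
#SBATCH --partition=batch
#SBATCH --output=job-%j.log
#SBATCH --error=job-%j.log
# same file for --output and --error combines stdout and stderr
#SBATCH --mail-user=user@example.org
#SBATCH --mail-type=END,FAIL

srun ./mpi-prog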

Summary of sbatch and srun Options

The following table summarizes important sbatch and srun command options:

Option                  | Description
--nodes                 | Number of compute nodes used by the job. Can be omitted if --ntasks and --ntasks-per-node are given.
--ntasks                | Number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given.
--ntasks-per-node       | Number of tasks per compute node.
--cpus-per-task         | Number of logical CPUs (hardware threads) per task. This option is only relevant for hybrid/OpenMP jobs.
--output                | Path to the job's standard output. Slurm supports format strings containing replacement symbols such as %j (job ID).
--error                 | Path to the job's standard error. Slurm supports format strings containing replacement symbols such as %j (job ID).
--time                  | Maximum wall-clock time of the job.
--partition             | Partition to be used, e.g. batch, mem512 or gpus. If omitted, batch is the default.
--mail-user             | E-mail address to receive mail notifications.
--mail-type             | When to send mail notifications.
--pty (srun only)       | Execute the first task in pseudo terminal mode.
--forward-x (srun only) | Enable X11 forwarding on the first allocated node.

More information is available on the man pages of sbatch, srun and salloc which can be retrieved on the login nodes with the commands man sbatch, man srun and man salloc, respectively.

Other Slurm Commands

Command                             | Description
squeue                              | Show status of all jobs.
scancel <jobid>                     | Cancel a job.
scontrol show job <jobid>           | Show detailed information about a pending, running or recently completed job.
scontrol update job <jobid> set ... | Update a pending job.
scontrol -h                         | Show detailed information about scontrol.
sacct -j <jobid>                    | Query information about old jobs.
sprio                               | Show job priorities.
smap                                | Show distribution of jobs. For a graphical interface users are referred to llview.
sinfo                               | View information about nodes and partitions.
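As a brief usage sketch (the job ID 123456 is a placeholder):

squeue -u $USER              # list only your own jobs
scontrol show job 123456     # inspect a specific job
scancel 123456               # cancel it if necessary
sacct -j 123456              # query accounting data after the job has finished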

For further information please see also the Slurm documentation.

