# Error messages on JUGENE

Error message 'import site' failed; use -v for traceback

The error message

'import site' failed; use -v for traceback

might occur when using Python in a batch job. This is due to the fact that the variable $HOME is not set on the I/O nodes of JUGENE. The error message can be avoided if $HOME is passed to the mpirun command:

mpirun -exp_env HOME ...

Environment variable list shouldn't be longer than 2049 chars

If a program execution aborts with the error message

Environment variable list for the job is .... characters
It shouldn't be longer than 2049 chars - Aborting.

use

mpirun -exp_env VARNAME1 [ -exp_env VARNAME2 ]

to export the needed environment variables.

Application killed with signal 6 (executable too big)

Some users may find that some Blue Gene/P applications exit with a signal 6 and without any further error message. This may be caused by an executable that is too big for the compute nodes, which have only 512 MB of memory per processor.

IBM is working on providing a proper error message for this situation in the next release of the BG/P software.

Application killed with signal 7 (memory alignment error)

Some users may find that some Blue Gene/P jobs exit with a segmentation fault and signal 7 (SIGBUS). This is due to an undocumented feature on the Blue Gene/P that deals with memory alignment during I/O operations. This feature allows the user to set a memory alignment error threshold for I/O operations in their job, but it causes confusion due to its default value of 1000. The threshold can be controlled using an environment variable.

Die as soon as a memory alignment error is encountered during I/O:

-env BG_MAXALIGNEXP=1

Silence all memory alignment errors during I/O:

-env BG_MAXALIGNEXP=-1

Configure a threshold to work around occasional memory alignment errors during I/O (e.g. 2000 in this case):

-env BG_MAXALIGNEXP=2000

Note that memory alignment errors during I/O have the potential to reduce I/O performance; if possible, it is therefore better to eliminate them than to ignore them.

OSABI value of executable file is not UNIX System V

If a program execution aborts with the error message

Load failed on 134.94.xx.yy: OSABI value of executable file is not UNIX System V

the called executable might be one that was compiled for BlueGene/L.

The exit code of the LoadLeveler job is then 256.

Magic value in ELF header of executable file is invalid

If a program execution aborts with the error message

Load failed on 134.94.xx.yy: Magic value in ELF header of executable file is invalid

the called executable might be one that was compiled for AIX.

The exit code of the LoadLeveler job is then 256.

DMA related RAS events APPL_0A2B and APPL_0201 --> improve communication performance

When using the option -verbose 2 with the mpirun command, users might find information about the RAS events APPL_0A2B and/or APPL_0201 in stderr. These events are not critical (i.e. the application does not fail). However, the performance might not be optimal.

APPL_0A2B - A DMA unit reception FIFO is full

Packets that arrive off the network are placed into a reception buffer. The default size is 8 MB per process. If a process is busy and does not call MPI_Wait often enough, this buffer can become full and the movement of further packets is stopped until the buffer is free again. This can slow down the application. In this case the buffer should be increased to the value given in the RAS event message using the environment variable DCMF_RECFIFO (see Tuning Applications for further information).

APPL_0201 - A DMA unit remote get injection FIFO is full

When a remote get packet arrives off the network, the packet contains a descriptor describing data that is to be retrieved and sent back to the node that originated the remote get packet. The DMA injects that descriptor into a remote get injection buffer. The DMA then processes that injected descriptor by sending the data back to the originating node. Remote gets are commonly done during point-to-point communications when large data buffers are involved (typically larger than 1200 bytes). The default size of the remote get injection FIFO is 32 KB. When a large number of remote get packets are received by a node, the remote get injection buffer may become full of descriptors. The size of the buffer can be adjusted to the value given in the RAS event message using the environment variable DCMF_RGETFIFO (see Tuning Applications for further information).

Error message ‘Invalid communicator’ when using BLACS and MPI

When BLACS (Basic Linear Algebra Communication Subprograms) is used in Fortran together with MPI, the routine BLACS_GRIDINIT expects, among other arguments, an MPI communicator as input. On return from the routine, this communicator is overwritten by an internal BLACS context number. This context number is in general not a valid MPI communicator anymore. Using it for MPI routines may result in an error message like:

Fatal error in MPI_Attr_get: Invalid communicator, error stack

To avoid this error, save the corresponding MPI communicator before calling BLACS_GRIDINIT and use the saved copy for further MPI calls.

Hint: This error might be difficult to detect. On a number of systems the MPI_COMM_WORLD communicator is set to 0 (zero). Incidentally, the BLACS context number is zero on many systems, too. Therefore, the code might run just fine on such systems, since the communicator and the context number have identical values. However, on the BG/P the MPI_COMM_WORLD is set to 0x44 and the code aborts.

How to determine the allocated memory of an application at runtime?

The available memory on a JUGENE node in VN mode is: 474 MB (with V1R4M2, June 2010)

# Determining memory size/usage in C

## Method 1:

#include <sys/resource.h>
#include <common/bgp_personality.h>
#include <common/bgp_personality_inlines.h>
#include <spi/kernel_interface.h>

static _BGP_Personality_t mybgp;

/* returns memory per core in MBytes */
unsigned bg_coreMB() {
    unsigned procMB, coreMB;
    Kernel_GetPersonality(&mybgp, sizeof(_BGP_Personality_t));
    procMB = BGP_Personality_DDRSizeMB(&mybgp);
    coreMB = procMB/Kernel_ProcessCount();
    return coreMB;
}

/* return maximum memory usage of process in kBytes */
unsigned bg_usedKB() {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return 0;
    return usage.ru_maxrss;
}

Compile with:

mpicc -O -I/bgsys/drivers/ppcfloor/arch/include

## Method 2:

#include <bgp_SPI.h>

uint32_t allocated_memory = 0;
Kernel_GetMemorySize(KERNEL_MEMSIZE_HEAP, &allocated_memory);

Kernel_GetMemorySize called with KERNEL_MEMSIZE_HEAP returns the memory allocated on the heap by the application at runtime.

## Example:

#include <mpi.h>
#include <stdlib.h>
#include <stdint.h>
#include <bgp_SPI.h>
#include <stdio.h>

const long MB = 1048576;

int main(int argc, char* argv[]) {
    int myrank, nprocs;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    uint32_t allocated_memory = 0;
    Kernel_GetMemorySize(KERNEL_MEMSIZE_HEAP, &allocated_memory);
    printf("Memory allocation on heap: %ld MB\n", allocated_memory/MB);

    MPI_Finalize();
    return 0;
}

Compile with:

mpixlc_r determine_memory_example.c -o determine_memory_example.x -I/bgsys/drivers/ppcfloor/arch/include/ -I/bgsys/drivers/ppcfloor/arch/include/spi

If you want to obtain the amount of memory allocated on the stack, replace KERNEL_MEMSIZE_HEAP with KERNEL_MEMSIZE_STACK in the example above.

# Determining memory size/usage in Fortran

## Example:

#include <mpi.h>
#include "bgp_SPI.h"
#include <sys/times.h>
#include <sys/resource.h>
#include <assert.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

void print_memory_usage( void )
{
    unsigned int memory_size = 0;

    Kernel_GetMemorySize(KERNEL_MEMSIZE_HEAP, &memory_size);
    fprintf(stderr, "Memory size HEAP:       %10ld \n", (long) memory_size);

    Kernel_GetMemorySize(KERNEL_MEMSIZE_STACK, &memory_size);
    fprintf(stderr, "Memory size STACK:      %10ld \n", (long) memory_size);

    Kernel_GetMemorySize(KERNEL_MEMSIZE_HEAPAVAIL, &memory_size);
    fprintf(stderr, "Memory available HEAP:  %10ld \n", (long) memory_size);

    Kernel_GetMemorySize(KERNEL_MEMSIZE_STACKAVAIL, &memory_size);
    fprintf(stderr, "Memory available STACK: %10ld \n", (long) memory_size);

    Kernel_GetMemorySize(KERNEL_MEMSIZE_HEAPMAX, &memory_size);
    fprintf(stderr, "Maximum memory HEAP:    %10ld \n", (long) memory_size);
}

Compile with:

mpicc -c -I/bgsys/drivers/ppcfloor/arch/include/ -I/bgsys/drivers/ppcfloor/arch/include/spi memory_usage.c

In order to link against the SPI libraries the environment variable MPI_CCLIBS has to be set in the following way:

export MPI_CCLIBS=" -Wl,-rpath,/bgsys/drivers/V1R4M1_460_2009-091110P/ppc/runtime/SPI -lSPI.cna "

The code snippet given below shows how the amount of allocated memory can be obtained for a program at runtime:

 program memory_usage

 implicit none

 integer, parameter :: zeilen=1000
 integer, parameter :: spalten=10000
 integer :: i

 integer,dimension(10000000),parameter :: stack_test=999

 integer, dimension(:,:), allocatable :: matrix

 write(*,*)('=',i=1,60)
 call print_memory_usage()

 write(2, '(10i10)' )stack_test

 write(*,*)('=',i=1,60)
 call print_memory_usage()

 allocate(matrix(zeilen,spalten))
 matrix=-100
 write(1, '(10i10)' )matrix

 write(*,*)('=',i=1,60)
 call print_memory_usage()

 deallocate(matrix)

 write(*,*)('=',i=1,60)
 call print_memory_usage()
 write(*,*)('=',i=1,60)

 end program memory_usage

mpif90 get_memory.f90 memory_usage.o $MPI_CCLIBS

# Environment variables

In case memory runs low in your jobs, you can try to increase the memory available for your application on the nodes by using one or more of the environment variables described in Compiling and Tuning.

How to do time measurements in Fortran and C?

## Time measurement in Fortran:

For timing measurements on JUGENE, Fortran users should use either the Fortran 90 intrinsic system_clock or MPI_WTIME.

## Time measurement in C:

#include <common/bgp_personality.h>
#include <common/bgp_personality_inlines.h>
#include <spi/kernel_interface.h>

static _BGP_Personality_t mybgp;
static double clockspeed = 0.0;

/* initialize the clockspeed scaling factor */
unsigned bg_wtime_init() {
    unsigned clockMHz;
    Kernel_GetPersonality(&mybgp, sizeof(_BGP_Personality_t));
    clockMHz = BGP_Personality_clockMHz(&mybgp);
    clockspeed = 1.0e-6/(double)clockMHz;
    return clockMHz;
}

/* return time in seconds */
double bg_wtime() {
    return ( _bgp_GetTimeBase() * clockspeed );
}

What is the actual software release?

The central path for BGP applications is

/bgsys/drivers/ppcfloor/

where libraries, include files, etc. can be found. The actual PTF set of the BGP software can be found by checking

ls -l /bgsys/drivers/ppcfloor

which is a link to the driver version, e.g.

/bgsys/drivers/ppcfloor -> /bgsys/drivers/V1R2M0_200_2008-080513P

which means <version>_<week*100>-<builddate>

Core dump handling

## Core dumps are disabled on JUGENE by default.

Because writing core files from thousands of nodes takes (too) much time, the generation of core files is suppressed.

## How to enable core files?

The BG_COREDUMPDISABLED environment variable is specified when a job is submitted with mpirun:

BG_COREDUMPDISABLED=1 - core dumps disabled
BG_COREDUMPDISABLED=0 - core dumps enabled

The following mpirun command shows how to enable core files:

mpirun -env BG_COREDUMPDISABLED=0 -exe filename.rts

## How to read core files?
Core files are plain text files that include traceback information in hexadecimal. The tool addr2line may help to read and convert the hexadecimal addresses; for more information use addr2line -h. (Compilation should have included the option -g.)

How to use the GNU auto configure tools on JUGENE?

When using the GNU auto configure tools (automake, autoconf, ...) for the Blue Gene/P backend compilers, the following sequence of commands seems to work best:

export CXX=/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-g++
export CC=/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-gcc
export CFLAGS="-g -O2 -dynamic"
export CXXFLAGS="-g -O2 -dynamic"

Then

libtoolize --force
aclocal
autoheader
automake
autoconf
./configure --host=powerpc-bgp-linux ...

Running configure on JUGENE Compute Nodes

Running configure on a system which requires cross-compiling, like Blue Gene/P, is often tricky. The configure script tries to start small pieces of code to check the environment. Starting configure within a LoadLeveler job works much better than using llrun for executing every small program. Here is an example how this can be done. The compiler settings, flags and execution steps highly depend on the software package and have to be taken from the install documentation.

# @ job_name = HDF5_Config
# @ comment = "BG/P Configure / 32"
# @ error = $(job_name).$(host).$(jobid).out
# @ output = $(job_name).$(host).$(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 00:30:00
# @ notification = never
# @ notify_user = t.user@fz-juelich.de
# @ job_type = bluegene
# @ bg_size = 32
# @ queue

echo "============================================="
BG_BASE="/bgsys/drivers/ppcfloor"
export CC="mpcc"
export CXX="mpCC"
export F77="mpxlf"
export F90="mpxlf90"
export F9X="mpxlf90"
BG_INCLUDE="-I$BG_BASE/comm/include \
  -I$BG_BASE/arch/include \
  -I$BG_BASE/gnu-linux/powerpc-bgp-linux/sys-include "

export CFLAGS="-O3 -qarch=450 $BG_INCLUDE "
export FFLAGS="-O3 -qarch=450 $BG_INCLUDE "
export CXXFLAGS="-O3 -qarch=450 $BG_INCLUDE "
export RUNSERIAL="mpirun -np 1 "
export RUNPARALLEL="mpirun -np $${NPROCS:=3} "

./configure --enable-parallel --enable-fortran --disable-cxx --disable-stream-vfd
echo "============================================="
make
echo "============================================="
make check

What are the conditions for temporary data?

For temporary user data it is recommended to use $WORK instead of /tmp, because $WORK is a lot bigger.

Do not use /tmp! /tmp is very small (about 1 GB) and only available for programs running on the front-end nodes. Data will be held only for 7 days. Programs started with mpirun on a Blue Gene partition are not able to access /tmp; jobs trying this will be terminated.

How can I link Fortran subroutines into my C program?

To link Fortran subroutines, either from libraries like ESSL or LAPACK or as parts of your own code, into a C main program, you have to add the following additional libraries after the Fortran routines and all Fortran libraries in your link statement:

-L${XLFLIB_FZJ} -lxl -lxlopt -lxlf90_r -lxlfmath \
-L${XLSMPLIB_FZJ} -lxlomp_ser -lpthread

FZJ has introduced the environment variables XLFLIB_FZJ and XLSMPLIB_FZJ as pointers to the current compiler version, so that makefiles can be kept independent of compiler changes.

How can a soft limit for the wall clock time be used?

At the moment there is no way to use the soft limit. The signal, which is sent by LoadLeveler on the front-end node, is not routed to the Blue Gene application. The only, though not fully adequate, alternative is to check the user time from within the application and estimate the remaining time manually, see: How to do memory or time measurements in Fortran and C? (The boot time for the partition is part of the wall_clock_limit.)

Why is EOF not found on stdin when using llrun?

It is urgently recommended not to read from stdin on JUGENE! If you have a program doing a fixed number of reads, it works, but any read looking for EOF does not terminate when using llrun. (EOF works fine with mpirun in a LoadLeveler job.) It seems that EOF gets lost in stdin. This happens

• with llrun,
• with mpirun in a LoadLeveler job,
• with C,
• with Fortran.

The job will hang in all three situations:

• mpirun .... < my_stdin
• cat my_stdin | mpirun (as in the example in the Redbook)
• # @ input = my_stdin (in the job control)

As a workaround, you have to open the input file in your program and read from this file. You can pass the file name as an argument to your program if you do not want to recompile your program or to move your input file every time you change the name of your input.

Why should jobs write regular checkpoints?

The enhanced complexity of the new-generation supercomputers at JSC increases the probability that a job might be affected by a failure. Therefore, we strongly encourage all users of these systems to write regular checkpoints from their applications to avoid losses of CPU time when a job is aborted. There will be no refund of CPU time in the case of a failed job!
Tip: Besides checkpointing, jobs with a time limit of less than the maximum of 24 hours might have a better turnaround time on JUGENE, because they can be used to optimally fill the machine while it is being prepared for regular maintenance slots or full-machine runs.

How to use system calls on JUGENE?

Because the compute nodes on the Blue Gene/P do not have a full Linux kernel implementation, some system routines are not available. In particular, fork(), exec() and the Fortran "call system" do not work. However, there are replacement routines which can be used instead. You can find the available routines in the Blue Gene/P Application Development Redbook.

Hints for Fortran users: These routines are C routines, but the Fortran compiler includes them automatically as well. You just have to be careful when using strings: they have to have "\0" as the last character. You can either add this to the string which you pass to the C routines or you can use the following compiler flag:

-qnullterm

Here is an example Fortran program which renames "file1" to "file1a" and "file2" to "file2a":

program system_calls
 implicit none
 include 'mpif.h'
 character*5 :: filen1='file2'
 character*6 :: filen2='file2a'
 integer ierror,nranks,my_rank
 call MPI_INIT(ierror)
 call MPI_COMM_SIZE(MPI_COMM_WORLD,nranks,ierror)
 call MPI_COMM_RANK(MPI_COMM_WORLD,my_rank,ierror)
 if (my_rank == 0) then
   write(*,*) 'Moving file1 to file1a'
   call rename("file1","file1a")
   write(*,*) 'Moving ',filen1,' to ',filen2
   call rename(filen1,filen2)
 endif
 call MPI_FINALIZE(ierror)
end

Compilation:

mpixlf90 -qnullterm *.f90

How to generate and upload SSH keys?

In order to access the JSC computer systems you need to generate an SSH key pair. This pair consists of a public and a private part.
Please follow the links to the system-specific support pages to receive information on how to generate your SSH key pair and how to manage your SSH connections in general.

# Data questions

How to share files by using ACLs?

Linux file permissions define the access rights to read, write or execute (rwx) files and directories, but they are limited to one user, one group and all others. ACLs (Access Control Lists) allow a more fine-grained assignment of access rights: the owner of a file/directory can define specific rights for other users and groups.

## Linux commands to manage ACLs

- List the ACLs of a file/directory:

getfacl <file/directory>

- Give user john1 read and write access to the file example.txt and give user lucy1 the right to read this file:

setfacl -m u:john1:rw example.txt
setfacl -m u:lucy1:r example.txt

# file: example.txt
# owner: smith1
# group: cjsc
user::rw-
user:john1:rw-
user:lucy1:r--
group::---
mask::rw-
other::---

- Remove the ACLs of user john1 on example.txt:

setfacl -x u:john1 example.txt

# file: example.txt
# owner: smith1
# group: cjsc
user::rw-
user:lucy1:r--
group::---
mask::rw-
other::---

- Allow users from group zam to change into the directory share:

setfacl -m g:zam:x share/

# file: share
# owner: smith1
# group: cjsc
user::rwx
group::---
group:zam:--x
mask::rw-
other::---

- Remove all ACLs from the directory share:

setfacl -b share

# file: share
# owner: smith1
# group: cjsc
user::rwx
group::---
other::---

Further information (e.g. setting ACLs recursively, setting default ACLs, inheriting ACLs, ...) can be found in the manual pages.

## Which files have an access control list?

The command ls -l will show a "+" for every file that has an ACL set, e.g.

drwx------+ 2 john1 cjsc 32768 Feb 21 09:25 share

How to restore files?

# How to restore user or project data

All file systems except for $SCRATCH provide data protection mechanisms based either on the IBM Spectrum Protect (TSM) or the Spectrum Scale (GPFS) snapshot technology.

Note that for TSM, only the JUDAC system is capable of retrieving lost data from the backup, using the command line tool adsmback.

Do not use the native dsmj command, which will not show any home data.

## $HOME - User's personal data

All files within the user's home directories ($HOME) are automatically backed up by TSM (Tivoli Storage Manager). To restore a file, use the adsmback tool (see above)

on JUDAC.

This command grants access to the correct backup data of the user's assigned home directory.

 Restore -> View -> Display active/inactive files
 File level -> p -> home -> jusers -> userid -> ...
 Select files or directories to restore
 Press [Restore] button

If the data should be restored to the original location, choose within the Restore Destination window:

• Original location

Otherwise select:

• Following location + <path> + Restore complete path

## $PROJECT - Compute project repository

All files within the compute project directories ($PROJECT) are automatically backed up by TSM (Tivoli Storage Manager). To restore a file, use the adsmback tool (see above)

on JUDAC.

This command grants access to the correct backup data of the project repository.

 Restore -> View -> Display active/inactive files
 File level -> p -> project -> group -> ...
 Select files or directories to restore
 Press [Restore] button

If the data should be restored to the original location, choose within the Restore Destination window:

• Original location

Otherwise select:

• Following location + <path> + Restore complete path

## $FASTDATA - Data project repository (bandwidth optimized)

The files within the data project directories ($FASTDATA) are not externally backed up to tape. Instead, an internal backup based on the snapshot feature of the file system (GPFS) is offered. The difference between the TSM backup and the snapshot-based backup is that TSM acts on file changes, while snapshots save the state at a certain point in time. Right now the following snapshots are configured:

• daily: backup of the last day, created today just after midnight
• weekly: backup of the last week, created every Sunday just after midnight
• monthly: last three copies retained, created every 1st day of the month just after midnight

The snapshots can be found in a special subdirectory of the project repository. Go to

cd $FASTDATA/.snapshots

and list the contents:

/p/fastdata/jsc/.snapshots> ls
daily-20181129    daily-20181130    daily-20181203
weekly-20181118   weekly-20181125   weekly-20181202
monthly-20181001  monthly-20181101  monthly-20181201

In the subdirectory <type>-<YYYYMMDD> the file version which was valid at date DD.MM.YYYY can be retrieved using the same path under which the actual file is placed in the $FASTDATA repository.

Due to the fact that the snapshot is part of the file system, the data restore can be performed on any system where it is mounted.

## $DATA - Data project repository (large capacity)

The files within the data project directories ($DATA) are not externally backed up to tape. Instead, an internal backup based on the snapshot feature of the file system (GPFS) is offered. The difference between the TSM backup and the snapshot-based backup is that TSM acts on file changes, while snapshots save the state at a certain point in time. Right now the following snapshots are configured:

• daily: last three copies retained, created today just after midnight
• weekly: last three copies retained, created every Sunday just after midnight
• monthly: last three copies retained, created every 1st day of the month just after midnight

The snapshots can be found in a special subdirectory of the project repository. Go to

cd $DATA/.snapshots

and list the contents:

/p/largedata/jsc> ls
daily-20181129    daily-20181130    daily-20181203
weekly-20181118   weekly-20181125   weekly-20181202
monthly-20181001  monthly-20181101  monthly-20181201

In the subdirectory <type>-<YYYYMMDD> the file version which was valid at date DD.MM.YYYY can be retrieved using the same path under which the actual file is placed in the $DATA repository.

Due to the fact that the snapshot is part of the file system, the data restore can be performed on any system where it is mounted.

## $ARCHIVE - The archive data repository

All files within the user's archive directory ($ARCHIVE) for long-term storage are automatically backed up by TSM (Tivoli Storage Manager). To restore a file, use the adsmback tool (see above)

on JUDAC.

This command grants access to the correct backup data of the project's assigned archive directory.

 Restore -> View -> Display active/inactive files
 File level -> archX -> group -> ...
 Select files or directories to restore
 Press [Restore] button

If the data should be restored to original location then choose within the Restore Destination window:

• Original location

Otherwise select

• Following location + <path> + Restore complete path

How can I see which data is migrated?

There are two file systems which hold migrated data: /arch and /arch2

• These are so called archive file systems.
• In principle all data in the file systems will be migrated to TSM-HSM tape storage in tape libraries.
• Data is copied to TSM backup storage prior to migration.
• Data are not limited by a quota on storage space but by the number of files per group/project. This is done because UNIX is still not able to handle millions of files in a file system with acceptable performance.

The TSM-HSM native command dsmls, which shows whether a file is migrated, is not available on any HPC system (e.g. JUWELS, JURECA, ...) or on the Data Access System (JUDAC). This command is only supported on the TSM-HSM node of the JUST storage cluster, which hosts the file systems for the HPC systems. However, JUST is not open for user access.

Use

ls -ls

to list the files. Migrated files can be identified by a block count of 0 in the first column (-s option) together with an arbitrary number of bytes in the sixth column (-l option).

0 -rw-r----- 1 user group 513307 Jan 22 2008 log1
0 -rw-r----- 1 user group 114 Jan 22 2008 log2
0 -rw-r----- 1 user group 273 Jan 22 2008 log3
0 -rw-r----- 1 user group 22893504 Jan 23 2008 log4

How can I recall migrated data?

Normally migrated files are automatically recalled from TSM-HSM tape storage when the file is accessed on the login nodes of the HPC systems (e.g. JUWELS, JURECA, ...) or the Data Access System (JUDAC).

For an explicit recall the native TSM-HSM command dsmrecall is not available. Please use

tail <filename>

to start the recall process. This command will not change any file attribute, and the migrated version of the file as well as the backup version stay valid.

It is strongly recommended NOT to use

touch <filename>

because this changes the timestamp of the file, so a new backup copy must be created and the file has to be migrated again. These are two additional processes that waste resources if the file is only read by further processing.

How to modify the user's environment?

When users log in on a front-end node using the secure shell software, a shell is started and a set of environment variables is exported. These are defined in system profiles. Each user can add to or modify this environment by using his own profiles in his $HOME directory.

In the Jülich setup there is a separate $HOME directory for each HPC system, which means that the environment differs between JUWELS, JURECA, JUDAC, ... and that the user can modify his own profiles for each system separately. Therefore a skeleton .bash_profile and .bashrc are placed in each $HOME directory when a user is joined to any HPC system.

.bash_profile:

# **************************************************
# bash environment file in $HOME
# http://www.fz-juelich.de/ias/jsc/EN/Expertise/D...
# **************************************************
# Get the aliases and functions: Copied from Cent...
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
export PS1="[\u@\h \W]\$ "

.bashrc:

# **************************************************
# bash environment file in $HOME
# http://www.fz-juelich.de/ias/jsc/EN/Expertise/D...
# **************************************************
# Source global definitions: Copied from CentOS 7...
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi

Separate $HOME directory for each HPC system

E.g. on JUDAC user graf1 will see $HOME="/p/home/jusers/graf1/JUDAC". The profiles located there are used for login. Only the shared folder (a link) always points to the same directory /p/home/jusers/graf1/shared.

Most site-dependent variables are set automatically by the jutil env init command (system profile). The user can set the correct core variables ($PROJECT, $ARCHIVE, ...) by using

jutil env activate -p <project>

How to make the currently enabled budget visible:

If a user has to change his budget account during a login session, it might be helpful to see the currently set budget account in the prompt, to be sure to work on the correct budget. Therefore one should replace the current "export PS1=..." line in .bash_profile by:

prompt() {
PS1="[${BUDGET_ACCOUNTS:-\u}@\h \W]\$ "
}
PROMPT_COMMAND=prompt

This results in the following behaviour:

[user1@juwels07 ~]$ jutil env activate -p chpsadm
[hpsadm@juwels07 ~]$ jutil env activate -p cslfse
[slfse@juwels07 ~]$

What data quotas do exist and how to list usage?

For all data repositories disk quota management is enabled. The values are set to default values (defined by JSC) or depend on special requirements of the projects.

Default data quota per user/project within GPFS file systems:

| File System | Disk Space (Soft Limit) | Disk Space (Hard Limit) | Number of Files (Soft Limit) | Number of Files (Hard Limit) |
|---|---|---|---|---|
| $HOME | 10 GB | 11 GB | 40,000 | 44,000 |
| $SCRATCH | 90 TB | 95 TB | 4 million | 4.4 million |
| $PROJECT | 16 TB | 17 TB | 3 million | 3.1 million |
| $FASTDATA | as granted to project | soft limit + up to 10% additional | as granted to project | soft limit + up to 10% additional |
| $DATA | as granted to project | soft limit + up to 10% additional | as granted to project | soft limit + up to 10% additional |
| $ARCHIVE | as granted to project | soft limit + up to 10% additional | as granted to project | soft limit + up to 10% additional |

File size limit

Although the file size limit on operating system level, e.g. on JUWELS or JURECA, is set to unlimited (ulimit -f), the maximum file size can only be the GPFS group quota limit for the corresponding file system. The actual limits can be listed by jutil.

List data quota and usage by project or user

Members of a group/project can display the hard limits, quotas (soft limits) and usage by each user of the project using the jutil command:

jutil project dataquota -p <project name>

The quota information for the users is updated every 8 hours.

Recommendation for users with a lot of small files

Users with applications that create a lot of relatively small files should reorganize the data by collecting these files within tar archives using the tar -cvf archive-filename ... command. The problem is really the number of files (inodes) that have to be managed by the underlying operating system, not the space they occupy in total. On the other hand, please keep in mind the recommendations under File size limit.

How to avoid multiple SSH connections on data transfer?

When transferring multiple files, it can be problematic to use a separate SSH connection for each transfer operation. The network firewall can block a large number of independent simultaneous SSH connections. There are different options to avoid multiple SSH connections:

Use rsync or use scp with multiple files:

rsync -avhzP local_folder/ username@host:remote_folder

rsync only copies new or changed files; this saves transfer bandwidth.
scp -r local_folder/ username@host:remote_folder

will copy local_folder recursively.

Use tar containers to transfer fewer files:

Creating a tar file and transferring it can be much faster than transferring all files separately:

tar -cf tar.file local_folder

The tar file creation, transmission and extraction process can also be done on the fly:

tar -c local_folder/ | ssh username@host \
  'cd remote_folder; tar -x'

Use a shared SSH connection:

Shared SSH connections allow usage of the same connection multiple times. Open a master connection:

ssh -M -S /tmp/ssh_mux_%h_%p_%r username@host

Reuse the connection:

ssh -S /tmp/ssh_mux_%h_%p_%r username@host

A shared connection can also be used with scp:

scp -o 'ControlPath /tmp/ssh_mux_%h_%p_%r' \
  local_folder username@host:remote_folder

How to ensure the correct group ID for your project data?

In our usage model all compute and data projects get a dedicated data repository in our global parallel file systems. The files stored in this directory belong to the project; therefore all files and sub-directories have to belong to the project's UNIX group. To ensure that all data automatically belongs to this group, the project directory has the setGID bit in place. New files will inherit the project UNIX group by default and sub-directories will get the setGID bit, too. But users can overrule this default behavior (willingly or by accident).

To fix wrong group ownership on your files, use

chown :<group> <target_file>
chown :zam /p/arch/zam/calculation/output.txt

If you have a complete directory to fix, use the recursive option:

chown -R -h :zam /p/arch/zam/calculation

On $ARCHIVE the quota usage is calculated on a UNIX group basis. Therefore a recursive chown is performed nightly on each project directory to apply the corresponding project group.

If the setGID bit is missing on a directory, use

chmod g+s <target directory>
chmod g+s /p/arch/zam/calculation

If the setGID bit is missing in a complete directory tree, use find to fix it for all sub-directories:

find /p/arch/zam/calculation -type d -exec chmod g+s {} \;
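The effect of the find/chmod fix above can be checked in the same pass. This is a sketch on a hypothetical throwaway tree; the real target would be the project directory (e.g. /p/arch/zam/calculation):

```shell
# Hypothetical demo tree standing in for a project directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/calculation/run01"

# Apply the setGID bit to every directory in the tree, as above.
find "$tmp/calculation" -type d -exec chmod g+s {} \;

# Verify: count the directories that now carry the setGID bit
# (here: calculation/ and calculation/run01/, i.e. 2).
sgid_dirs=$(find "$tmp/calculation" -type d -perm -g+s | wc -l)
echo "directories with setGID: $sgid_dirs"
rm -rf "$tmp"
```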

What file system to use for different data?

There are multiple GPFS file systems for different types of user data. Each file system has its own data policies.

• $HOME
Acts as a repository for the user's personal data, like SSH keys. There is a separate HOME folder for each HPC system and a shared folder which points to the same directory on all systems. Data within $HOME is backed up by TSM, see also

• $SCRATCH
Is bound to a compute project and acts as a temporary storage location with high I/O bandwidth. If the application is able to handle large files and I/O demands, $SCRATCH is the right file system to place them. Data within $SCRATCH is not backed up, and a daily cleanup is done:

• Normal files older than 90 days will be purged automatically. In reality, modification and access dates are taken into account, but for performance reasons the access date is not set automatically by the system; it can be set by the user explicitly with touch -a <filename>. The time stamps recorded with files can be listed easily with stat <filename>.

• Empty directories, as they arise amongst others due to the deletion of old files, will be deleted after 3 days. This also applies to trees of empty directories, which will be deleted recursively from bottom to top in one step.

• $PROJECT
Data repository for a compute project. Its lifetime is bound to the project lifetime. Data are backed up by TSM.

• $FASTDATA
Belongs to a data project. This file system is bandwidth-optimized (similar to $SCRATCH), but data are persistent and internally backed up via snapshots.

• $DATA
Belongs to a data project. This file system is designed to store a huge amount of data on disk-based storage. The bandwidth is moderate. The file-system-internal backup is realized with the GPFS snapshot feature. For more information, look at

• $ARCHIVE
Is bound to a data project and acts as storage for all files not in use for a longer time. Data are migrated to tape storage by TSM-HSM. It is recommended to use tar files with a minimum size of multiple gigabytes and a maximum of 8 TB. The background is that recalling/restoring files from tape is much more efficient with a few large data streams than with thousands of small ones. See also
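The packing recommended for $ARCHIVE can be sketched as follows; the directory and file names are hypothetical, and on the real system the archive would then be moved to the $ARCHIVE repository:

```shell
# Hypothetical result directory standing in for real project output.
tmp=$(mktemp -d)
mkdir -p "$tmp/calculation"
printf 'data\n' > "$tmp/calculation/output.txt"

# Pack the whole directory into a single tar file; -C keeps the
# archive paths relative to the parent directory.
tar -cf "$tmp/calculation.tar" -C "$tmp" calculation

# Verify the archive content before deleting the originals.
entries=$(tar -tf "$tmp/calculation.tar" | wc -l)
echo "archive contains $entries entries"
rm -rf "$tmp"
```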

All GPFS file systems are managed by quotas for disk space and/or number of files. See also
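The $SCRATCH timestamp handling described above can be exercised as in this sketch; the file name is hypothetical, and GNU stat is assumed (with a BSD stat fallback):

```shell
# Hypothetical scratch file; on the real system this would live in $SCRATCH.
tmp=$(mktemp -d)
touch "$tmp/results.dat"      # creating the file sets all time stamps
touch -a "$tmp/results.dat"   # refresh only the access time, as in the FAQ

# Read back the recorded access time (epoch seconds): GNU stat -c %X,
# falling back to BSD stat -f %a.
access=$(stat -c %X "$tmp/results.dat" 2>/dev/null \
         || stat -f %a "$tmp/results.dat")
now=$(date +%s)
echo "access time refreshed $((now - access)) seconds ago"
rm -rf "$tmp"
```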