Institute for Advanced Simulation (IAS)


General FAQs about JUGENE

Error messages on JUGENE

Data questions

Error messages on JUGENE

DMA related RAS events APPL_0A2B and APPL_0201 --> improve communication performance

When the option -verbose 2 is used with the mpirun command, users might find information about the RAS events APPL_0A2B and/or APPL_0201 in stderr. These events are not critical (i.e. the application does not fail). However, performance might not be optimal.

APPL_0A2B - A DMA unit reception FIFO is full

Packets that arrive off the network are placed into a reception buffer. The default size is 8 MB per process. If a process is busy and does not call MPI_Wait often enough, this buffer can become full and the movement of further packets is stopped until the buffer is free again. This can slow down the application. In this case the buffer should be increased to the value given in the RAS event message using the environment variable DCMF_RECFIFO (see Tuning Applications for further information).

APPL_0201 - A DMA unit remote get injection FIFO is full

When a remote get packet arrives off the network, the packet contains a descriptor describing data that is to be retrieved and sent back to the node that originated the remote get packet. The DMA injects that descriptor into a remote get injection buffer. The DMA then processes that injected descriptor by sending the data back to the originating node. Remote gets are commonly done during point-to-point communications when large data buffers are involved (typically larger than 1200 bytes). The default size of the remote get injection FIFO is 32 KB. When a large number of remote get packets are received by a node, the remote get injection buffer may become full of descriptors. The size of the buffer can be adjusted to the value given in the RAS event message using the environment variable DCMF_RGETFIFO (see Tuning Applications for further information).

Error message ‘Invalid communicator’ when using BLACS and MPI

When BLACS (Basic Linear Algebra Communication Subprograms) is used in FORTRAN together with MPI, the routine BLACS_GRIDINIT expects, among other arguments, an MPI communicator as input. On return from the routine, this communicator is overwritten by an internal BLACS context number. In general, this context number is no longer a valid MPI communicator. Using it in MPI routines might eventually result in an error message like:

Fatal error in MPI_Attr_get: Invalid communicator, error stack

You should save the corresponding MPI communicator before calling BLACS_GRIDINIT to use it for further MPI calls in order to avoid this error.

Hint: This error might be difficult to detect. On a number of systems the MPI_COMM_WORLD communicator is set to 0 (zero). Incidentally, the BLACS context number is zero on many systems, too. Therefore, the code might run just fine on such systems, since the communicator and the context number have identical values. However, on the BG/P the MPI_COMM_WORLD is set to 0x44 and the code aborts.

Error message 'import site' failed; use -v for traceback

The error message

'import site' failed; use -v for traceback

might occur when using Python in a batch job. It occurs because the variable $HOME is not set on the I/O nodes of JUGENE. The error message can be avoided if $HOME is passed to the mpirun command:

mpirun -exp_env HOME ...

Application killed with signal 6 (executable too big)

Some users may find that some Blue Gene/P applications exit with signal 6 and without any further error message. This may be caused by an executable that is too big for the compute nodes, which have only 512 MB of memory per processor.

IBM is working on issuing an error message for this situation in the next release of the BG/P software.

Application killed with signal 7 (memory alignment error)

Some users may find that some Blue Gene/P jobs exit with a segmentation fault and signal 7 (SIGBUS). This is due to an undocumented feature on the Blue Gene/P that deals with memory alignment during I/O operations. This feature allows users to set a memory alignment error threshold for I/O operations in their jobs, but it causes confusion because of its default value of 1000. This threshold can be controlled using an environment variable.

Die as soon as a memory alignment error is encountered during I/O:

-env BG_MAXALIGNEXP=1

Silence all memory alignment errors during I/O:

-env BG_MAXALIGNEXP=-1

Configure a threshold to work around occasional memory alignment errors during I/O (e.g. 2000 in this case):

-env BG_MAXALIGNEXP=2000

Note that memory alignment errors during I/O do have the potential to reduce I/O performance and as such it is better to eliminate them rather than ignore them if possible.

OSABI value of executable file is not UNIX System V

If a program execution aborts with the error message

Load failed on 134.94.xx.yy: OSABI value of executable file is not UNIX System V

the called executable might be one that was compiled for BlueGene/L.

The exit code of the LoadLeveler job is then 256.

Magic value in ELF header of executable file is invalid

If a program execution aborts with the error message

Load failed on 134.94.xx.yy: Magic value in ELF header of executable file is invalid

the called executable might be one that was compiled for AIX.

The exit code of the LoadLeveler job is then 256.

Environment variable list shouldn't be longer than 2049 chars

If a program execution aborts with the error message

Environment variable list for the job is .... characters
It shouldn't be longer than 2049 chars - Aborting.

the list of environment variables passed to the job is too long. Instead of mpirun -env_all, use
mpirun -exp_env VARNAME1 [ -exp_env VARNAME2 ]
to export only the environment variables that are needed.
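
To judge in advance whether a chosen set of variables stays well below the limit, the lengths of the NAME=value pairs can be added up. The following is only a minimal sketch: the function name exported_chars is hypothetical, and how mpirun counts separators between the pairs is an assumption (here they are not counted), so the result should be read as an estimate.

```shell
# Hypothetical helper: estimate how many characters the given environment
# variables contribute when they are passed as NAME=value pairs.
# Assumption: separators between the pairs are not counted.
exported_chars() {
    total=0
    for name in "$@"; do
        eval "value=\${$name}"            # look up the variable named in $name
        pair="${name}=${value}"
        total=$(( total + ${#pair} ))
    done
    echo "$total"
}

# Example: estimate the size of two variables before using -exp_env
exported_chars HOME PATH
```

If the estimate for all needed variables is already close to the limit, it may help to shorten long values such as PATH before submitting the job.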

General FAQs about JUGENE

How to use system calls on JUGENE?

Because the compute nodes on the Blue Gene/P do not have a full Linux kernel implementation, some system routines are not available. In particular, fork(), exec() and the FORTRAN "call system" do not work. However, there are replacement routines which can be used instead. You can find the available routines in the Blue Gene/P Application Development Redbook.

Hints for FORTRAN users: These routines are C routines, but the FORTRAN compiler includes them automatically as well. You just have to be careful when using strings: they must end with the null character "\0". You can either append it to the string which you pass to the C routines, or you can use the following compiler flag:

-qnullterm

Here is an example FORTRAN program which renames "file1" to "file1a" and "file2" to "file2a":

program system_calls
  implicit none
  include 'mpif.h'
  character*5 :: filen1='file2'
  character*6 :: filen2='file2a'
  integer ierror,nranks,my_rank
  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nranks,ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD,my_rank,ierror)
  if (my_rank == 0) then
    write(*,*) 'Moving file1 to file1a'
    call rename("file1","file1a")
    write(*,*) 'Moving ',filen1,' to ',filen2
    call rename(filen1,filen2)
  endif
  call MPI_FINALIZE(ierror)
end

Compilation:

mpixlf90 -qnullterm *.f90

Why should jobs write regular checkpoints?

The enhanced complexity of the new-generation supercomputers at JSC increases the probability that a job might be affected by a failure. Therefore, we strongly encourage all users of these systems to write regular checkpoints from their applications to avoid losses of CPU time when a job is aborted. There will be no refund of CPU time in the case of a failed job!

Tip: Besides checkpointing, jobs with a time limit of less than the maximum of 24 hours might have a better turnaround time on JUGENE because they can be used to optimally fill the machine while it is being prepared for regular maintenance slots or full machine runs.

How can a soft limit for the wall clock time be used?

At the moment there is no way to use the soft limit.
The signal that LoadLeveler sends on the front-end node is not routed to the Blue Gene application.

The only alternative, although not a fully adequate one, is to check the user time from within the application and to estimate the remaining time manually; see: How to do time measurements in Fortran and C?
(The boot time for the partition is part of the wall_clock_limit.)

Why is EOF not found on stdin when using llrun?

It is strongly recommended not to read from stdin on JUGENE!

If you have a program doing a fixed number of reads, it works, but any read looking for EOF does not terminate when using llrun.

(EOF works fine with mpirun in a LoadLeveler job.)

It seems that EOF gets lost in stdin. This happens

  • with llrun,
  • with mpirun in a LoadLeveler job,
  • with C,
  • with Fortran.

The job will hang in all three situations:

  • mpirun .... < my_stdin
  • cat my_stdin | mpirun (as in the example in the Redbook)
  • # @ input = my_stdin (in the job control)

As a workaround, you have to open the input file in your program and read from this file. You can pass the file name as an argument to your program if you do not want to recompile your program or to move your input file every time you change the name of your input.

Running configure on JuGene Compute Nodes

Running configure on a system which requires cross-compiling like Blue Gene/P is often tricky. The configure script tries to start small pieces of code to check the environment.

Starting configure within a LoadLeveler job works much better than using llrun for executing every small program.

Here is an example of how this can be done. The compiler settings, flags and execution steps depend heavily on the software package and have to be taken from its installation documentation.

# @ job_name = HDF5_Config
# @ comment = "BG/P Configure / 32"
# @ error = $(job_name).$(host).$(jobid).out
# @ output = $(job_name).$(host).$(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 00:30:00
# @ notification = never
# @ notify_user = t.user@fz-juelich.de
# @ job_type = bluegene
# @ bg_size = 32
# @ queue
echo "============================================="
BG_BASE="/bgsys/drivers/ppcfloor"
export CC="mpcc"
export CXX="mpCC"
export F77="mpxlf"
export F90="mpxlf90"
export F9X="mpxlf90"

BG_INCLUDE="-I$BG_BASE/comm/include \
-I$BG_BASE/arch/include \
-I$BG_BASE/gnu-linux/powerpc-bgp-linux/sys-include "

export CFLAGS="-O3 -qarch=450 $BG_INCLUDE "
export FFLAGS="-O3 -qarch=450 $BG_INCLUDE "
export CXXFLAGS="-O3 -qarch=450 $BG_INCLUDE "

export RUNSERIAL="mpirun -np 1 "
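# single quotes keep $$ literal; inside double quotes the shell would replace $$ with its process ID
export RUNPARALLEL='mpirun -np $${NPROCS:=3} '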

./configure --enable-parallel --enable-fortran --disable-cxx --disable-stream-vfd

echo "============================================="
make
echo "============================================="
make check

How to use the GNU auto configure tools on JuGene?

When using the GNU auto configure tools (automake, autoconf, ...) with the Blue Gene/P backend compilers, the following sequence of commands seems to work best:

export CXX=/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-g++
export CC=/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-gcc
export CFLAGS="-g -O2 -dynamic"
export CXXFLAGS="-g -O2 -dynamic"

Then

libtoolize --force
aclocal
autoheader
automake
autoconf

./configure --host=powerpc-bgp-linux ...

Core dump handling

Core dumps are disabled on JUGENE by default.

Because writing core files from thousands of nodes takes too much time, the generation of core files is suppressed.

How to enable core files?

The BG_COREDUMPDISABLED environment variable is specified when a job is submitted with mpirun:

BG_COREDUMPDISABLED=1 - core dumps disabled
BG_COREDUMPDISABLED=0 - core dumps enabled

The following mpirun command shows how to enable core files:

mpirun -env BG_COREDUMPDISABLED=0 -exe filename.rts

How to read core files?

Core files are plain text files that include traceback information in hexadecimal.

To read and convert the hexadecimal addresses the tool addr2line may help.
For more information use

addr2line -h

(Compilation should have included option -g)

What is the current software release?

The central path for BGP applications is

/bgsys/drivers/ppcfloor/

where libraries, include files, etc. can be found.

The current PTF set of the BGP software can be found by checking

ls -l /bgsys/drivers/ppcfloor

which is a link to the driver version e.g.

/bgsys/drivers/ppcfloor -> /bgsys/drivers/V1R2M0_200_2008-080513P

which means

<version>_<week*100>-<builddate>
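
As a small illustration of this naming scheme, the release part of a driver path can be extracted with plain shell string operations. This is a sketch only: the function name driver_release is hypothetical, and it assumes that the release is everything in front of the first underscore, as in the example above.

```shell
# Hypothetical helper: print the release part (e.g. V1R2M0) of a driver path.
# Assumption: the release is the text before the first underscore.
driver_release() {
    name=${1##*/}         # strip the directory part
    echo "${name%%_*}"    # keep everything before the first underscore
}

# On the system itself the link could be resolved first (assumes readlink):
#   driver_release "$(readlink /bgsys/drivers/ppcfloor)"
driver_release "/bgsys/drivers/V1R2M0_200_2008-080513P"
```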

How to do time measurements in Fortran and C?

Time measurement in Fortran:

For timing measurements on JUGENE Fortran users should use either the Fortran90 intrinsic system_clock or MPI_WTIME.


Time measurement in C:

#include <common/bgp_personality.h>
#include <common/bgp_personality_inlines.h>
#include <spi/kernel_interface.h>

static _BGP_Personality_t mybgp;
static double clockspeed=0.0;

/* initialize the clockspeed scaling factor */
unsigned bg_wtime_init() {
    unsigned clockMHz;
    Kernel_GetPersonality(&mybgp, sizeof(_BGP_Personality_t));
    clockMHz = BGP_Personality_clockMHz(&mybgp);
    clockspeed = 1.0e-6/(double)clockMHz;
    return clockMHz;
}

/* return time in seconds */
double bg_wtime() {
    return ( _bgp_GetTimeBase() * clockspeed );
}

How to determine the allocated memory of an application at runtime?

The available memory on a JUGENE node in VN mode is: 474 MB (with V1R4M2, June 2010)

Determining memory size/usage in C

Method 1:

#include <sys/resource.h>
#include <common/bgp_personality.h>
#include <common/bgp_personality_inlines.h>
#include <spi/kernel_interface.h>

static _BGP_Personality_t mybgp;

/* returns memory per core in MBytes */
unsigned bg_coreMB() {
    unsigned procMB, coreMB;
    Kernel_GetPersonality(&mybgp, sizeof(_BGP_Personality_t));
    procMB = BGP_Personality_DDRSizeMB(&mybgp);
    coreMB = procMB/Kernel_ProcessCount();
    return coreMB;
}

/* return maximum memory usage of process in kBytes */
unsigned bg_usedKB() {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return 0;
    return usage.ru_maxrss;
}

Compile with:

mpicc -O -I/bgsys/drivers/ppcfloor/arch/include

Method 2:

#include <bgp_SPI.h>

uint32_t allocated_memory = 0;
Kernel_GetMemorySize(KERNEL_MEMSIZE_HEAP, &allocated_memory);

Kernel_GetMemorySize along with KERNEL_MEMSIZE_HEAP gives back the allocated memory on heap of the application at runtime.

Example:

#include <mpi.h>
#include <stdlib.h>
#include <bgp_SPI.h>
#include <stdio.h>

const long MB = 1048576;

int main(int argc, char* argv[]) {
    int myrank, nprocs;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    uint32_t allocated_memory = 0;
    Kernel_GetMemorySize(KERNEL_MEMSIZE_HEAP, &allocated_memory);
    printf ("Memory allocation on heap: %ld MB\n", allocated_memory/MB);

    MPI_Finalize();
    return 0;
}

Compile with:

mpixlc_r determine_memory_example.c -o determine_memory_example.x -I/bgsys/drivers/ppcfloor/arch/include/ -I/bgsys/drivers/ppcfloor/arch/include/spi

If you want to obtain the amount of memory allocated on the stack, then replace KERNEL_MEMSIZE_HEAP with KERNEL_MEMSIZE_STACK given in the example above.

Determining memory size/usage in Fortran

Example (C helper routine print_memory_usage, which is called from the Fortran program further below):

#include <mpi.h>
#include "bgp_SPI.h"
#include <sys/times.h>
#include <sys/resource.h>
#include <assert.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

void print_memory_usage( void )
{
    unsigned int memory_size = 0;

    Kernel_GetMemorySize(KERNEL_MEMSIZE_HEAP, &memory_size);
    fprintf(stderr, "Memory size HEAP: %10ld \n",
            (long) memory_size);

    Kernel_GetMemorySize(KERNEL_MEMSIZE_STACK, &memory_size);
    fprintf(stderr, "Memory size STACK: %10ld \n",
            (long) memory_size);

    Kernel_GetMemorySize(KERNEL_MEMSIZE_HEAPAVAIL, &memory_size);
    fprintf(stderr, "Memory available HEAP: %10ld \n",
            (long) memory_size);

    Kernel_GetMemorySize(KERNEL_MEMSIZE_STACKAVAIL, &memory_size);
    fprintf(stderr, "Memory available STACK: %10ld \n",
            (long) memory_size);

    Kernel_GetMemorySize(KERNEL_MEMSIZE_HEAPMAX, &memory_size);
    fprintf(stderr, "Maximum memory HEAP: %10ld \n",
            (long) memory_size);
}

Compile with:

mpicc -c -I/bgsys/drivers/ppcfloor/arch/include/ -I/bgsys/drivers/ppcfloor/arch/include/spi memory_usage.c

In order to link against the SPI libraries the environment variable MPI_CCLIBS has to be set in the following way:

export MPI_CCLIBS=" -Wl,-rpath,/bgsys/drivers/V1R4M1_460_2009-091110P/ppc/runtime/SPI -lSPI.cna "

The code snippet given below shows how the amount of allocated memory can be obtained for a program at runtime:

program memory_usage

implicit none

integer, parameter :: zeilen=1000
integer, parameter :: spalten=10000
integer :: i

integer,dimension(10000000),parameter :: stack_test=999

integer, dimension(:,:), allocatable :: matrix

write(*,*)('=',i=1,60)
call print_memory_usage()

write(2, '(10i10)' )stack_test

write(*,*)('=',i=1,60)
call print_memory_usage()

allocate(matrix(zeilen,spalten))
matrix=-100
write(1, '(10i10)' )matrix

write(*,*)('=',i=1,60)
call print_memory_usage()

deallocate(matrix)

write(*,*)('=',i=1,60)
call print_memory_usage()
write(*,*)('=',i=1,60)

end program memory_usage

Compile and link with:

mpif90 get_memory.f90 memory_usage.o $MPI_CCLIBS


Environment variables

In case memory runs low in your jobs you can try to increase the memory available for your application on the nodes by using one or more of the environment variables, described in Compiling and Tuning.

How can I link Fortran subroutines into my C program?

To link Fortran subroutines either from libraries like essl or lapack or as parts of your own code into a C main program you have to link the following additional libraries after the Fortran routines and all Fortran libraries in your link statement:

-L${XLFLIB_FZJ} -lxl -lxlopt -lxlf90_r -lxlfmath \
-L${XLSMPLIB_FZJ} -lxlomp_ser -lpthread

FZJ has introduced the environment variables XLFLIB_FZJ and XLSMPLIB_FZJ as pointers to the recent compiler version, so that makefiles can be kept independent of the compiler changes.

What are the conditions for temporary data?

For temporary user data it is recommended to use $WORK instead of /tmp, because $WORK is a lot bigger.

Do not use /tmp!
/tmp is very small (about 1 GB) and only available for programs running on the front-end nodes. Data will be held only for 7 days.

Programs started with mpirun on a BlueGene partition are not able to access /tmp. Jobs trying this will be terminated.

How to generate and upload ssh keys?

In order to access the JSC computer systems you need to generate an ssh key pair. This pair consists of a public and a private part. Here we briefly describe how to generate and upload such a pair.

On Linux/UNIX

To create a new ssh key pair, log in to the local machine from which you want to connect to the JSC computer systems. Open a shell and use the following command:

ssh-keygen -b 2048 -t rsa

You are asked for a file name and location where the key should be saved. Unless you really know what you are doing, please simply take the default by hitting the enter key. This will generate the ssh key in the .ssh directory of your home directory ($HOME/.ssh).
Next, you are asked for a passphrase. Please, choose a secure passphrase. It should be at least 8 characters long and should contain numbers, letters and special characters like !@#$%^&*().

Important: You are NOT allowed to leave the passphrase empty!

You will be asked to upload the public part of your key ($HOME/.ssh/id_rsa.pub) on the JSC web site when you apply for an account. You must keep the private part ($HOME/.ssh/id_rsa) confidential.

Important: Do NOT remove it from this location and do NOT rename it!

You will be notified by email once your account is created and your public key is installed. To log in, please use

ssh <yourid>@<machine>.fz-juelich.de

where 'yourid' is your user id on the JSC system 'machine' (i.e. you have to replace 'machine' by the corresponding JSC system). You will be prompted for your passphrase of the ssh key which is the one you entered when you generated the key (see above).

On Windows

You can generate the key pair using, for example, the PuTTYgen tool, which is provided by the PuTTY project. Start PuTTYgen, choose SSH-2 RSA at the bottom of the window, set the 'number of bits in the generated key' to 2048 and press the 'Generate' button.

PuTTYgen will prompt you to generate some randomness by moving the mouse over the blank area. Once this is done, a new public key will be displayed at the top of the window.

Enter a secure passphrase. It should be at least 8 characters long and should contain numbers, letters and special characters like !@#$%^&*().

Important: You are NOT allowed to leave the passphrase empty!

Save the public and the private key. We recommend using 'id_rsa.pub' for the public and 'id_rsa' for the private part.

You will be asked to upload the public part of your key (id_rsa.pub) on a JSC web site when you apply for an account. You must keep the private part (id_rsa) confidential.

You will be notified by email once your account is created and your public key is installed. To log in, please use an ssh client for Windows, select the authentication method 'public-key', import the key pair you have generated above and log in to the corresponding JSC system with your user id. If you are using the PuTTY client, you can import the key in the configuration category 'Connection', subcategory 'ssh' -> Auth. Once this is done, you will be prompted for the passphrase of the ssh key, which is the one you entered when you generated the key (see above).

Adding additional keys

If you would like to connect to your account from more than one computer, you can create and use additional pairs of public and private keys.

After creating a pair of public/private keys there are two ways of installing the public key on the target machine:

Method 1 (Linux/Mac):

Use the ssh-copy-id command to simultaneously upload and add the public key file 'public_key.pub' to the account 'user' on the target machine 'targetmachine':

ssh-copy-id -i public_key.pub user@targetmachine

Please refer to the man-page of ssh-copy-id for further information.

Method 2 (all operating systems):

ii) upload the public key file to your account at the HPC-target system

ii-a) If the public key was created under Windows (e.g. in PuTTY), it has to be converted. This is done on the target HPC system with the command

ssh-keygen -i -f original_public_key_file.pub > new_public_key_file.pub

iii) open the (new) keyfile and copy the whole line

iv) append the line as a new line to the file ~/.ssh/authorized_keys

v) Make sure the private key sits in the correct place on your private computer.
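
Steps iii) and iv) of Method 2 can also be done without an editor. The following sketch appends a key line only if it is not already present; the function name add_public_key is hypothetical, and restricting the file to mode 600 is an assumption based on common ssh practice, not a documented JSC requirement.

```shell
# Hypothetical helper: append the public key stored in a file to an
# authorized_keys file unless exactly that line is already present.
add_public_key() {
    pubkey_file=$1
    auth_file=${2:-$HOME/.ssh/authorized_keys}
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"        # assumption: readable by the owner only
    key=$(cat "$pubkey_file")
    # -x: match whole lines, -F: treat the key as a fixed string
    if ! grep -qxF -- "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
}

# Example (run on the target system after uploading the key file):
#   add_public_key new_public_key_file.pub
```

Calling the function twice with the same key leaves only one copy of the line in the file.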

Replace SSH Key

In case the ssh key has to be replaced, use the following link: Upload of ssh-key

Note: This will replace ALL public keys by the new public key. If you use more than one key pair you will have to add your additional public keys as described above.

Data questions

How to restore a file from the archive directory?

All files within the user's archive directory ($ARCH) for long-term storage are automatically backed up by TSM (Tivoli Storage Manager). To restore a file, use

adsmback [-type=arch] &

on the login nodes of JUQUEEN or JUDGE, or on the GPFS gateway nodes of JUROPA. If the option -type is not specified, the user will be prompted for the type of file system:

Which type of filesystem should be restored? Enter: {home | arch | gpfshome}

This command grants access to the correct backup data of the user's assigned archive directory.

Follow the GUI by selecting:

File level -> archX -> group -> userid -> ...
Select files or directories to restore
Press [Restore] button


If the data should be restored to original location then choose within the Restore Destination window:

  • for JUQUEEN: Original location
  • for JUDGE: Original location
  • for JUROPA: Following location + /gpfs/archX + Restore complete path

Don't use the native dsmj command, which will not show any archive data.

How can I see which data is migrated?

There are three file systems that hold migrated data: /arch, /arch1, /arch2

  • These are so called archive file systems.
  • In principle all data in the file systems will be migrated to TSM HSM tape storage in tape libraries.
  • Data is copied to TSM backup storage prior to migration.
  • Every user owns a personal archive directory that can be specified by the $ARCH resp. $GPFSARCH variable.
  • Quotas are not set on the amount of storage but on the number of files per group/project. This is done because UNIX is still not able to handle millions of files in a file system with acceptable performance.

The native TSM-HSM command dsmls, which shows whether a file is migrated, is not available on JUQUEEN, JUDGE or JUROPA. This command could only run on JUST, the storage cluster that hosts the file systems for the HPC systems. However, JUST is not open for user access.

Please use

ls -ls [mask | filename]

to list the files. Migrated files can be identified by a block count of 0 in the first column (-s option) and an arbitrary number of bytes in the sixth column (-l option).

0 -rw-r----- 1 user group 513307 Jan 22 2008 log1
0 -rw-r----- 1 user group 114 Jan 22 2008 log2
0 -rw-r----- 1 user group 273 Jan 22 2008 log3
0 -rw-r----- 1 user group 22893504 Jan 23 2008 log4
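
The rule above (block count 0, size greater than 0) can be applied automatically to a longer listing. The filter below is a minimal sketch: the function name list_migrated is hypothetical, it assumes the column layout shown in the example output, and it does not handle file names that contain whitespace.

```shell
# Hypothetical helper: read "ls -ls" output on stdin and print the names of
# regular files that look migrated (0 blocks allocated, size greater than 0).
# Assumption: file names contain no whitespace.
list_migrated() {
    awk '$1 == "0" && $2 ~ /^-/ && $6 + 0 > 0 { print $NF }'
}

# Example: check all files in the current directory
ls -ls | list_migrated
```

Empty files are skipped on purpose, because a size of 0 bytes with 0 blocks says nothing about migration.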

How to restore a file from the home directory?

All files within the users' home directories ($HOME) are automatically backed up by TSM (Tivoli Storage Manager). To restore a file, use

adsmback [-type={home | gpfshome} ] &

on the login nodes of JUQUEEN or JUDGE, or on the GPFS gateway nodes of JUROPA. If the option -type is not specified, the user will be prompted for the type of file system:

Which type of filesystem should be restored? Enter: {home | arch | gpfshome}

This command grants access to the correct backup data of the user's assigned home directory. 'gpfshome' applies to JUROPA only because JUROPA users have an additional GPFS home directory besides the standard Lustre home directory.

Follow the GUI by selecting:

File level -> [j]homeX -> group -> userid -> ...
Select files or directories to restore
Press [Restore] button

If the data should be restored to original location then choose within the Restore Destination window:

  • for JUQUEEN: Original location
  • for JUDGE: Original location
  • for JUROPA (GPFS): Following location + /gpfs/homeX + Restore complete path

Don't use the native dsmj command, which will not show any home data.

How can I recall migrated data?

Normally migrated files are automatically recalled from TSM-HSM tape storage when the file is accessed at JUQUEEN (login nodes only), JUDGE (login and compute nodes), or JUROPA (GPFS gateway nodes only).

For an explicit recall the native TSM-HSM command dsmrecall is not available. Please use

tail <filename>
or:
head <filename>

to start the recall process. These commands will not change any file attribute and the migrated version of the file stays valid.

It is strongly recommended not to use

touch <filename>

because this changes the timestamp of the file, so a new backup copy must be created and the file has to be migrated again. These two additional processes waste compute resources if the file is only read by further processing.
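
To recall several files in one go, the head command recommended above can be wrapped in a small loop. This is a sketch only; the function name recall_files is hypothetical.

```shell
# Hypothetical helper: start the recall of each given file by reading its
# beginning with head. The file content and its timestamp are not modified.
recall_files() {
    status=0
    for f in "$@"; do
        head "$f" > /dev/null || status=1   # head reports unreadable files on stderr
    done
    return "$status"
}

# Example:
#   recall_files log1 log2 log3
```

The return value is non-zero if at least one of the files could not be read.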

What data quotas do exist and how to list usage?

Disk quota limitations in the $HOME and $WORK file systems have been in effect since the end of October 2007. They became necessary because in the past file systems were blocked when single users created millions of files, which degraded the performance of system commands (ls, du). In addition, the migration of $HOME data no longer worked reliably, and therefore the new type of archive file system, $ARCH, was introduced. In general, the following limitations have applied since December 2009. The numbers are updated according to the current capacity of the file systems.


Data quota per group/project within GPFS file systems

File System   Disk Space                 Number of Files
              Soft Limit   Hard Limit    Soft Limit   Hard Limit
$HOME         6 TB         7 TB          2 million    2.2 million
$WORK         20 TB        21 TB         4 million    4.4 million
$ARCH         - (see note)               2 million    2.2 million

Note:
No hard disk space limit exists for $ARCH, but if more than 100 TB is needed, please contact the supercomputing support at JSC (sc@fz-juelich.de) to discuss optimal data processing, particularly with regard to the end of the project. Furthermore, special guidelines may exist for some projects.

File size limit

Although the file size limit at operating system level (Linux for JUDGE and JUQUEEN) is set to unlimited (ulimit -f), the maximum file size cannot exceed the GPFS group quota limit for the corresponding file system. The current limits can be listed with q_dataquota.

List data quota and usage by group and user

Members of a group/project can display the hard limits, quotas (soft limits) and usage of each user of the group in a special group file (/homex/group/usage.quota), which is updated every three hours within prime shift (see the timestamp at the top of the file). Since the end of January 2013, the unit of measure has been GB instead of KB for easier reading. As a consequence, the displayed values are always rounded up to the next GB value. If less than 1 GB is used, e.g. 256 KB or 128 MB, 1 GB is displayed.

more $HOME/../usage.quota

This file can also be listed in a short and long format by the command

q_dataquota [-l]

The short format will display the group quota limits and group data usage for each file system followed by the usage of the user herself/himself. The long listing includes the data usage of all users of the group in descending order.
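
The rounding to full GB used in usage.quota can be reproduced with integer arithmetic. The sketch below is an illustration only; the function name is hypothetical, and it assumes 1 GB = 1024 * 1024 KB.

```shell
# Illustration only: round a size given in KB up to whole GB
# (assumption: 1 GB = 1024 * 1024 KB).
kb_to_gb_rounded_up() {
    echo $(( ($1 + 1048575) / 1048576 ))
}

kb_to_gb_rounded_up 256        # 256 KB, prints 1
kb_to_gb_rounded_up 131072     # 128 MB, prints 1
kb_to_gb_rounded_up 1048577    # 1 GB + 1 KB, prints 2
```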

Notes:

  • Although no quota limits for a group may be listed for the $WORK file system, quotas are set! Quota counting starts with the first file created by a user of the group.
  • If the message Cannot exceed the user or group quota is displayed when writing data to a file, the sum of used and in_doubt blocks has exceeded the hard limit. Please be aware that not only the used blocks are taken into account!
  • The column grace reports the status of the quota

    none - no quota exceeded
    xdays - remaining grace period to clean up after the soft limit is exceeded
    expired - no data can be written before cleanup

List up-to-date data quota and usage by group

A prompt update of the group's data usage and limits can be displayed with:

mmlsquota -g <group> [ <FS_without_leading_/> | -C just.fz-juelich.de ]

The output for the specified file system or all file systems of the JUST storage cluster will show the usage summary of the specified group (not the members) in KByte units by default. For better reading a unit of measure can be specified or GPFS can select the best that fits. To do so specify the option (with GPFS 3.5.x)

--block-size {M|G|T|auto}


System actions when limits are exceeded

  • Soft limit
    If any soft limit is exceeded, a grace period of 14 days starts to count down. If not enough data is deleted to get below the limit, the quota expires after the grace period and files can no longer be created or expanded. If the hard limit is exceeded in the meantime, the quota expires immediately.
  • Hard limit
    If any hard limit is exceeded (the sum of used and in_doubt is taken into account), the users in the group cannot create any new files or expand existing files in the corresponding file system until the number of files or the allocated disk space is below the limit.
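
The two rules can be summarized in a few lines of shell. This is a simplified model, not the GPFS implementation: the function name quota_status and its arguments are hypothetical, and comparing the soft limit against the used value only (without in_doubt) is an assumption.

```shell
# Hypothetical model of the rules above. Arguments (all integers):
#   used  in_doubt  soft_limit  hard_limit  days_over_soft_limit
# Prints a status in the style of the grace column: none, "N days" or expired.
# Assumption: the soft limit is compared against the used value only.
quota_status() {
    used=$1; in_doubt=$2; soft=$3; hard=$4; days_over=$5
    grace_days=14
    if [ $(( used + in_doubt )) -gt "$hard" ]; then
        echo "expired"                            # hard limit exceeded
    elif [ "$used" -gt "$soft" ]; then
        if [ "$days_over" -ge "$grace_days" ]; then
            echo "expired"                        # grace period used up
        else
            echo "$(( grace_days - days_over )) days"
        fi
    else
        echo "none"
    fi
}

# Example: 7 TB used, soft limit 6 TB, hard limit 7 TB, 3 days over the soft limit
quota_status 7 0 6 7 3
```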

Recommendation for users with a lot of small files

Users with applications that create a lot of relatively small files should reorganize the data by collecting these files within tar-archives using the

tar -cvf archive-filename ...

command. The problem is really the number of files (inodes) that have to be managed by the underlying operating system and not the space they occupy in total. On the other hand, please keep in mind the recommendations under File size limit.
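
As an illustration, a whole directory of small files can be packed into one archive and checked afterwards. The sketch below makes some assumptions: the function name bundle_dir and the file names in the usage comment are hypothetical, and the original files are deliberately not deleted.

```shell
# Hypothetical helper: pack the contents of a directory into one tar archive
# and print the number of entries (files and directories) stored in it.
bundle_dir() {
    src_dir=$1
    archive=$2
    tar -cf "$archive" -C "$src_dir" . || return 1
    tar -tf "$archive" | wc -l
}

# Example:
#   bundle_dir "$WORK/run42/output" "$WORK/run42_output.tar"
```

Only after the listing of the archive has been checked should the small files be removed to free the corresponding inodes.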

How to share files by using ACLs?

ACLs (Access Control Lists) provide a means of specifying access rights on files. GPFS access control lists allow the definition of access rights for other users or groups.

Create or change a GPFS access control list

mmeditacl <filename>

which will open the ACL-definition of <filename> with an editor.


Note that for this command to work the EDITOR environment variable must contain a complete path name, for example on JUQUEEN: export EDITOR=/usr/bin/vim

Example:
Set read and execute permission for user1 and execute permission only for user2 on directory dir1:

mmeditacl dir1
.... (append 3 lines to the displayed lines) ....
mask::r-x-
user:user1:r-x-
user:user2:--x-

Note that mask must have the maximum permission compared to any user permission of this ACL and that access must be granted to every directory in the hierarchy (esp. the home directory). The 4th character stands for the GPFS specific control permission.

When the file is saved, the following has to be answered:

mmeditacl: 6027-967 Should the modified ACL be applied? (yes) or (no)

Which files have an access control list?

The command

ls -l

will show a "+" for every file that has an ACL set, e.g.

drwx------+ 2 user group 32768 Feb 21 09:25 dir1

Delete a GPFS access control list

mmdelacl <filename>

or remove the added lines by mmeditacl.

Apply a GPFS ACL recursively

Example:
To apply the ACL of dir1 to all files and directories below dir1, use:

# -exec passes each name as one argument, so names with blanks are handled
find dir1 -exec sh -c 'mmgetacl dir1 | mmputacl "$1"' sh {} \;

Documentation

Please see the man pages or IBM documentation for further commands:

mmdelacl, mmgetacl, mmputacl

What file system to use for different data?

In principle there are three GPFS file systems for different types of user data. Each file system has its own data policies.

  • $HOME
    Acts as repository for source code, binaries, libraries and applications with small size and I/O demands. Data within $HOME are backed up by TSM, see also

  • $WORK
    Acts as a temporary storage location with high I/O bandwidth (160 GB/s measured from a JUQUEEN application). If the application is able to handle large files and has high I/O demands, $WORK is the right file system to place them. Data within $WORK is not backed up, and a daily cleanup is performed.

    • Normal files older than 90 days will be purged automatically. In reality, both the modification and the access date are taken into account. For performance reasons, however, the access date is not set automatically by the system, but it can be set explicitly by the user with
      touch -a <filename>.
      The time stamps recorded with files can easily be listed with
      stat <filename>.
    • Empty directories, as they will arise amongst others due to deletion of old files, will be deleted after 3 days. This applies also to trees of empty directories which will be deleted recursively from bottom to top in one step.
  • $ARCH
    Acts as storage for all files that have not been in use for a longer time. Data are migrated to tape storage by TSM-HSM. It is recommended to use tar files with a maximum size of 1 TB. The reason is the speed of reading/writing data from/to tape. All data in $ARCH first has to be backed up, which takes about 10 h per 1 TB. Next the data is migrated to tape, which takes about 3 h per 1 TB. Please keep in mind that a recall of the data will need approximately the same time. See also

All GPFS file systems are managed by disk space and/or number of files quotas, see also
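
For the 90-day rule in $WORK described above, files that are at risk of being purged can be listed with find. This is a sketch under assumptions: the function name purge_candidates is hypothetical, and it assumes that a file is purged only if both its modification and its access time are old; the exact purge policy is defined by JSC.

```shell
# Hypothetical helper: list regular files whose modification AND access time
# both lie more than the given number of full days in the past (default 90).
# Assumption: both time stamps must be old for a file to be purged.
purge_candidates() {
    dir=$1
    days=${2:-90}
    find "$dir" -type f -mtime +"$days" -atime +"$days"
}

# Example:
#   purge_candidates "$WORK"
```

Files on this list that are still needed can be protected with touch -a, as described above, or moved to $ARCH.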

