# Debugging Parallel Applications

If an application aborts unexpectedly, it is useful to monitor its execution in more detail, i.e. to check which branches of the code are actually executed, which values the variables actually hold, which parts of the memory are used, etc.

The simplest way to debug is to add print statements to the code in order to get the desired information. However, this is tedious: each time a print or write statement is added, the source needs to be recompiled and the application rerun. Furthermore, since the code is modified, the runtime conditions change, which may influence the behavior of the application. Therefore, this way of debugging is not recommended.

Instead, the compiler itself offers the possibility to check for certain errors when the code is compiled. This requires special compiler flags, which are described in more detail in the next section. It is recommended to try this first when debugging is necessary, because it is easy to use and does not require any additional software.

However, not all errors can be detected this way, since some occur only at run time. In this case debuggers need to be employed. Debuggers are powerful tools for analysing the execution of applications on the fly, i.e. while they are running. In general, the application needs to be recompiled once with appropriate compiler flags and is then executed under the control of the debugger.

For an overview, see the slides "Module Setup and Compiler" from the last Supercomputer Usage class.

## Compiler flags

### Debugging options of the compilers

The following list explains useful debugging options for the XL compilers. Simply add them to the compile command you usually use for your application. The information is taken from the man pages of the XL compilers; for further information about compiler flags, type man bgxlf or man bgxlc.

-O0

With this option all optimizations performed by the compiler are switched off. Errors can sometimes be caused by overly aggressive compiler optimizations (rounding of floating-point numbers, rearrangement of loops and/or operations, etc.). If you encounter problems that might be connected to such issues (for example, wrong or inaccurate numerical results), try this option and check whether the problem persists. If it does not, increase the optimization level step by step.

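The sensitivity to rearranged operations comes from the fact that floating-point addition is not associative. The following C sketch sums the same three values with two different groupings; an optimizer that is allowed to reorder operations may effectively turn one form into the other. The function names are hypothetical and chosen only for this illustration.

```c
/* Floating-point addition is not associative: the grouping of the
 * operands can change the rounded result. Function names are
 * hypothetical and only serve this illustration. */

/* Computes (a + b) + c. */
double sum_left_to_right(double a, double b, double c)
{
    volatile double partial = a + b;  /* volatile: keep the written order */
    return partial + c;
}

/* Computes a + (b + c). */
double sum_right_to_left(double a, double b, double c)
{
    volatile double partial = b + c;  /* volatile: keep the written order */
    return a + partial;
}
```

With a = 1e20, b = -1e20 and c = 1.0 the first function returns 1.0, whereas the second returns 0.0, because the 1.0 is lost when it is added to -1e20. If results differ between -O0 and a higher optimization level, a difference of this kind is one possible cause.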
-qcheck[=<suboptions_list>]

For Fortran this option is identical to the -C option (see list of flags for Fortran codes below). For C/C++ codes this option enables different runtime checks, depending on the suboptions_list (colon-separated list, see below) specified, and raises a runtime exception (SIGTRAP signal) if a violation is encountered.

• all: Enables all suboptions.
• bounds: Performs runtime checking of addresses when subscripting within an object of known size.
• divzero: Performs runtime checking of integer division. A trap will occur if an attempt is made to divide by zero.
• nullptr: Performs runtime checking of addresses contained in pointer variables used to reference storage.

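To make the three error classes concrete, the following C sketch performs the corresponding checks by hand. It is not how the XL compiler implements -qcheck (the compiler inserts its own checks and raises SIGTRAP); the status codes and function names are assumptions made for this illustration.

```c
#include <stddef.h>

/* Hypothetical status codes, one per -qcheck error class. */
enum check_status { CHECK_OK = 0, CHECK_BOUNDS, CHECK_DIVZERO, CHECK_NULLPTR };

/* bounds / nullptr: element i of an array with n elements may only be
 * read if the pointers are valid and i < n. */
enum check_status checked_read(const int *array, size_t n, size_t i, int *out)
{
    if (array == NULL || out == NULL)
        return CHECK_NULLPTR;
    if (i >= n)
        return CHECK_BOUNDS;
    *out = array[i];
    return CHECK_OK;
}

/* divzero: an integer division is only performed for a non-zero divisor. */
enum check_status checked_div(int num, int den, int *out)
{
    if (out == NULL)
        return CHECK_NULLPTR;
    if (den == 0)
        return CHECK_DIVZERO;
    *out = num / den;
    return CHECK_OK;
}
```

With -qcheck the unchecked forms array[i] and num / den would be guarded automatically, so the source code itself does not have to be changed for a debugging build.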
-qflttrap[=<suboptions_list>]

Generates instructions to detect and trap runtime floating-point exceptions.

<suboptions_list> is a colon-separated list of one or more of the following suboptions:

• enable: Enables trapping of specified exception.
• imprecise: Only checks for the specified exceptions on subprogram entry and exit.
• inexact: Detects floating-point inexact exceptions.
• invalid: Detects floating-point invalid operation exceptions.
• nanq: Generates code to detect and trap NaNQ (Quiet Not-a-Number) exceptions handled or generated by floating-point operations.
• overflow: Detects floating-point overflow.
• underflow: Detects floating-point underflow.
• zerodivide: Detects floating-point division by zero.

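Without trapping, exceptional floating-point operations do not stop the program; they silently produce special values that propagate through later calculations. The following C sketch, assuming IEEE 754 arithmetic, shows what a division yields in the zerodivide, invalid and overflow cases. The enum and function names are hypothetical.

```c
#include <math.h>

/* Hypothetical classification used only in this illustration. */
enum quotient_kind { QUOTIENT_FINITE, QUOTIENT_INFINITE, QUOTIENT_NAN };

/* Divides num by den without any trapping and reports what kind of
 * value the operation produced. */
enum quotient_kind classify_quotient(double num, double den)
{
    volatile double n = num, d = den;  /* volatile: compute at run time */
    double q = n / d;

    if (isnan(q))
        return QUOTIENT_NAN;       /* e.g. 0.0 / 0.0 (invalid operation) */
    if (isinf(q))
        return QUOTIENT_INFINITE;  /* e.g. 1.0 / 0.0 or an overflow */
    return QUOTIENT_FINITE;
}
```

A program compiled with suitable -qflttrap suboptions would instead be interrupted at the offending operation, which makes the origin of an infinity or NaN much easier to locate.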
-qhalt=<sev>

Stops the compiler after the first phase if the severity level of errors detected equals or exceeds the specified level <sev>. The severity levels in increasing order of severity are:

• i: informational messages
• l: language-level messages (Fortran only)
• w: warning messages
• e: error messages
• s: severe error messages
• u: unrecoverable error messages (Fortran only)

-qinitauto[=<hex_value>]

Initializes each byte or word of storage for automatic variables to the specified hexadecimal value <hex_value>. This generates extra code and should only be used for error determination. If you specify -qinitauto without a <hex_value>, the compiler initializes each byte of automatic storage to zero.
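The choice of the fill value matters for how visible an uninitialized variable becomes. The following C sketch mimics byte-wise initialization by hand, assuming IEEE 754 doubles; the function name is hypothetical, and the fill value FF is only an example, not a recommendation taken from the man pages.

```c
#include <math.h>
#include <string.h>

/* Returns a double whose bytes are all set to fill_byte, mimicking by
 * hand what byte-wise initialization of an automatic variable does.
 * Hypothetical helper for illustration only. */
double double_filled_with(unsigned char fill_byte)
{
    double value;

    memset(&value, fill_byte, sizeof value);
    return value;
}
```

A double filled with FF bytes reads as NaN, so any calculation that uses it becomes conspicuous, whereas a zero fill yields 0.0 and can hide the missing initialization.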

The following flags can be used only with Fortran codes:

-C

Checks each reference to an array element, array section, or character substring for correctness. This way some array-bound violations can be detected.

-qinit=f90ptr

Makes the initial association status of pointers disassociated instead of undefined. This option applies to Fortran 90 and above. The default association status of pointers is undefined.

-qsigtrap[=]

Sets up the specified trap handler to catch SIGTRAP exceptions when compiling a file that contains a main program. This option enables you to install a handler for SIGTRAP signals without calling the SIGNAL subprogram in the program.

The following flags apply only to C/C++ codes:

-qformat=[<options_list>]

Warns of possible problems with string input and output format specifications. Functions diagnosed are printf, scanf, strftime, strfmon family functions and functions marked with format attributes.
<options_list> is a comma-separated list of one or more of the following suboptions:

• all: Turns on all format diagnostic messages.
• exarg: Warns if excess arguments appear in printf and scanf style function calls.
• nlt: Warns if a format string is not a string literal, unless the format function takes its format arguments as a va_list.
• sec: Warns of possible security problems in use of format functions.
• y2k: Warns of strftime formats that produce a 2-digit year.
• zln: Warns of zero-length formats.

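The nlt and sec suboptions concern format strings that are not string literals. A minimal C sketch of the pattern they point at, and of the usual remedy, could look as follows; the function name is hypothetical.

```c
#include <stdio.h>

/* Copies untrusted text into buf. The format string is the literal "%s"
 * and the text is passed as an argument, so conversion specifiers inside
 * the text are never interpreted. Hypothetical helper for illustration. */
int format_untrusted(char *buf, size_t size, const char *text)
{
    /* Risky alternative: snprintf(buf, size, text);
     * there the text itself would be used as the format string. */
    return snprintf(buf, size, "%s", text);
}
```

Passing text such as "100%n done" through this function simply copies it, while the risky form in the comment would interpret %n as a conversion specifier.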
-qinfo[=[<suboption>][,<groups_list>]]

Produces or suppresses additional informational messages. <groups_list> is a colon separated list. If a <groups_list> is specified along with a <suboption>, a colon must separate them.

The suboptions are:

• all: Enables all diagnostic messages for all groups.
• private: Lists shared variables that are made private to a parallel loop.
• reduction: Lists variables that are recognized as reduction variables inside a parallel loop.

The list of groups that can be specified is extensive, so only a few are given here. For a complete list please refer to the manual page of the bgxlc compiler.

• c99: C code that might behave differently between C89 and C99 language levels
• cls: C++ classes
• cmp: Possible redundancies in unsigned comparisons
• cnd: Possible redundancies or problems in conditional expressions
• gen: General diagnostic messages
• ord: Unspecified order of evaluation
• ppt: Trace of preprocessor actions
• uni: Uninitialized variables
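As an example of what the cmp group is about, the following C sketch shows a loop over an unsigned index. The comment contains the faulty form with a redundant comparison; the code itself is a corrected version. Whether a given compiler version reports exactly this loop is an assumption, and the function name is hypothetical.

```c
#include <stddef.h>

/* Sums values[n-1] down to values[0] using an unsigned index. */
long sum_backwards(const int *values, size_t n)
{
    long total = 0;

    /* A loop written as
     *     for (size_t i = n - 1; i >= 0; i--)
     * never terminates normally: i is unsigned, so "i >= 0" is always
     * true and i wraps around after reaching 0. */
    for (size_t i = n; i > 0; i--)
        total += values[i - 1];

    return total;
}
```

The corrected loop compares with i > 0 and indexes with i - 1, so it also behaves correctly for n = 0.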

### Compiler flags for using debuggers

In order to run your code under the control of a debugger, you need to recompile your application including the following compiler flags (XL compilers):

 -g -qfullpath

In addition, the flag -qkeepparm may be useful. When specified, it ensures that function parameters are stored on the stack even if the application is optimized. As a result, parameters remain in the expected memory location, so that debuggers can access the values of these incoming parameters.

## Available debuggers

Once you have compiled your application with the correct compiler flags you can run your application under the control of a debugger and monitor the behavior on the fly in detail.

### DDT

The Distributed Debugging Tool (DDT) is a graphical debugger supporting C, C++, Fortran 77, and Fortran 90 programs. Among other features it offers:
• 1D + 2D Array Data visualization
• Support for MPI parallel debugging (automatic attach, message queues)
• Support for OpenMP (Version 2.x and later)
• Job submission from within debugger

#### Running DDT on JUGENE

Important: To use the graphical user interface, make sure you are logged in with ssh -X. If you are not directly connected to JUGENE, use the -X option for every ssh connection along the way and make sure that your local system (laptop, PC) has a running X server!

In order to debug your program load the UNITE and ddt modules first:
Then start the DDT debugger by typing

    ddt

After clicking on the DDT logo a welcome dialog box appears.

Choose Run and Debug a Program.

In the DDT Run dialog box:

• select your application (after compilation with the appropriate compiler flags),
• adjust the number of nodes and OpenMP settings if applicable,
• (for further options, click Advanced),
• click Submit.
The application is submitted to the batch system and queued.

Once the job is launched DDT will attach to the application, the DDT process window will appear and you can start to debug your application.

For further information about the DDT debugger and its capabilities please see the DDT Documentation (Allinea Software).

### Totalview

Totalview is a very powerful debugger supporting C, C++, Fortran 77, Fortran 90, PGI HPF and assembler programs and offers among others the following features:
• C++ support (templates, inheritance, inline functions)
• F90 support (user types, pointers, modules)
• 1D + 2D Array Data visualization
• Support for parallel debugging (MPI: automatic attach, message queues, OpenMP, pthreads)
• Scripting and batch debugging
• Memory Debugging
• Reverse Debugging with ReplayEngine

#### Using Totalview interactively

Important: To use the graphical user interface, make sure you are logged in with ssh -X. If you are not directly connected to JUGENE, use the -X option for every ssh connection along the way and make sure that your local system (laptop, PC) has a running X server!

In order to debug your program with Totalview load the UNITE and Totalview modules first:

The most common way to use Totalview (like any other debugger) is interactively, with a graphical user interface. To do so, start your application (after compilation with the appropriate compiler flags) with llrun using the option -tv.

For example:
This will start the program application.x with <ntasks> tasks and <nthreads> threads per task in VN mode. At most 2048 tasks can be viewed in VN mode. If your application is a pure MPI code, you can omit the -env option. After the corresponding partition is booted, Totalview will launch three windows: the root window, the startup-parameter window and the process window.

The startup-parameter window has four tabs: Debugging Options, Arguments, Standard I/O and Parallel. If you wish to activate memory debugging, check the corresponding box in the Debugging Options tab. If you would like to change or add arguments that are passed to your application or to mpirun, you can do so under Arguments. Please do not change anything in Parallel. Once you have made all the changes needed, click on OK.

Click on GO in the process window of Totalview. Totalview will proceed to execute the mpirun command and launch your application. This may take several minutes depending on the size of the partition you have requested (i.e. the number of tasks you would like to run).
A dialog window appears after clicking on GO.

Click on YES and after a few seconds the source code of the main program of your application appears in the process window and you can start debugging your code.

For a detailed description of the usage of Totalview, including a user's guide, please refer to the Totalview Documentation (Rogue Wave Software).

#### Using Totalview in batch mode

Sometimes using the interactive GUI for debugging is not practical, for example when the error occurs only after several hours of execution. In this case it would be very cumbersome to wait until the code has reached the corresponding spot.
In such cases Totalview can be executed in batch mode. Prepare a job command file and launch your application with tvscript instead of mpirun.

The general syntax for tvscript on JUGENE is

    tvscript [ options ] -mpi BlueGene -np <ntasks> -starter_args "<filename> [ mpi-arguments ] [ -args program_args ]" mpirun

Here [ options ] are tvscript options, <filename> is the name of the executable to debug (it must be the first of the starter_args), and -args is followed by the program arguments, which are usually specified with the same option of the mpirun command. The last word of the command line must be mpirun.

Example: Job command script using tvscript:

    # @job_name = tvscript_dbg
    # @comment = "batch debugging"
    # @output = tvscript_dbg.out
    # @error = tvscript_dbg.err
    # @environment = COPY_ALL
    # @job_type = bluegene
    # @bg_size = 32
    # @wall_clock_limit = 00:30:00
    # @queue
    tvscript -create_actionpoint "functionA=>display_backtrace -show_arguments" -mpi BlueGene -np 4 -starter_args "application.x -mode VN" mpirun

The executable to debug is application.x, which runs with 4 tasks in VN mode. An action point is created at the beginning of the function named functionA. When tvscript reaches that action point, it logs a backtrace and the function's arguments.

When this job script is run, tvscript creates two log files:

    mpirun-<date>_<time>.slog
    mpirun-<date>_<time>.log

The slog file (Summary Log File) contains a summary of the events that occurred. In the example above, this file contains four entries (one for each task):

    Actionpoint function hit, performing action display_backtrace with \
    options -show_arguments
    Actionpoint function hit, performing action display_backtrace with \
    options -show_arguments
    Actionpoint function hit, performing action display_backtrace with \
    options -show_arguments
    Actionpoint function hit, performing action display_backtrace with \
    options -show_arguments

This indicates that all tasks reached the defined action point and performed the corresponding action (showing the arguments of the function functionA).
The log file contains more detailed information. In this case it lists, for each task, the names and values of the arguments of the function functionA.
For further information about tvscript and a complete list of options, please see the Totalview Documentation.

## Analyzing core dumps

If an application aborts due to an error, the current memory state of the application can be written to disk (core dump) before the execution stops. Because writing core files from thousands of nodes takes (too) much time, the generation of core dumps is suppressed. However, you can enable the generation of core dumps by setting the environment variable BG_COREDUMPDISABLED to 0 in your job command file:

    mpirun -env BG_COREDUMPDISABLED=0 [ <other mpirun options> ] application.x

where application.x is your application. Please compile your application with the -g option if you would like to analyze core dumps.

Important: Use this option with care, because a core dump is generated for each process, i.e. running with 16000 MPI tasks means that 16000 core dump files are generated! Before using this option, try to reproduce the error with the smallest possible number of tasks!

### Core dump analysis with DDT

To debug using core files, start DDT. Then click the Open Core Files button on the welcome screen. This opens the Open Core Files window, which allows you to select an executable and a set of core files. Click OK to open the core dump files and start debugging them. While DDT is in this mode, you cannot play, pause or step (because there is no process active). You are, however, able to evaluate expressions and browse the variables and stack frames saved in the core dump files. The End Session menu option will return DDT to its normal mode of operation.

### Core dump analysis with Totalview

Start totalview. After the source code of your application appears in the process window, go to the menu File and select New Program. Select Open a core file in the dialog box which appears and choose a core file. The process window displays the core file, with the Stack Trace, Stack Frame, and Source Panes showing the state of the process when it dumped core. The title bar of the process window names the signal that caused the core dump. The right arrow in the line number area of the Source Pane indicates the value of the program counter (PC) when the process encountered the error.

## Further reading, information and references

General related information: