#### **PIConGPU: 10 years of living with the HPC hardware zoo**

Michael Bussmann 19/05/2020 | JSC MSA Seminar



www.casus.science





SPONSORED BY THE

Federal Ministry of Education and Research KULTUR UND TOURISMUS



#### Radiation Tumor Therapy







#### Radiation Tumor Therapy with Ions



8 X-Ray Beams





#### Accelerators can become quite big machines



#### CASUS CENTER FOR ADVANCED SYSTEMS UNDERSTANDING

#### Can we make them smaller?



## **Plasma accelerators** Lasers FTW!







#### Making everything very easy with levitating plastic spheres



# In theory, theory is easy





#### Plasma accelerators (2015 on ORNL TITAN / #1 Top 500)



#### From 4 PByte to 100 kByte



Researchers at the German research laboratory Helmholtz-Zentrum Dresden-Rossendorf are using Titan to understand and control new methods for particle acceleration that could have big impacts on laser-driven tumor removal.




#### Spanning 6 orders of magnitude in time



 "cleaning" of temporal contrast with plasma mirror techniques

#### Features on sub-ps scale remain!



- Shot-to-shot fluctuation
- Experimentally accessible but still challenging to measure


#### Plasma accelerators (2018 on CSCS Piz Daint, #3 Top 500)



That was in the olden days — things are surely much better now!

"Overall, this is an outstanding proposal. The High Performance Computing resources requested are appropriate. The PIs should try to **reduce the data requirements and try to find a solution that is technically possible for CSCS**."



#### Validating codes on the atomic scale



#### Towards higher energies





Plasma instabilities may degrade ion beam quality.

They are clearly seen in experiment and simulation, but only simulations can provide atomic resolution of the plasma dynamics.



#### Looking at plasma dynamics at atomic resolution @ HIBEF / EU-XFEL







#### Different instabilities create different scattering images





#### Inversion of data is hard – Learning from CERN



Each system imaged is a full High Performance Computing simulation



#### Data is coming



4 Beamlines generate 80 % of the data on GPFS



- Low-level software stacks for heterogeneous computing
- Data dependency and data flow descriptions
- Abstraction of communication and communication topologies
- A new way of thinking about domain decomposition
- In-memory workflow coupling
- Visual analytics combined with immersive user interfaces, Machine Learning & Feedback
- Real time data fusion of experimental & simulation data and surrogate modeling



Data-intensive Human-in-the-Loop Software Stack

#### **The Particle-in-Cell algorithm**



#### Domain decomposition in Super Cells





#### The Particle-in-Cell algorithm
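
As a rough illustration of the cycle named in the slide title, the following is a minimal sketch of one particle-in-cell step: gather the field at each particle, push the particle, deposit its charge back onto the grid. It assumes a 1D periodic grid with unit cell size and linear weighting; the names `Grid1D`, `Particle` and `picStep` are hypothetical. PIConGPU itself is a 3D electromagnetic code, so this is not its implementation, only the general pattern.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Particle { double x; double v; };

// Hypothetical 1D periodic grid with unit cell size; values live on integer nodes.
struct Grid1D
{
    std::vector<double> E;    // field on the nodes
    std::vector<double> rho;  // charge density on the nodes
    explicit Grid1D(std::size_t n) : E(n, 0.0), rho(n, 0.0) {}
    std::size_t size() const { return E.size(); }
};

// Wrap a position into the periodic domain of length n.
inline double wrap(double x, double n)
{
    x = std::fmod(x, n);
    return x < 0.0 ? x + n : x;
}

// One cycle: gather field -> push particles -> deposit charge.
inline void picStep(Grid1D& g, std::vector<Particle>& ps,
                    double qOverM, double q, double dt)
{
    double const n = static_cast<double>(g.size());
    for (auto& r : g.rho) r = 0.0;
    for (auto& p : ps)
    {
        // Gather: linear interpolation of E from the two neighbouring nodes.
        double f = std::floor(p.x);
        std::size_t i0 = static_cast<std::size_t>(f) % g.size();
        std::size_t i1 = (i0 + 1) % g.size();
        double w1 = p.x - f;
        double const Ep = (1.0 - w1) * g.E[i0] + w1 * g.E[i1];

        // Push: update the velocity first, then the position.
        p.v += qOverM * Ep * dt;
        p.x = wrap(p.x + p.v * dt, n);

        // Deposit: spread the charge to the two nodes around the new position.
        f = std::floor(p.x);
        i0 = static_cast<std::size_t>(f) % g.size();
        i1 = (i0 + 1) % g.size();
        w1 = p.x - f;
        g.rho[i0] += q * (1.0 - w1);
        g.rho[i1] += q * w1;
    }
    // A full code would now update the fields from the deposited sources.
}
```

Every particle only touches the grid nodes next to it, which is why grouping particles by location, as on the following slides, matters for memory locality.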



#### Particle caching via Particle Frames
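
One way to picture supercells and particle frames is sketched below. It assumes a 1D grid, groups a fixed number of cells into a supercell, and stores the particles of each supercell in fixed-size frames laid out as struct-of-arrays. All names and sizes (`SuperCell`, `Frame`, 8 cells, 4 particles per frame) are hypothetical and chosen for illustration; the actual PMacc data structures are more involved.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sizes chosen for illustration only.
constexpr std::size_t cellsPerSuperCell = 8;  // 1D: 8 cells form one supercell
constexpr std::size_t particlesPerFrame = 4;  // fixed capacity of one frame

// A frame stores a fixed number of particles in struct-of-arrays form,
// so that one attribute of all particles in the frame is contiguous.
struct Frame
{
    std::array<float, particlesPerFrame> position{};
    std::array<float, particlesPerFrame> momentum{};
    std::size_t used = 0;  // number of occupied slots
};

// Each supercell owns the list of frames holding its particles.
struct SuperCell
{
    std::vector<std::unique_ptr<Frame>> frames;

    // Append a particle, opening a new frame when the last one is full.
    void insert(float pos, float mom)
    {
        if (frames.empty() || frames.back()->used == particlesPerFrame)
            frames.push_back(std::make_unique<Frame>());
        Frame& f = *frames.back();
        f.position[f.used] = pos;
        f.momentum[f.used] = mom;
        ++f.used;
    }

    std::size_t particleCount() const
    {
        std::size_t n = 0;
        for (auto const& f : frames)
            n += f->used;
        return n;
    }
};

// Map a global cell index to the supercell that contains it.
inline std::size_t superCellOf(std::size_t cellIdx)
{
    return cellIdx / cellsPerSuperCell;
}
```

With this layout, a group of threads working on one supercell can load a whole frame and the few field values of that supercell into fast memory and process them together.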



#### Abstraction Library for **Pa**rallel Kernel Acceleration



#### Single source heterogeneous many-core programming in C++



| Rank | System | Cores | Rmax<br>(TFlop/s) | Rpeak<br>(TFlop/s) | Power<br>(kW) |
|------|--------|-------|-------------------|--------------------|---------------|
| 1    | Summit - IBM Power System AC922, IBM POWER9 22C 3.07GHz,<br>NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM<br>DOE/SC/Oak Ridge National Laboratory<br>United States | 2,397,824 | 143,500.0 | 200,794.9 | 9,783 |
| 2    | Sierra - IBM Power System S922LC, IBM POWER9 22C 3.1GHz,<br>NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM /<br>NVIDIA / Mellanox<br>DOE/NNSA/LLNL<br>United States | 1,572,480 | 94,640.0 | 125,712.0 | 7,438 |
| 3    | Sunway TaihuLight - Sunway MPP, Sunway SW26010 260C<br>1.45GHz, Sunway, NRCPC<br>National Supercomputing Center in Wuxi<br>China | 10,649,600 | 93,014.6 | 125,435.9 | 15,371 |
| 4    | Tianhe-2A - TH-IVB-FEP Cluster, Intel Xeon E5-2692v2 12C 2.2GHz,<br>TH Express-2, Matrix-2000, NUDT<br>National Super Computer Center in Guangzhou<br>China | 4,981,760 | 61,444.5 | 100,678.7 | 18,482 |
| 5    | Piz Daint - Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries<br>interconnect, NVIDIA Tesla P100, Cray Inc.<br>Swiss National Supercomputing Centre (CSCS)<br>Switzerland | 387,872 | 21,230.0 | 27,154.3 | 2,384 |
| 6    | Trinity - Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Intel Xeon Phi<br>7250 68C 1.4GHz, Aries interconnect, Cray Inc.<br>DOE/NNSA/LANL/SNL<br>United States | 979,072 | 20,158.7 | 41,461.2 | 7,578 |
| 7    | AI Bridging Cloud Infrastructure (ABCI) - PRIMERGY CX2570 M4,<br>Xeon Gold 6148 20C 2.4GHz, NVIDIA Tesla V100 SXM2, Infiniband<br>EDR, Fujitsu<br>National Institute of Advanced Industrial Science and Technology<br>(AIST)<br>Japan | 391,680 | 19,890.0 | 32,576.6 | 1,649 |
| 8    | SuperMUC-NG - ThinkSystem SD530, Xeon Platinum 8174 24C<br>3.1GHz, Intel Omni-Path, Lenovo<br>Leibniz Rechenzentrum<br>Germany | 305,956 | 19,476.6 | 26,873.9 | |
| 9    | Titan - Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini<br>interconnect, NVIDIA K20x, Cray Inc.<br>DOE/SC/Oak Ridge National Laboratory<br>United States | 560,640 | 17,590.0 | 27,112.5 | 8,209 |
| 10   | Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom, IBM<br>DOE/NNSA/LLNL<br>United States | 1,572,864 | 17,173.2 | 20,132.7 | 7,890 |

Abstraction Library for **Pa**rallel **K**ernel **A**cceleration Parallel, redundant hierarchy (CUDA, OpenCL, HIP)





- **Grid**: whole parallel task
- **Block**: fully independent part of the grid
- **Warp**: group of synchronous threads
- **Threads**: executed concurrently
- **Elements**: sub-thread, sequential lock-step
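
The levels listed above can be emulated serially to see how a data index is derived from the worker indices. The sketch below is plain C++ and does not use the Alpaka API; `WorkDiv`, `firstElement` and `visitCounts` are hypothetical names, and the warp level is left out because it does not change the index arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical work division: blocks in the grid, threads per block,
// and elements processed sequentially by each thread.
struct WorkDiv
{
    std::size_t blocks;
    std::size_t threadsPerBlock;
    std::size_t elemsPerThread;
};

// First data index handled by a given (block, thread) pair.
inline std::size_t firstElement(WorkDiv const& wd, std::size_t block, std::size_t thread)
{
    std::size_t const globalThread = block * wd.threadsPerBlock + thread;
    return globalThread * wd.elemsPerThread;
}

// Serial emulation of the hierarchy: count how often each data index is visited.
inline std::vector<int> visitCounts(WorkDiv const& wd, std::size_t numElements)
{
    std::vector<int> visits(numElements, 0);
    for (std::size_t b = 0; b < wd.blocks; ++b)                  // grid -> blocks
        for (std::size_t t = 0; t < wd.threadsPerBlock; ++t)     // block -> threads
        {
            std::size_t const begin = firstElement(wd, b, t);
            for (std::size_t e = 0; e < wd.elemsPerThread; ++e)  // thread -> elements
            {
                std::size_t const i = begin + e;
                if (i < numElements)  // guard the ragged end of the data
                    ++visits[i];
            }
        }
    return visits;
}
```

On a GPU the two outer loops run in parallel and the element loop is typically short; on a CPU backend one would usually choose few threads per block and many elements per thread so that the inner loop can vectorize.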



**A**bstraction Library for **Pa**rallel **K**ernel **A**cceleration Parallel, redundant hierarchy (CUDA, OpenCL, HIP)





Abstraction Library for **Pa**rallel **K**ernel **A**cceleration Mapping the abstract hierarchy to real hardware







# Abstraction Library for **Pa**rallel **K**ernel **A**cceleration Alpaka Backends







#### Abstraction Library for Parallel Kernel Acceleration

#### Memory allocation and kernel call

```
// Init Host
using Host = alpaka::acc::AccCpuSerial< Dim, Size >;
using DevHost = alpaka::dev::Dev< Host >;
using PltfHost = alpaka::pltf::Pltf< DevHost >;
```

```
// Memory allocation
auto X_h = alpaka::mem::buf::alloc<float, Size>( devHost, extent );
auto X_d = alpaka::mem::buf::alloc<float, Size>( devAcc, extent );
```

```
// Copy from host to device
alpaka::mem::view::copy(stream, X_d, X_h, extent);
```

```
// Kernel creation and execution
VectorAdd kernel;
auto const exec( alpaka::exec::create< Acc >(
    workDiv,
    kernel,
    numElements,
    alpaka::mem::view::getPtrNative(X_d),
    alpaka::mem::view::getPtrNative(Y_d)
));
alpaka::stream::enqueue( stream, exec );
```







#### Abstraction Library for Parallel Kernel Acceleration

#### SIMD optimized vector addition

```
struct DaxpyKernel
{
    template< typename T_Acc >
    ALPAKA_FN_ACC void operator()(
        T_Acc const & acc,
        double const & alpha,
        double const * const X,
        double * const Y,
        int const & numElements
    ) const
    {
        using namespace alpaka;
        // Index of this worker within the whole grid
        auto const globalIdx = idx::getIdx< Grid, Threads >( acc )[0u];
        // Number of elements each thread processes sequentially
        auto const elemCount = workdiv::getWorkDiv< Thread, Elems >( acc )[0u];
        auto const begin = globalIdx * elemCount;
        auto const end = min( begin + elemCount, numElements );

        for( auto i = begin; i < end; i++ )
            Y[i] = alpha * X[i] + Y[i]; // Note difference between worker and data index
    }
};
```





# Abstraction Library for **Pa**rallel Kernel Acceleration Zero overhead (DGEMM)



Less than 6% overhead compared to native DGEMM implementation

#### Abstraction Library for **Pa**rallel Kernel Acceleration



#### Zero overhead (Vector Addition)

#### Alpaka CUDA PTX

```
mov.u32            %r3, %ctaid.x;
mov.u32            %r4, %ntid.x;
mov.u32            %r5, %tid.x;
mad.lo.s32         %r1, %r4, %r3, %r5;
setp.ge.s32        %p1, %r1, %r2;
@%p1 bra           BB6_2;

cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
mul.wide.s32       %rd5, %r1, 8;
add.s64            %rd6, %rd4, %rd5;
ld.global.f64      %fd2, [%rd6];
add.s64            %rd7, %rd3, %rd5;
ld.global.f64      %fd3, [%rd7];
fma.rn.f64         %fd4, %fd2, %fd1, %fd3;
st.global.f64      [%rd7], %fd4;
```

#### Native CUDA PTX

```
mov.u32            %r3, %ctaid.x;
mov.u32            %r4, %ntid.x;
mov.u32            %r5, %tid.x;
mad.lo.s32         %r1, %r4, %r3, %r5;
setp.ge.s32        %p1, %r1, %r2;
@%p1 bra           BB6_2;

cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
mul.wide.s32       %rd5, %r1, 8;
add.s64            %rd6, %rd4, %rd5;
ld.global.nc.f64   %fd2, [%rd6];
add.s64            %rd7, %rd3, %rd5;
ld.global.f64      %fd3, [%rd7];
fma.rn.f64         %fd4, %fd2, %fd1, %fd3;
st.global.f64      [%rd7], %fd4;
```


# Abstraction Library for **Pa**rallel **K**ernel **A**cceleration Heat diffusion simulation






# Abstraction Library for **Pa**rallel **K**ernel **A**cceleration CUPLA — CUDA2ALPAKA







Abstraction Library for **Pa**rallel Kernel Acceleration CUPLA – PIConGPU Plasma Simulation

Before: PIConGPU + PMacc 80k LOC After: 50k LOC

(20k in kernels) (1 year)

**René Widera**: porting 80k LOC in 3 weeks

[Figure: floating point efficiency (0% to 16%) in single and double precision for Interlagos, Power8, Haswell, K80 and native K80]






## Abstraction Library for **Pa**rallel **K**ernel **A**cceleration CUPLA – GAPD Diffraction Simulation










#### In-memory coupling of two Alpaka-fied codes

## In-memory workflow coupling

#### openPMD Eco-System

github.com/openPMD/openPMD-projects

openPMD standard (1.0.0, 1.0.1, 1.1.0): the underlying file markup and definition. A. Huebl et al., doi: 10.5281/zenodo.33624

- base standard: general description
- extensions: domain-specific, e.g. ED-PIC, SpeciesType, BeamPhysics

- readers: coupled simulations, post-processing frameworks, ..., e.g. SIMEX Platform, VisIt, yt-project, openPMD-viewer
- writers & converters: simulations, frameworks, measurements, e.g. PIConGPU, Warp, SIMEX Platform
- native data tools: HDF5, ADIOS1/2, NetCDF, ..., e.g. h5ls, h5repack, h5dump, bpdump
- HDF Compass: HDF5 & ADIOS file explorer, open and explore file trees
- openPMD-updater: update to new standard, edit in-place or write a new file
- openPMD-api: I/O library abstraction, file format agnostic
- data repositories: exchange and long-time archival, e.g. Zenodo, RODARE (HZDR)








#### The bandwidth hierarchy is killing us



#### In-memory workflow coupling





# **C++ JIT compilation and Jupyter Notebook integration** Cling and clang for Python-like C++ with GPUs and more





https://developer.nvidia.com/gtc/2020/video/s21588

#### Strongly-coupled visualization of data with ISAAC



#### Visual analytics combined with immersive UI, ML & Feedback



### Next up: Creating task graphs from data dependencies



REsource-based, Declarative task-GRAphs for Parallel, Event-driven Scheduling
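
The idea of deriving a task graph from declared resource accesses can be sketched as follows. Each task states which resources it reads and writes, and an edge is added whenever a later task conflicts with an earlier one. This is plain C++ under simplifying assumptions (string-named resources, submission order defines precedence, redundant transitive edges are kept); `Task`, `conflicts` and `buildEdges` are hypothetical names and not the redGrapes API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A task declares which (hypothetical, string-named) resources it reads and writes.
struct Task
{
    std::string name;
    std::set<std::string> reads;
    std::set<std::string> writes;
};

// True if at least one resource name appears in both sets.
inline bool overlaps(std::set<std::string> const& a, std::set<std::string> const& b)
{
    for (auto const& x : a)
        if (b.count(x))
            return true;
    return false;
}

// Two tasks conflict if one writes a resource the other reads or writes.
inline bool conflicts(Task const& earlier, Task const& later)
{
    return overlaps(earlier.writes, later.reads)
        || overlaps(earlier.writes, later.writes)
        || overlaps(earlier.reads, later.writes);
}

// Derive dependency edges (earlier -> later) from the declarations alone.
inline std::vector<std::pair<std::size_t, std::size_t>>
buildEdges(std::vector<Task> const& tasks)
{
    std::vector<std::pair<std::size_t, std::size_t>> edges;
    for (std::size_t j = 0; j < tasks.size(); ++j)
        for (std::size_t i = 0; i < j; ++i)
            if (conflicts(tasks[i], tasks[j]))
                edges.emplace_back(i, j);
    return edges;
}
```

Tasks without an edge between them, for example two tasks that only read the same field, may run concurrently; the scheduler never needs hand-written synchronization, only the access declarations.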



### Next up: Creating task graphs from data dependencies



### redGrapes – Data flow much more complex than data dependencies



Next up: Parallelism needs performant memory access Low Level Abstraction of Memory Access



```
// user code
struct Pos {
    float x, y;
};
Pos pos[8];
```





memory



4 hierarchical Element Domains
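
The separation between the user's view of the data and its placement in memory can be illustrated with two hand-written layouts for the eight positions from the slide. The sketch below is plain C++ and does not use the LLAMA API; `PosAoS`, `PosSoA` and `shiftAndSum` are hypothetical names. The point is only that the user code stays the same while the layout changes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t numPos = 8;

// Array of structs: x and y of one element are adjacent in memory.
struct PosAoS
{
    struct Pos { float x, y; };
    std::array<Pos, numPos> data{};
    float& x(std::size_t i) { return data[i].x; }
    float& y(std::size_t i) { return data[i].y; }
};

// Struct of arrays: all x values are contiguous, followed by all y values.
struct PosSoA
{
    std::array<float, numPos> xs{};
    std::array<float, numPos> ys{};
    float& x(std::size_t i) { return xs[i]; }
    float& y(std::size_t i) { return ys[i]; }
};

// User code is written once against the accessors and works with either layout.
template <typename Layout>
float shiftAndSum(Layout& pos, float dx)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < numPos; ++i)
    {
        pos.x(i) += dx;
        sum += pos.x(i) + pos.y(i);
    }
    return sum;
}
```

Which layout is faster depends on the hardware and access pattern: the struct-of-arrays form tends to suit SIMD units and coalesced GPU loads, while the array-of-structs form keeps one element's fields in the same cache line.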





### Next up: Parallelism needs performant memory access



Parallel object-like memory allocation & optimized deep copies



### Modularizing code becomes more important



### Exascale programming is not and should not be for everyone

#### **Interactive User Interface**

Python input control, in situ and post-processing, Browser Live-Rendering, Jupyter notebook integration, ...



### **PMacc**

hierarchical domain decomposition, data flow management & events, containers, common algorithms



# When going to Exascale, take baby steps in between Using Summit/ORNL as a testbed








Weak scaling from 27 nodes to 4600 on the #1 HPC system Summit

# When going to Exascale, take baby steps in between



### But think before you simulate




### **When going to Exascale, take baby steps in between** GPUs are pretty cool





| Specs        | Volta    | Ampere    |
|--------------|----------|-----------|
| FP32 cores   | 5120     | 6912      |
| Memory BW    | 900 GB/s | 1555 GB/s |
| VRAM         | 32 GB    | 40 GB     |
| Interconnect | 300 GB/s | 600 GB/s  |


### **Exascale System Readiness**

### ORNL Center for Accelerated Application Readiness (Exascale)



https://www.olcf.ornl.gov/caar/Frontier-CAAR/





### **Exploring Petabytes in real time**

## ISAAC: Visual analytics combined with immersive UI, ML & Feedback

Laser-driven Ion Acceleration with PIConGPU & ISAAC

Make your code Exascale-ready











Abstraction Library for **Pa**rallel **K**ernel **A**cceleration Meet us on Github!





# https://github.com/alpaka-group/alpaka



### www.casus.science











SPONSORED BY THE



STAATSMINISTERIUM FÜR WISSENSCHAFT UND KUNST

