JUQUEEN Users Reimbursed for Lost Cycles
Severe technical problems caused prolonged outages on our high-performance computing (HPC) systems in December 2016. The problems were related to the General Parallel File System (GPFS) and affected all HPC systems connected to the central GPFS file server. A bug fix delivered in mid-December by the GPFS supplier appeared to have solved the problem, until the file system crashed again during the Christmas holidays.
An error analysis revealed that the failure originated from the JUQUEEN HPC system. For this reason, JUQUEEN was taken out of operation on 24 December, while the other compute systems were restarted in production mode once the GPFS file system had been repaired. Following a comprehensive analysis after Christmas, the problem was eventually rectified on 4 January. To this end, a temporary solution was established on JUQUEEN while a new GPFS software version was rolled out on the GPFS file server and other related systems. A new GPFS version is currently unavailable for JUQUEEN.
In order to alleviate the impact of the extended downtime, especially on JUQUEEN, JSC decided to reimburse JUQUEEN users for lost compute time. Each JUQUEEN project will therefore receive one-twelfth of its originally allocated annual budget on top of its regular allocation. These additional resources will be spread out equally over the months February, March, and April 2017. The reimbursement has been made possible by adopting JUQUEEN resources.
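The arithmetic behind this reimbursement can be illustrated with a minimal sketch. The function name, the unit (core-hours), and the annual budget of 36 million are hypothetical values chosen for illustration; only the one-twelfth share and the equal split over three months come from the announcement.

```python
from fractions import Fraction

def reimbursement_schedule(annual_budget):
    """Split one-twelfth of the annual budget equally over three months.

    `annual_budget` is a hypothetical project allocation; the unit
    (core-hours here) is an assumption, not stated in the announcement.
    """
    months = ("February 2017", "March 2017", "April 2017")
    total_extra = Fraction(annual_budget) / 12   # one-twelfth of the annual budget
    per_month = total_extra / len(months)        # spread equally over three months
    return {month: per_month for month in months}

# Hypothetical project with 36 million core-hours per year.
schedule = reimbursement_schedule(36_000_000)
for month, extra in schedule.items():
    print(f"{month}: {float(extra):,.0f} extra core-hours")
```

Under these assumptions, each of the three months receives one thirty-sixth of the annual budget in addition to the regular allocation.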
We apologise for the significant inconvenience caused by the crash and the subsequent downtime, in particular on JUQUEEN. All HPC systems are back in production, and we are confident that the source of the problems has been identified and fixed.
(Contact: Ulrich Detert, u.detert@fz-juelich.de)
JSC News No. 247, March 2017