With over 25,000 cores and hundreds of thousands of individual components from four different hardware vendors (Intel, Mellanox, Bull, Sun), a system of this size and complexity is very likely to have failing components every day. Since repair actions take time and cannot always be carried out in production mode, a maintenance period of 9 hours every two weeks is currently absolutely necessary to avoid further problems and crashes. (Experience shows that repairing faulty components during production mode often disturbs the system, up to and including a complete crash, and should be avoided.)
These 9 hours are sometimes barely sufficient, because shutting down and bringing up such a complex system takes hours, especially when the necessary check-outs are run to make sure a stable system is brought up again. If unforeseen errors are detected during these check-outs, it is not unlikely that even the 9-hour period has to be extended (as has been the case on several occasions already).
It should also be noted that a supercomputer complex like JUROPA/HPC-FF is not expected to achieve an availability higher than 80-90% for general use (the lower value especially in its first year). JUROPA/HPC-FF is definitely within this range.
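For perspective, the scheduled maintenance alone already puts an upper bound on availability. The sketch below is a back-of-the-envelope estimate based only on the figures above (one 9-hour slot per two-week cycle), not an official accounting; unplanned outages, extended maintenance, and check-out overruns are what bring the figure down toward the 80-90% range.

```python
# Illustrative arithmetic (not official figures): availability lost to the
# regular biweekly 9-hour maintenance slot alone, ignoring unplanned outages.

HOURS_PER_FORTNIGHT = 14 * 24   # 336 hours in a two-week cycle
MAINTENANCE_HOURS = 9           # one scheduled 9-hour slot per fortnight

scheduled_downtime = MAINTENANCE_HOURS / HOURS_PER_FORTNIGHT
availability_ceiling = 1.0 - scheduled_downtime

print(f"Scheduled downtime:   {scheduled_downtime:.1%}")    # ~2.7%
print(f"Availability ceiling: {availability_ceiling:.1%}")  # ~97.3%
```

The gap between this ~97% ceiling and the observed 80-90% is accounted for by the unplanned failures and extended maintenance periods described above.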
The time slot for preventive and corrective system maintenance is always scheduled on Thursdays from 8:00 to 17:00.
Please check the system high-messages for more detailed information.
Users can also receive recent status updates by email by subscribing to the system high-messages, as described at the bottom of the JUROPA/HPC-FF Highmessages page.
We know that Thursday is always a busy day of the week for users, but we are constrained by having to agree on the agenda of the maintenance day in advance with several companies. This requires exact planning of the availability of new hardware/software components, as well as of the support staff available on that specific day. We do this planning from Monday to Wednesday.
(By the way: the negotiated service contracts only allow us to carry out
- the planning and
- the execution
of the maintenance during the daytime of normal working days.)
Filesystem Situation (Lustre):
After a lot of initial trouble with Lustre, we believe the filesystem software has been rather stable since November 2009. The problems we saw, especially in January 2010 and also later, were all related to various hardware problems with the disk server storage or to sporadic InfiniBand errors.
An inherent characteristic of this parallel filesystem is that, as a single point of failure, it is highly vulnerable to even small problems in any of the hardware components involved (InfiniBand, disk servers, disks, cables, etc.). In many cases the promised failover to redundant components does not yet work as advertised.