Lessons Learned from $DATA Incident in January 2021
On 26 January 2021, an improbable sequence of unrelated hardware failures, combined with a firmware bug, led to partial data loss on the $DATA file system at JSC.
After the system was brought into maintenance, a task force from JSC, together with the system vendor, the software provider, and the disk and RAID manufacturers, worked jointly to restore and recover as much data as possible. A file system, /p/largedata_restore/, was temporarily introduced to provide access to data recovered from an unofficial backup that was created in January.
To avoid such a situation in the future, several new measures have been implemented. Until now, $DATA was not backed up, since a full restore of this multi-petabyte file system would take months to finish. $DATA will be split up into several smaller chunks, which makes it possible to implement a backup strategy. As an interim solution, JSC is performing a backup of the existing $DATA despite the well-known restore challenge.
Further information, together with a detailed timeline, can be found in the JUST system documentation on the $DATA incident.
JSC is very sorry for this unfortunate situation and apologizes for the inconvenience to those affected. The Data Services Support Team is happy to help if you have questions regarding the whereabouts of your data and can suggest additional steps to support your recovery.
Contact: Data Services Support, firstname.lastname@example.org
from JSC News No. 280, 26 April 2021