“This can’t be happening.” “This has never happened before.” “During my 30-year IT career, I have never heard of anything like this, let alone been involved in it.”
On Thursday, October 20, 2022, a chain of events started in Saarni Nepton Oy’s data center. This chain of events should not have been possible. A normal, routine, safe, ten-minute maintenance procedure on one part of the data center caused an interruption to our service lasting several days.
Our customers include, for example, public administration, medical and financial organizations. We have tried to broadly identify various threats and error situations, so that we can prepare for them as comprehensively as possible.
We use three parallel, independent backup and recovery systems. The first saves the data every hour, the second and third once a day. In a major incident, we can recover data from an earlier backup.
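As a rough illustration of how a backup schedule like this bounds data loss, the sketch below picks the newest intact backup taken before a failure and reports how much work would be lost. The `Backup` type, the tier labels and the function names are hypothetical and only mirror the hourly and daily intervals described above; this is not our actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Backup:
    tier: str            # hypothetical label, e.g. "hourly" or "daily"
    taken_at: datetime   # when the backup was completed
    intact: bool         # False if the backup is known to be corrupted


def newest_usable_backup(
    backups: Iterable[Backup], failure_time: datetime
) -> Optional[Backup]:
    """Return the most recent intact backup taken at or before the failure."""
    candidates = [b for b in backups if b.intact and b.taken_at <= failure_time]
    return max(candidates, key=lambda b: b.taken_at, default=None)


def data_loss_window(backup: Backup, failure_time: datetime) -> timedelta:
    """Changes made after the backup and before the failure are lost."""
    return failure_time - backup.taken_at
```

With invented timestamps, a failure at 10:30 with an hourly backup from 10:00 and a daily backup from 02:00 would lose 30 minutes of changes if the hourly backup is intact, and 8.5 hours if it is corrupted and the daily backup has to be used instead.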
We regularly practice different emergency situations, starting with the worst possible scenario in which, for one reason or another, “the data center explodes.” The impact of the October event was almost in the same category, and in some areas, it was even worse.
The chain of events started with the replacement of a single non-critical part of the data center.
We chose the storage solution and its supplier from among the best in the industry. Our highly fault-tolerant solution is meant to ensure that data always remains error-free and to give us a 99.999% data availability rate in all situations.
So how did the impossible become possible? How can things go wrong in this way when preparedness is at a good level and recovery from different disaster scenarios has been practiced regularly? We have analyzed what happened with international experts.
According to our analysis, the impossible can become possible if every step of an easy routine process goes fatally wrong. Investigating our chain of events has been compared to the investigation of a plane crash, where no single event causes an accident, but an accident occurs when enough improbable things happen at the same time. The outage would not have occurred if any single event had been different. In our situation, the chain of unlikely events was as follows:
- An extremely unlikely component failure occurs in the storage solution.
- An extremely unlikely component failure causes malfunctions in the storage solution that the manufacturer is not prepared for.
- Data is located across several storage units, which we could call brain lobes. The maintenance technician chose the wrong brain lobe to operate on.
- There is a difference between capital and small letters! The critical command to shut down one brain lobe would have required a small b instead of a big B. At the time of the incident, the manufacturer’s instructions on this were unclear, and as a result the maintenance technician wrote the “wrong command”.
- There is a software bug in the storage system that lets the “wrong command” through. As a result, the storage system performs the function in the wrong brain lobe. The completely wrong brain lobe is now accidentally shut down.
- According to the manufacturer’s best practices, the brain’s “cache” should have been turned off during the maintenance procedure. However, this was not done.
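The capitalization and target-selection steps in the chain above point to one generic safeguard: a maintenance tool can refuse any target name that does not match a known unit exactly, and can require the operator to repeat the target before anything is shut down. The sketch below is a minimal illustration under that assumption. The unit names `a` and `b`, the function name and the confirmation step are hypothetical and do not reflect the manufacturer's actual command syntax.

```python
KNOWN_UNITS = frozenset({"a", "b"})  # hypothetical, case-sensitive unit names


class InvalidTargetError(ValueError):
    """Raised when a shutdown request does not name a known unit exactly."""


def resolve_shutdown_target(requested: str, confirmation: str) -> str:
    """Return the unit to shut down, or raise instead of guessing.

    The match is exact and case-sensitive: "B" is not treated as "b",
    and an unknown name never falls back to some other unit.
    """
    if requested not in KNOWN_UNITS:
        raise InvalidTargetError(
            f"unknown unit {requested!r}; expected one of {sorted(KNOWN_UNITS)}"
        )
    if confirmation != requested:
        raise InvalidTargetError(
            f"confirmation {confirmation!r} does not match target {requested!r}"
        )
    return requested
```

In this sketch, `resolve_shutdown_target("b", "b")` returns `"b"`, while a request for `"B"` raises `InvalidTargetError` rather than being passed on to the storage system.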
Four days of torment
The reason for the outage was shrouded in darkness for many hours. On Thursday evening, we formed an analysis based on the information provided by the manufacturer of the storage solution. The result of the analysis was overwhelming: “Data from all external and internal systems has been lost and primary backups have been corrupted.”
The situation was extremely difficult. Our services were down, and we could not yet reliably estimate when they would be available. Experts around the world worked around the clock to find a way to fix this completely extraordinary situation.
Our own personnel worked non-stop with the manufacturer of the storage solution and with our customers for the next four days. Perhaps at some point we will be able to look back on these events calmly and take them less seriously. For example, how our IT specialist worked in the data center with maintenance technicians on three different shifts and slept for a few hours at the GLO-hotel before returning to the data center. Or how we participated in an 88-hour Zoom meeting with mission-critical support from the storage solution manufacturer to understand what had happened and how we could recover our services.
A spark of hope
The manufacturer of the storage solution started the repair process on Thursday evening, when it still seemed that the destroyed data could be restored from the storage solution or from our primary backups. We were hopeful that we would be able to recover the data that was in our service at the time of the failure, or at worst the data from 23 minutes earlier.
The storage solution manufacturer asked for permission to perform a full data repair. We could receive word from them at any moment that the repair had been successful and that our services could be started. The alternative was to recover services from secondary backups, in which case our customers would lose the changes they had made in the service during the last 17 hours. We hoped for the best.
The spark of hope faded on Friday, when the data repair runs were unexpectedly interrupted and the storage manufacturer asked for permission to do the runs again. We gave this permission. At the same time, we decided that our services should be available to customers on Monday at the latest. Despite trying for several days, the storage manufacturer did not succeed in repairing the data or the primary backups, so we restored our services from secondary backups.
What made the situation difficult was that little up-to-date information was available from the storage solution manufacturer’s team. If the manufacturer had explained the situation, the amount of corrupted data and the improbability of repair more clearly, we would have started recovery from secondary backups earlier.
We did it!
Our recovery process and tools worked exactly as we had planned. We got our external services up and running on the morning of Monday, October 24. During the weekend, we prepared instructions for our customers and partners on how they should handle data entry and automated integrations.
The contribution of our personnel during that unreal weekend, and after it, has been amazing. Our IT experts, our product development team, and our customer service professionals all took control of the situation right away and got to work without thinking about working hours or the time of day. A big thank you belongs to all of them.
The moral of the story
Apparently, if everything goes wrong, the impossible can become possible.
I hope no one is ever trapped in an 88-hour Zoom meeting again. I encourage all organizations to prepare for the unexpected. To prepare for a situation where the decisions to be made are accompanied by great uncertainty. To think in advance when and for how long to cling to the spark of hope, and when to accept the worst outcome and act accordingly.
We have an in-depth ongoing discussion with the manufacturer of the storage solution about their guidelines, best practices and the importance of communication that supports business solutions. Next, we will renew our entire backup & recovery system and investigate the possibility of storing recovery points very frequently.
We learned that recovery from disruptions cannot be just a technical exercise. We developed models for how internal and external communication is carried out in such a large, unexpected disruption. We learned that communication should be more extensive and more deeply integrated into the recovery plan. In particular, we learned that customers understand how exceptional such situations are when the situation and its problems are communicated openly.
Saarni Nepton Oy