Delta Down for the Count
In case you haven’t seen the news – or heard from friends and family complaining on Facebook – Delta passengers experienced a second day of flight delays and cancellations today due to a power outage in Atlanta yesterday.
Hundreds of flights were cancelled and problems were compounded by Delta’s inability to get their notification systems up and running – many passengers were not updated about their cancelled and delayed flights. According to yesterday’s Wall Street Journal on the Delta outage:
“An electric problem at its Atlanta headquarters occurred at 2:30 a.m. ET and the airline was forced to hold hundreds of departing planes on the ground starting at 5 a.m., according to Ed Bastian, the chief executive, who apologized to customers on a video…”
And the kicker,
“The technical problems likely will cost Delta millions of dollars in lost revenue and damage its hard-won reputation as the most reliable of the major U.S.-based international carriers, having canceled just a handful of flights in the most recent quarter.”
Massive revenue loss and damage to the brand in just two days. The question everyone is asking is – how is this possible? How can companies not have the basic business continuity and disaster recovery systems in place to handle a power outage? This seems like IT 101.
“The meltdown highlights the vulnerability in Delta’s computer system, and raises questions about whether a recent wave of four U.S. airline mergers that created four large carriers controlling 85% of domestic capacity has built companies too large and too reliant on IT systems that date from the 1990s. Delta merged with Northwest Airlines eight years ago.
These systems—which run everything from flight dispatching to crew scheduling, passenger check-in, airport-departure information displays, ticket sales and frequent-flier programs—gradually have been updated but are still vulnerable, IT experts said.”
And don’t get us wrong – we are not here to gloat. We have seen this complexity first hand from many of our customers. Legacy systems, lack of hardware interoperability, new software versions running alongside older versions. The bigger the company, the greater the complexity. Short of revamping everything, which would be prohibitively expensive, there are a few simple steps companies can take to make sure their BC/DR systems will work when needed.
- Find a BC/DR solution with complete hardware agnosticity. Make sure you can migrate and protect any applications or workloads across hardware brands, versions and technologies. This ensures that even after a complex corporate merger, which inevitably introduces new and different IT standards, hardware, etc. into the datacenter, the recovery systems will work.
- Recovery times must be low. An outage that took out Delta Airlines for 10 minutes would not have delayed thousands of people. In most cases, all would have gone back to usual. Companies need to have the lowest possible RPOs (Recovery Point Objectives) or the amount of data lost in a disaster – which is measured in seconds for the best recovery products on the market. And they need RTOs (Recovery Time Objectives) to be just minutes. Recovery time from the point of a failover should be under 20 minutes, as it was for this company we work with.
- TEST! We ask every prospective customer, how often do you test your recovery systems? Most are afraid to test on a regular basis. They simply don’t have the confidence a failover test will work. Our customers run monthly tests, and can run weekly tests if they choose. It’s simple to do and has no impact on their production environments.
Our take: companies need to stop thinking about disaster recovery as an insurance policy, and shift their thinking to always-available IT, something we refer to as ‘IT Resilience’. Organizations that embrace resilience, a more proactive approach to BC/DR, focus on continuous availability rather than recovery after the fact. Their confidence in the resilience of their infrastructure comes from the knowledge their business is “ready for anything” and that they can scale effectively, incorporate new technology easily, and above all, avoid disasters.