FAA Outage and the Importance of Software Testing
On Saturday, the Federal Aviation Administration (FAA) experienced an outage that caused over 450 cancellations up and down the East Coast. Early speculation points to two possible causes: a recent software update that crashed the system, or an automation failure.
In either case, the outage highlights the need for a disaster recovery plan that quickly restores the production environment when something goes wrong. Studies show that human error, power outages, and software updates are more likely causes of data center disruptions than natural disasters and weather-related incidents.
Disaster preparedness combines many components. For a system as critical as the one the FAA was upgrading, it is essential to formally test the upgrade in a test environment and to test the rollback plan just as thoroughly.
Software as critical as what the FAA was rolling out demands an additional level of effort and an architecture designed to prevent surprises in production. The rollout has to be systematic and should use at least three separate parallel environments:
Sandbox Testing: This is the first and most open testing environment. The goal is to get the software installed and functioning. It is not intended for interoperability testing with other systems.
Quality and Interoperability Testing: This parallel environment is maintained as an exact duplicate of the production environment. The only difference is that production end users aren’t adding data to its databases. It follows the processes and practices of the production environment, including the change control process. Software interoperability is tested here for an extended period of time along with other production systems. This is where you discover issues with updates or automation steps that fail. It is also where you test the rollback with Zerto Virtual Replication (ZVR) in case something goes badly wrong.
Production Environment: After the software is thoroughly tested in the first two environments, the production rollout should follow a more predictable pattern that emulates the rollout in the Quality and Interoperability Testing lab.
During rollout planning, you define the conditions that trigger a rollback. If something doesn’t work as expected, or doesn’t match what you saw in the Quality and Interoperability Testing lab, it should trigger a rollback. This is not the time to be a cowboy or a hero.
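As a rough illustration of defining rollback triggers ahead of time, the conditions can be written down as explicit, named checks that are evaluated against what you observe during the rollout. This is a minimal sketch, not a prescribed method: the metric names (error_rate, p95_latency_ms, baseline_p95_latency_ms, failed_automation_steps) and the thresholds are hypothetical, and the baseline values are assumed to come from what you measured in the Quality and Interoperability Testing lab.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class RollbackTrigger:
    """A named condition that, when tripped, means the rollout must be rolled back."""
    name: str
    is_tripped: Callable[[Metrics], bool]


# Hypothetical triggers and thresholds, agreed on during rollout planning.
TRIGGERS: List[RollbackTrigger] = [
    RollbackTrigger(
        "error rate above agreed limit",
        lambda m: m["error_rate"] > 0.02,
    ),
    RollbackTrigger(
        "p95 latency worse than test-lab baseline",
        lambda m: m["p95_latency_ms"] > 1.5 * m["baseline_p95_latency_ms"],
    ),
    RollbackTrigger(
        "automation step failed",
        lambda m: m["failed_automation_steps"] > 0,
    ),
]


def tripped_triggers(metrics: Metrics, triggers: List[RollbackTrigger] = TRIGGERS) -> List[str]:
    """Return the names of every trigger whose condition holds for these metrics."""
    return [t.name for t in triggers if t.is_tripped(metrics)]


def should_roll_back(metrics: Metrics) -> bool:
    """Any single tripped trigger is enough to roll back."""
    return bool(tripped_triggers(metrics))
```

Writing the triggers down before the rollout means the decision to roll back is made by the plan rather than by whoever is on call in the moment.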
Whether it was a system crash from a software upgrade or an automation failure, disasters do happen in the production environment, and they are most often caused by human error. Being prepared for them is as important as the benefits you hope to gain from an upgrade. An outdated but functioning system is better than a production outage. Integrating ZVR into your software rollout testing is a great way to improve the quality of your software rollouts. With our hypervisor-based replication, users are able to recover to a specific point in time before the system was corrupted (or, possibly in this case, before the software was updated) and have their data center back up and running within aggressive RPO and RTO targets.
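To make the idea of recovering to a point in time before the corruption concrete, here is a minimal sketch of the selection logic only. It does not use or represent the ZVR API: the function names and the plain list of checkpoint timestamps are hypothetical, chosen just to show how the recovery point and the resulting data-loss window relate to the moment the bad update began.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def latest_checkpoint_before(
    checkpoints: Sequence[datetime], incident_start: datetime
) -> Optional[datetime]:
    """Pick the most recent checkpoint taken strictly before the incident began.

    Returns None when no checkpoint predates the incident.
    """
    candidates = [c for c in checkpoints if c < incident_start]
    return max(candidates, default=None)


def data_loss_window(checkpoint: datetime, incident_start: datetime) -> timedelta:
    """Time between the chosen recovery point and the incident.

    Writes made in this window are lost on recovery, so this is the
    recovery point actually achieved, to compare against the RPO target.
    """
    return incident_start - checkpoint
```

The closer together the checkpoints are, the smaller this window becomes, which is what an aggressive RPO target asks for.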