High Profile Government Outage Raises Complex Questions For All of Us
Following a series of very high-profile downtime incidents at the Australian Taxation Office (ATO), the Australian Senate has requested that the ATO provide a full list of every outage over the last 18 months. The ATO rightly pointed out that a request for outages is not as straightforward as it sounds, stating: “It’s important to note we don’t have one system – we have multiple systems, and often times it is a single system that gets impacted, not all of them.”
This raises a valid point about what counts as a system outage. If some systems are running and others are not, does that constitute an outage? If an application slows down but is still available, how do we classify that? And if some users can access your system but others cannot, do we report the system as “down”?
Factoring in uptime service-level agreements (SLAs) makes this even more complex. If a brief outage occurs but the SLA is not breached, how should it be classified? In the case of the Australian Parliament’s request to the ATO, perhaps those requesting the outage logs don’t fully appreciate the complexity. Systems can be down while the services they support keep running and, ironically, systems can be functioning while services are down.
These questions are important for all of us to answer, and the answers may change how we approach IT resilience.
Zerto has evolved application failover from a “fingers crossed” experience to a standard part of operations. This allows users to redefine what an outage might be. Outage definitions are inextricably linked to uptime SLAs. In this respect, they become absolutely critical when you are consuming IT or services from a provider.
Public Cloud and Uptime SLAs
If you are using public cloud to serve applications to users or customers, you need a deep understanding of how your provider defines an outage. Otherwise, there is no way you can confidently offer corresponding SLAs to your own users.
At the highest level, the typical SLA that we see from the major public cloud providers is 99.9 per cent uptime. This sounds great at face value, but it actually allows for roughly 43 minutes of downtime per month. Of course, this isn’t guaranteed to happen, but it does mean that it can occur without your public cloud provider breaching their SLA.
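The arithmetic behind that monthly downtime figure can be sketched as follows. This is a minimal illustration rather than any provider’s official calculation: the function name and the 30-day month are assumptions, and real SLAs define the measurement period in their own terms. Under these assumptions, 99.9 per cent works out to a little over 43 minutes.

```python
def downtime_budget_minutes(uptime_percent: float, days: float = 30) -> float:
    """Minutes of downtime a period can absorb without breaching the uptime target.

    Assumes a simple calendar-based period (30 days by default); an actual SLA
    may define the measurement window differently.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# Compare a few common uptime targets over an assumed 30-day month.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target} per cent uptime allows {budget:.1f} minutes of downtime")
```

Each additional “nine” cuts the allowance by a factor of ten, which is why the exact target matters when you pass an SLA on to your own users.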
Public cloud uptime SLAs are also not completely straightforward. Typically, the SLA does not apply to the whole environment; rather, it is applied component by component. This means that each application has a separate 99.9 per cent uptime SLA. Over one month, three applications could each go down for 40 minutes at completely separate times. There would be two hours of partial downtime, but your cloud service provider would still not have breached their SLA to you. The impact of an SLA applied by application rather than across your total infrastructure is a very important distinction to understand.
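The three-application scenario can be checked with a short sketch. The application names, the 30-day month and the assumption that the outages never overlap are all illustrative; they are not taken from any specific provider’s terms.

```python
MONTH_MINUTES = 30 * 24 * 60   # assumed 30-day measurement period
SLA_UPTIME_PERCENT = 99.9      # per-component target discussed in the article


def uptime_percent(downtime_minutes: float, period_minutes: float = MONTH_MINUTES) -> float:
    """Achieved uptime, as a percentage of the measurement period."""
    return 100 * (1 - downtime_minutes / period_minutes)


# Hypothetical outage log: minutes of downtime per application this month.
# The outages are assumed to happen at completely separate times.
outages = {"app-a": 40, "app-b": 40, "app-c": 40}

# Component-by-component view: each application is judged on its own.
breached = {name: uptime_percent(mins) < SLA_UPTIME_PERCENT for name, mins in outages.items()}

# Whole-environment view: any application being down counts as downtime.
total_partial_downtime = sum(outages.values())
environment_breached = uptime_percent(total_partial_downtime) < SLA_UPTIME_PERCENT

print("Per-application breaches:", breached)
print("Total partial downtime (minutes):", total_partial_downtime)
print("Breach if measured across the whole environment:", environment_breached)
```

Under these assumptions no individual application breaches its SLA, yet the same 120 minutes would fall short of 99.9 per cent if uptime were measured across the environment as a whole.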
A “component by component” SLA can also affect you when your application fails but the storage used by that application does not. In that case, the storage uptime SLA will not be deemed to be broken, which means that you will pay full service fees for storage even though the application linked to it was not active.
It’s also important to understand the SLA penalties that your public cloud provider is subject to in the event of a breach. The typical self-imposed penalty that we see cloud providers offer for breached uptime SLAs is a 25 per cent credit against your monthly service fee. In other words, the only compensation from a public cloud provider for downtime is a reduction in the fee you pay.
To understand whether this makes financial sense for your business, you must compare the cost of downtime with the cloud provider’s rebate. For business-critical workloads, public cloud uptime SLAs are unlikely to cut it.
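That comparison can be sketched in a few lines. Every figure below (the monthly fee, the cost of downtime per minute and the outage length) is a hypothetical placeholder to be replaced with your own numbers; the 25 per cent credit is the typical figure mentioned above, not a quote from any particular contract.

```python
def sla_credit(monthly_fee: float, credit_percent: float = 25.0) -> float:
    """Service credit received when the provider breaches its uptime SLA."""
    return monthly_fee * credit_percent / 100


def downtime_cost(outage_minutes: float, cost_per_minute: float) -> float:
    """Estimated business cost of an outage, using a flat per-minute rate."""
    return outage_minutes * cost_per_minute


# Hypothetical inputs: substitute figures from your own business.
monthly_fee = 10_000.0     # what you pay the provider each month
cost_per_minute = 500.0    # what a minute of downtime costs your business
outage_minutes = 60        # length of an SLA-breaching outage

credit = sla_credit(monthly_fee)
loss = downtime_cost(outage_minutes, cost_per_minute)
shortfall = loss - credit

print(f"Service credit: {credit:,.2f}")
print(f"Cost of downtime: {loss:,.2f}")
print(f"Uncovered loss: {shortfall:,.2f}")
```

A flat per-minute cost is a simplification. The point of the exercise is only to see how far the credit is from covering the loss for the workloads you care about most.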
The Devil is in the Detail
For your workload to be covered by your public cloud provider’s uptime SLA, it has to meet certain criteria. Due diligence is needed to be sure that every workload you move to public cloud is covered by the generic SLAs the provider offers; you cannot take it for granted that it will be.
It is also difficult to pinpoint what responsibility, SLA or penalty public cloud providers accept for total data loss. The following is taken directly from the terms and conditions of one of the major public cloud providers:
“You are responsible for properly configuring and using the service offerings and otherwise taking appropriate action to secure, protect and backup your accounts and your content in a manner that will provide appropriate security and protection.”
It is difficult to interpret this in any other way than that, if data is lost, it is your responsibility. The best thing to do is to clarify this with your cloud provider.
The deeper analysis of “all outage events” for the ATO will no doubt uncover many other outages that did not have such an obvious impact on public-facing systems. The question that will need to be asked is whether these are significant and whether they need to be classified as actual “outage events”.
Outside of the ATO, these same questions need to be answered by CIOs and their teams across all industries. Holding service providers to uptime SLAs will need to be linked to how we define what an outage is. Even more important is understanding each of your service and cloud providers’ definitions, so that you can be sure you choose providers that meet the finer points of your own uptime requirements.
Andrew Martin, Vice President APJ, Zerto
Andrew joined Zerto in January 2015 and is responsible for spearheading the growth of Zerto’s business in Asia. Prior to Zerto, Andrew was Vice President of Tandberg Data Asia Pacific, Japan and the Middle East, where he was responsible for developing the channel and OEM business in the region and managing all aspects of operations, administration, sales and marketing.