Tier 3 data centres have become commonplace in our language, promising (according to the Uptime Institute standards) 99.982% uptime, or around 1.6 hours of annual downtime. However, any unplanned outage causes a much bigger impact than 1.6 hours, as recovery and switching of systems can take hours, if not days, longer. Recent issues at global airports have demonstrated that manual intervention and switching can cause bigger problems still.
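For context, that headline figure is simply the availability arithmetic: 8,760 hours in a year × (1 − 0.99982) ≈ 1.6 hours of permitted downtime.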
But are we paying over the market rate for availability, or are we being lulled into a false sense of security by terms such as Tier 3 and N+2?
As we have seen, fires, floods and other common-mode failures such as extreme weather events are becoming more frequent, arguably driven by global warming. Although one large-scale data centre may be Tier 3 or even Tier 4, these acts of nature remain a risk. Yet when we look at this kind of high-end resilience, the common approach is to build resilient power paths and resilient comms, then double all of the common infrastructure, and then deploy additional spares on top.
With microedge distributed technology, there is natural, in-built resilience.
By taking (say) 5 MW of compute power and distributing it across 100 sites around the UK, you naturally spread the risk of a single point of failure in the power supply, along with other common-mode failures. But it also makes great financial sense.
For example, if you have 5 MW of capacity with full resilience, you have invested in 10 MW of infrastructure (cables, transformers and so on). Of course, as you get down to smaller components such as generators, you can subdivide the “wasted” cost: you might buy 11 generators at 0.5 MW each, leaving you with 0.5 MW of spare capacity (or wasted cost). As we get into smaller components such as servers, the units of spare capacity get smaller still, but of course we still need a second site for complete disaster recovery. Many organisations have leaned towards two Tier 2 data centres with something like 50% spare capacity in each (the risk here is that if one site fails completely, the other becomes overloaded at peak times).

Moving to a hyper-distributed model, we introduce the concept of “swarms”, giving multiple instances of microedge compute resource. This means that failure of any one site may impact only 10% of your compute resource.
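To put rough numbers on this, here is a minimal back-of-envelope sketch in Python using the figures above (5 MW of demand, 0.5 MW generators, a 100-site estate). The five-site spare pool and the ten-site “swarm” are illustrative assumptions for the sake of the arithmetic, not a costed design.

```python
# Back-of-envelope sketch of spare ("wasted") capacity and failure impact.
# Figures follow the text: 5 MW of required compute, 0.5 MW generators,
# 100 micro sites. The 5-site spare pool and 10-site "swarm" are assumptions.

REQUIRED_MW = 5.0

# Traditional fully resilient build: everything duplicated (2N).
traditional_installed_mw = 2 * REQUIRED_MW                 # 10 MW of kit
traditional_spare_mw = traditional_installed_mw - REQUIRED_MW

# N+1 generators on a single site: 11 x 0.5 MW units for a 5 MW load.
generator_mw = 0.5
generators = int(REQUIRED_MW / generator_mw) + 1           # 11 units
generator_spare_mw = generators * generator_mw - REQUIRED_MW

# Hypothetical distributed estate: 100 micro sites of 50 kW each,
# with (say) 5 extra sites acting as the shared spare pool.
sites = 100
site_mw = REQUIRED_MW / sites
spare_sites = 5
distributed_spare_mw = spare_sites * site_mw

print(f"2N duplication spare capacity:   {traditional_spare_mw:.2f} MW")
print(f"N+1 generators spare capacity:   {generator_spare_mw:.2f} MW")
print(f"Distributed estate spare pool:   {distributed_spare_mw:.2f} MW")

# Fraction of total compute lost when one element fails outright.
swarm_sites = 10   # assume a "swarm" groups ten micro sites
print(f"Lose one monolithic data centre: {REQUIRED_MW / REQUIRED_MW:.0%}")
print(f"Lose one micro site:             {site_mw / REQUIRED_MW:.0%}")
print(f"Lose a whole ten-site swarm:     {swarm_sites * site_mw / REQUIRED_MW:.0%}")
```

Run as-is, this prints 5 MW of duplicated kit for the 2N build, 0.5 MW for the N+1 generator set and 0.25 MW for the distributed spare pool, and shows a single micro-site failure costing roughly 1% of capacity (around 10% if an entire swarm is lost), which is the same point the paragraph above makes in prose.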