Report on causes and impacts of IT and datacentre outages

According to a recent Uptime Institute report, while high availability and resiliency (outage prevention and effective recovery) are a priority for all involved in the digital infrastructure supply chain, progress is gradual, hard-won and, when failures occur, increasingly expensive.

The report summarises that a broad shift to distributed architectures, in which more IT functions run on standard IT systems, often distributed or replicated across many sites, reduces the impact of some localised failures. But this may also cause, at least during an extended transition, more network, software or systems issues.

It also notes that the transition to renewable energy and distributed energy generation and storage may reduce the reliability of the grid. While grid failures are not considered a primary source of outages, they will put stress on datacentre power systems and management processes.

Notably, the report emphasised that experienced and well-trained staff, who follow proven management processes, are critical to achieving resiliency. However, a skills shortage in many geographies makes it hard to find enough experienced staff.

What should be of interest to BICSI members is that the report reflects the market perception that poor connectivity still contributes significantly to IT and datacentre outages.

While cyberattacks have become a growing cause of outages over the past few years, the report noted that they accounted for only 11% of publicly reported/recorded outages in 2022, up from 8% in 2021. The result of a cyberattack, however, is often a lengthy shutdown of large parts of an organisation’s digital infrastructure. Because of contamination and loss of integrity, organisations often need to rebuild systems and databases; data loss is common.

What should be of significant concern to us is that the highest contributor to outages in the report is connectivity at 27%, made up of “Fibre” at 18% and “Network (cabling)” at 9%. The report noted that “over [the past] seven years, third-party commercial operators of IT and/or datacentres (cloud/internet giant, digital services, telecommunications, etc.) accounted for 66% of public outages tracked since 2016, creeping up year-by-year — in 2021 the combined proportion of outages caused by these commercial operators was 70%, and in 2022 it was 81%.

“Of these, outages of telecommunications services show the largest increase, while reported cloud-service outages fell. This is partly caused by the high dependency of almost all services on telecommunications — but it is also because telecommunications services have moved from expensive, proprietary systems to more commodity-based components and architectures.”

The report is a good read for all ICT infrastructure professionals and can be downloaded for free from https://uptimeinstitute.com/resources/research-and-reports/annual-outage-analysis-2023

Note some comments in the introductory synopsis:

  • When outages do occur, they are becoming more expensive, a trend that is likely to continue as dependency on digital services increases. With more than two-thirds of all outages costing more than US$100,000, the business case for investing more in resiliency — and training — is becoming stronger.
  • Human error and management failures contribute to a considerable number of outages. More training and investment in management processes is required.