Last week’s ‘#internetshutdown’, caused by an outage at content delivery network Fastly, demonstrated the importance of planning for failure, thinking about application reliability from a top-down perspective and setting a resilience strategy to combat fragile yet complex IT estates.
Thousands of sites were affected by the Fastly outage, including Amazon, Netflix, the BBC, the Guardian and Spotify. Whilst not being able to access your favourite show, news source or album for just shy of an hour might have been a mild inconvenience, more worrying was that the UK Government site gov.uk was also out of action.
Ultimately, the issue was resolved within 45 minutes, demonstrating the importance of observation and a short recovery time objective (RTO).
But if that fix hadn’t been identified so quickly, it could have caused significant issues for a lot more people – the vast majority of whom now rely on being able to access Government services online and on-demand.
It is also interesting to note this event happened in the same week that Ofcom revealed that the pandemic drove us to spend more time than ever online.
New dependencies, new vulnerabilities
The Fastly network outage revealed the new dependencies and vulnerabilities that are emerging from the complexity of modern technology landscapes. Yet, while individual organisations have more complex tech stacks than ever, the vendor landscape is becoming more homogeneous, meaning outages have the potential to impact end-users even more.
When time is money, limiting the reputational or technological damage that such an outage might cause, however infrequent it might be, is vital. Those organisations impacted by last week’s events are likely now weighing up any fallout and how they can limit the impact of such an event happening again, particularly if they are part of a regulated industry.
Interoperability and regulating resilience
The drive towards regulating resilience in industries such as financial services seeks to prevent precisely this issue. If several large banks are using the same third-party provider of a service, and that provider fails, then what?
Fortunately, in the Fastly case, a fix was made within an hour. Had it instead taken days to resolve, it could have had a serious economic impact across the global financial markets, leading regulators to question why backup options were not in place to protect organisations and their customers in such an eventuality.
We are already starting to see the regulation of resilience in the industry. The Financial Conduct Authority (FCA) recently published its final guidance on operational resilience in the financial services sector, which comes into force in March next year and aligns with the EU’s Digital Operational Resilience Act (DORA).
The two pieces of guidance have much in common. In particular, to identify any vulnerabilities in their operational resilience, firms are expected to have:
- Identified their important business services
- Set impact tolerances for the maximum tolerable disruption, and
- Carried out mapping and testing to a level of sophistication necessary to do so
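As a rough illustration of how these three steps fit together, the sketch below maps a handful of business services to the third-party providers they depend on, gives each an impact tolerance in minutes, and tests a simulated provider outage against those tolerances. The service names, provider names and tolerance values are hypothetical and chosen purely for illustration; they are not taken from the FCA guidance or from DORA.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class BusinessService:
    name: str
    impact_tolerance_minutes: int  # maximum tolerable disruption (assumed value)
    dependencies: FrozenSet[str]   # third-party providers the service relies on


def breached_services(
    services: List[BusinessService], failed_provider: str, outage_minutes: int
) -> List[str]:
    """Names of services that depend on the failed provider and whose
    impact tolerance is exceeded by an outage of the given length."""
    return [
        s.name
        for s in services
        if failed_provider in s.dependencies
        and outage_minutes > s.impact_tolerance_minutes
    ]


# Hypothetical mapping of important business services to their dependencies.
services = [
    BusinessService("online-payments", 30, frozenset({"cdn-a", "payments-gateway"})),
    BusinessService("marketing-site", 240, frozenset({"cdn-a"})),
    BusinessService("branch-ledger", 60, frozenset({"core-banking"})),
]

# Simulate a 45-minute outage at the shared CDN provider.
print(breached_services(services, "cdn-a", 45))  # ['online-payments']
```

Even a toy model like this makes the point of the mapping exercise visible: the same 45-minute outage is tolerable for one service and a breach for another, depending on what each relies on.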
Towards continuous resilience
These steps will allow organisations to think more holistically when it comes to considering the resilience of their systems. Certainly, we find that more complex infrastructures breed fragility, so for systems to be resilient they need, by definition, to become more elastic.
One way to achieve this elasticity is through orchestration. This approach cuts through the complexity of the landscape instead of adding to it.
Gartner calls this category of tooling the ‘Digital Platform Conductor’ – a new breed of tool that provides technology leaders with visibility of the hybrid digital infrastructure they have to ensure it delivers value.
Only by having a complete overview of a system can you accurately manage its performance and identify any weaknesses which may mean that the system needs to be shifted to rely on another architecture.
Given the uncertainty in which we have all lived for the last 18 months, businesses have generally become more adept at being prepared for the unexpected. However, this also means that they need to ensure their infrastructure is just as prepared should an outage occur.
Having complete oversight of a system will allow businesses to be equipped with the ability to spot an anomaly and take the appropriate steps to maintain uptime.
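To make the idea of spotting an anomaly and acting on it a little more concrete, here is a minimal sketch. It assumes a simple rule (flag a latency reading more than three standard deviations above the recent mean) and two hypothetical provider names, `primary-cdn` and `backup-cdn`. Real monitoring and orchestration tooling is considerably more sophisticated, so treat this as an illustration of the principle rather than a recommended implementation.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits more than `threshold` standard deviations
    above the mean of the recent history (an assumed, deliberately simple rule)."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest > mu
    return (latest - mu) / sigma > threshold


def choose_provider(
    history: Sequence[float],
    latest: float,
    primary: str = "primary-cdn",  # hypothetical provider name
    backup: str = "backup-cdn",    # hypothetical provider name
) -> str:
    """Route to the backup provider when the latest reading looks anomalous."""
    return backup if is_anomalous(history, latest) else primary


recent_latency_ms = [100, 102, 98, 101, 99]
print(choose_provider(recent_latency_ms, 103))  # primary-cdn
print(choose_provider(recent_latency_ms, 900))  # backup-cdn
```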
Join the Debate | Cloud First Summit
Cloudsoft is a sponsor at the upcoming Cloud First Virtual Summit on 23 June.
The conference will bring together senior technologists, Cloud architects and business transformation specialists to explore current trends, new advancements and best practice in Cloud computing.
Register your free place now at: https://www.cloudfirstsummit.com/