CrowdStrike: How a single point of failure brought down key services globally.
By Cian Fitzpatrick | 14th August 2024
The speed at which technology has evolved, and the vastness of its implementation, evoke a sense of end-of-the-world-ness and Armageddon.
Networked systems and the internet are the very foundation of our functioning world right now. Crucial services are dependent on the internet and the cloud, from healthcare to money, from transport to communication. We’ve come to rely on technology so completely, and in such a short time, that there’s a certain uneasiness about it, which was highlighted by the events of 19 July 2024.
We wonder: will modern warfare take the form of a massive cyber-attack that brings society to its knees? That’s impossible, some say. But is it really? The cyber-event of 19 July 2024, in which critical services were brought down completely, reminded us how reliant we are. The impact of the Microsoft/CrowdStrike incident was extensive, expensive and frightening. And while it’s somewhat comforting to know that it wasn’t a malicious attack, it’s also terrifying that one small piece of buggy code was at the heart of it all.
Because of the scale of disruption, from PCs to servers, from cloud-computing infrastructure to core systems, this is the biggest digital disruption in history. So far. We’ve seen malicious code cause chaos around the world before: the SQL Slammer worm, a denial-of-service attack; NotPetya, the Russian malware attack on Ukraine; and WannaCry, North Korea’s ransomware attack.
Ironically, this incident was triggered by software designed to stop such hacking attacks.
Microsoft’s all-encompassing reach into computing infrastructure was at the core of this disaster. The question is: how does one piece of faulty code evolve into a mass event like this?
Speaking of organisational technology systems, Drew Bagley, a CrowdStrike VP, said: “Their IT stack may include just a single provider for operating system, cloud, productivity, email, chat, collaboration, video conferencing, browser, identity, generative AI and increasingly security as well.
“This means that the building materials, the supply chain and even the building inspector are all the same.”
The moral of this story is that no single system – and certainly not all of them – should be susceptible to a single point of failure with a single provider. This was a perfect example of placing all one’s eggs in one basket. Where the basket was also an egg.
This event will have shocked many organisations into reviewing their IT strategy, particularly their cloud architecture. They will be looking for resilient platform providers that do not themselves rely on a single infrastructure.
Distribution across multiple datacentres and datacentre providers will be a key selling point for vendors, particularly those in the security space. For a juggernaut like Microsoft/CrowdStrike to make this error is not just deeply embarrassing: the damage has been immense, and already there are lawsuits and financial claims against CrowdStrike.
With the luxury of hindsight, we know this to have been a rookie error. But Microsoft/CrowdStrike would not have been the only organisations with this lack of resilience in the architecture of their products.
Security products in particular must be robust if we’re to stay on top of cybercrime. Crime in all its flavours has always been, and always will be, a constant of human existence, so as developers we need to stay ahead of the curve.
Email continues to be the primary conduit of data theft and malware, placing it in the limelight of both criminals and cyber experts. Email filtering plays a crucial role in cybersecurity, preventing malevolent content from even entering an organisation.
Topsec offers cloud security and effective email filtering, ensuring continuity even in the event of failure at any given point across cloud providers. Our core infrastructure replicates to independent stacks throughout our environment, ensuring that our mail filtering continues to flow – there is no single point at which a collapse could occur. This means that the remaining stacks and providers can pick up the load and fill in any gaps if one or more providers suffer an outage. Over-provisioning each stack allows additional mail flow to spread across the entire system and to condense into fewer stacks if required.
Round-robin (RR) scheduling is one of the algorithms employed by process and network schedulers in computing, as explained on Wikipedia. As the term is generally used, time slices, also known as time quanta, are assigned to each process in equal portions and in circular order, handling all processes without priority. RR is also free of resource starvation, a “problem encountered in concurrent computing where a process is perpetually denied necessary resources to process its work.”
As an operating system concept, RR is simple, easy to implement and can be applied to other scheduling problems, such as data packet scheduling in computer networks.
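As a rough illustration of the scheduling idea described above, here is a minimal Python sketch of round-robin with a fixed time quantum. The function name, the process names and the burst times are invented for the example; it is a textbook-style simulation, not a description of any production scheduler.

```python
from collections import deque


def round_robin(burst_times, quantum):
    """Simulate round-robin scheduling.

    burst_times: dict mapping a process name to the total time it needs.
    quantum: the equal time slice given to each process per turn.
    Returns a list of (name, completion_time) in the order processes finish.
    """
    queue = deque(burst_times.items())  # processes wait in circular order
    clock = 0
    finished = []
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)  # a process never runs past its quantum
        clock += run
        remaining -= run
        if remaining > 0:
            queue.append((name, remaining))  # back of the queue, no priority
        else:
            finished.append((name, clock))
    return finished


if __name__ == "__main__":
    for name, done_at in round_robin({"A": 5, "B": 3, "C": 1}, quantum=2):
        print(f"{name} finished at t={done_at}")
```

Because every waiting process gets a turn on each pass through the queue, none of them can be denied resources indefinitely, which is the starvation-free property mentioned above.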
Optimising for availability means using the maximum available isolated stacks to handle the overall workload.
Optimising for equal distribution allows for the ideal distribution of the overall workload to ensure that no single point is pushed to maximum capacity, which could lead to failure.
Using multiple providers and spreading the provider base allows for increased redundancy and for incidents to be managed without downtime.
Fast geolocation deployment means that Topsec can quickly and efficiently deploy new stacks when required to locations globally, as customer requirements demand.
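The points above can be sketched in a few lines of Python. This is a simplified illustration only, not Topsec's actual implementation: the `StackPool` class, its method names and the stack labels are all hypothetical. It hands out work to isolated stacks in round-robin order and skips any stack that has been marked as down, so the remaining stacks absorb the load.

```python
from itertools import cycle


class StackPool:
    """Hypothetical pool of isolated stacks served in round-robin order."""

    def __init__(self, stacks):
        self.stacks = list(stacks)
        self.healthy = set(self.stacks)
        self._ring = cycle(self.stacks)  # endless circular iterator

    def mark_down(self, stack):
        """Record that a stack (or its provider) is unavailable."""
        self.healthy.discard(stack)

    def mark_up(self, stack):
        """Return a recovered stack to the rotation."""
        if stack in self.stacks:
            self.healthy.add(stack)

    def next_stack(self):
        """Return the next healthy stack, skipping any that are down."""
        if not self.healthy:
            raise RuntimeError("no healthy stacks available")
        # One full lap of the ring is enough to find a healthy stack.
        for _ in range(len(self.stacks)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy stacks available")


if __name__ == "__main__":
    # Stack labels are made up for the example.
    pool = StackPool(["provider-a/eu", "provider-b/eu", "provider-c/us"])
    print([pool.next_stack() for _ in range(3)])
    pool.mark_down("provider-b/eu")  # simulate a provider outage
    print([pool.next_stack() for _ in range(4)])
```

In a design like this, each stack would need spare capacity (the over-provisioning mentioned earlier) so that the surviving stacks can take on the extra share when one drops out of the rotation.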
At Topsec, we supply a resilient platform that isn’t weakened by a single point of failure. When designing our cloud architecture, the team at Topsec deployed an email filtering service that is distributed across multiple datacentres and datacentre providers.
Contact us for more information about how we can help your organisation mitigate cyber risks.