Cracking The Code Ep. 23 – CrowdStrike: The Largest IT Outage in History

On the latest episode of Cracking the Code, ClearDATA Founder and Chief Information Security Officer Chris Bowen discusses the recent CrowdStrike IT outage and its implications for the healthcare industry.

What does the recent outage mean for business resiliency?

Watch and find out.

Don’t forget to schedule your Cloud Risk Checkup, powered by our cloud security posture management (CSPM) software, the CyberHealth™ Platform, and our team of highly trained experts.

Want to get in touch right away? Call (833) 992-5327.

Transcript

0:06
Hello everybody, and welcome to Cracking the Code episode #23, where I am joined by Chris Bowen, ClearDATA’s Founder and Chief Information Security Officer.

0:16
We are going to discuss the recent IT outage as it relates to business resilience.

0:21
So hi Chris, how are you today?

0:23
Doing great.

0:24
Thanks.

0:25
All right, thanks for joining us again for another episode where we’re going to pick your brain.

0:30
All right, we can go ahead and dive right in.

0:32
So we wanted to talk about the recent global IT outage that is still impacting many systems: airlines, and especially healthcare.

0:40
So let’s dive right into question number one.

0:43
Can you give us an overview of what happened and how ClearDATA responded to it?

0:48
Sure.

0:49
I watched the CNBC press conference, with the CEO talking almost in real time about what happened.

0:58
It was refreshing to hear the CEO of CrowdStrike talk about the fact that it was not a hack.

1:07
It was not a cyber event from a threat perspective, but rather a really bad misconfiguration.

1:15
And what happened there, according to the CEO, was they put out a content file, and this content file had privileges within the kernel.

1:28
The kernel is a highly privileged layer of technology that can essentially command certain things within the hardware and software, so there’s just a lot of power in that kernel.

1:42
And what happened was, as all of you probably know, the logic started going into a recurrent loop and machines started exhibiting the blue screen of death.

1:57
Now, I’m a Mac guy, so I don’t necessarily know what that means anymore after 10 years of being on a Mac.

2:03
But nearly every Windows machine that’s out there had some kind of an impact.

2:14
So you’re right, Natalie, we had airplanes staying on the ground.

2:19
I was driving, and you know those big digital billboards? They had that blue screen of death right there on the freeway.

2:32
You know, it was just mayhem.

2:33
And you wouldn’t think that’s a big deal.

2:36
But, you know, the fact is CrowdStrike did put out a fix in about 30 minutes, which was remarkably fast.

2:47
The challenge there was that even though the fix had been put out, the damage had been done, and millions of Windows systems around the world essentially cratered.

3:03
And so in terms of ClearDATA, we put up a quick service desk that focused on this.

3:11
We started that process Thursday night, and we manned it all night, the next day, and through the weekend.

3:20
And of course then we reached out to our customers that may have been impacted as well.

3:25
Now, what we did learn was that none of our customers were impacted through the infrastructure that we have in place for them, the management plane, and all those other systems.

3:37
But some of them had challenges because they had to reboot and restart many of these machines.

3:45
I’ll be honest with you, with some of these reboots around the world, sometimes they had to do up to 15, 16, 17 reboots to get this thing fixed.

3:56
And in healthcare especially, we’re still seeing a massive impact from some of these issues, because in this day and age you don’t necessarily think that, hey, you might actually have to have a person walk through a data center somewhere to reboot a machine, a server.

4:22
And in this case, sometimes it’s 15, 16, 17 times.

4:26
And so just think about the vast number of servers and Windows systems that had to be rebooted, and you’re still seeing massive problems, massive challenges, at this point in time.

4:37
Absolutely.

4:38
I think that there’s some pride in the quickness of the fix.

4:42
Within a matter of minutes, you know, this fix was deployed.

4:44
And like you said, the repercussion was the manual updates; few of the devices and companies affected were able to deploy the fix themselves within a matter of minutes.

4:55
And like you said, we’re still seeing that fallout for sure.

4:58
Yeah, yeah.

4:58
And you know, let’s be honest, if we talk about some lessons learned here for a second: when you’re deploying a piece of content that can crater millions and millions of systems in one fell swoop, maybe the deployment of some of these updates or changes needs to be staged.

5:23
I mean, if you think about what Apple does when they deploy a new iOS, or other companies that need to update systems throughout the world, what we’ve seen be successful is that they stage the rollouts of these deployments.

5:41
In this case, one of the biggest failures was: hey, we’re going to put this in place all at once.

5:47
They didn’t realize the mistake that was in there until it was too late.

5:50
And it happened instantaneously once that logic loop started to spin out of control.

5:58
And you know, the lesson there is: stage your deployments, for crying out loud.
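To make that staged-rollout idea concrete, here is a minimal sketch of a ring-based deployment gate. It is purely illustrative: the ring names, percentages, failure threshold, and the stubbed health check are assumptions for the sake of the example, not CrowdStrike’s or ClearDATA’s actual process.

```python
# Hypothetical sketch of a staged (ring-based) rollout gate.
# Ring names, sizes, thresholds, and the health check are illustrative assumptions.
import random
import time

ROLLOUT_RINGS = [
    ("internal", 0.001),  # dogfood on internal machines first
    ("canary", 0.01),     # roughly 1% of the fleet
    ("early", 0.10),      # 10%
    ("broad", 1.00),      # everyone else
]

MAX_FAILURE_RATE = 0.01   # halt if more than 1% of updated hosts report problems


def deploy_to_ring(update_id: str, fraction: float) -> float:
    """Push the update to a fraction of the fleet and return the observed failure rate.
    Stubbed out here; a real system would read fleet telemetry during a soak period."""
    time.sleep(0.1)                   # stand-in for the soak period
    return random.uniform(0.0, 0.02)  # stand-in for telemetry


def staged_rollout(update_id: str) -> bool:
    for ring_name, fraction in ROLLOUT_RINGS:
        failure_rate = deploy_to_ring(update_id, fraction)
        print(f"{update_id}: ring={ring_name} ({fraction:.1%}) failure_rate={failure_rate:.2%}")
        if failure_rate > MAX_FAILURE_RATE:
            print(f"{update_id}: halting at ring '{ring_name}' and rolling back")
            return False
    return True


if __name__ == "__main__":
    staged_rollout("content-update-001")
```

The point is simply that each ring has to pass a health check before the update reaches a larger share of the fleet, so a bad update is caught while its blast radius is still small.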

6:06
Absolutely. And when we think about lessons learned, what a warning this was.

6:12
And how much we rely on digital technology, especially in these major industries where we have to rely on it functioning perfectly; these major, important industries have to run without a hitch.

6:27
So it was a major global warning, and it underscores the need for robust cybersecurity and resilience strategies within companies.

6:36
And there’s a growing number of cyber threats and vulnerabilities out there, and you’ve spoken about it before.

6:42
These are not just individuals at their computers.

6:44
These are organized bad actors and everything.

6:47
So how did you manage internal and external communications during this and maintain transparency with your own clients, who might be coming back and questioning their cyber resilience and their own strategies?

7:00
Well, let’s be clear that this was not a cyber attack.

7:03
This was a massive misconfiguration.

7:07
And you know, the tactics were not necessarily the best.

7:11
In my opinion.

7:13
It was sad because a lot of people had to skip surgeries, they had to miss planes; there were all kinds of inconveniences, and then the toll on humans.

7:27
We will probably never know the toll on humans themselves.

7:31
So it was definitely a thing.

7:33
How did we respond?

7:35
One of the first things that I did was give our leader of marketing a call and say, hey, we’ve got to start talking to people.

7:46
We’ve got to start identifying our stakeholders.

7:50
Those include not only our customers, especially our customers, but also our board of directors.

7:57
They include our employees.

7:59
They include any of the regulatory bodies that might have something to say about it.

8:11
We had to jump massively into a communications strategy. One of the most important things in a crisis is communication, and making sure that we take a look at what’s happening in real time.

8:29
So those were the biggest things.

8:30
I think our marketing team did a great job on that.

8:34
Pat yourselves on the back, all of you marketing folks who did a good job.

8:40
But thinking about going forward, let’s talk about how to recover.

8:49
One of the things that we used to do at ClearDATA early in our history was comb the landscape doing audits for critical access hospitals.

9:01
It was one of the ways that we became steeped in the healthcare scene.

9:07
It was a necessary thing, especially when the HITECH Act came about and people were moving from paper to digital.

9:14
So we had to go help identify critical systems.

9:18
One of the tenets of the Security Rule under HIPAA is to have a PHI inventory and to inventory your critical assets, the critical systems that have to be in place and can’t go down.

9:36
And not surprisingly, when something like this did happen, those who have a criticality analysis, an inventory of their critical systems, could go look at those first and redeploy them or reboot them or do whatever they needed to, instead of fiddling around with the periphery: hey, here’s a server that may not do anything, but let’s go reboot that and waste our time on it, versus going after what’s critical.

10:09
So understanding your criticality analysis is a really important thing.
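As a rough illustration of that criticality analysis, here is a minimal sketch of a critical-asset inventory that a recovery team could sort to work through the highest-priority systems first. The asset names, tiers, and fields are hypothetical examples, not a prescribed HIPAA artifact.

```python
# Minimal sketch of a critical-asset inventory used to prioritize recovery.
# Asset names, tiers, and fields are hypothetical examples.
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    holds_phi: bool   # stores or processes PHI?
    criticality: int  # 1 = mission critical ... 4 = peripheral
    rto_hours: float  # recovery time objective agreed with the business


INVENTORY = [
    Asset("ehr-db-primary", holds_phi=True, criticality=1, rto_hours=1),
    Asset("pharmacy-api", holds_phi=True, criticality=1, rto_hours=2),
    Asset("scheduling-web", holds_phi=True, criticality=2, rto_hours=8),
    Asset("marketing-site", holds_phi=False, criticality=4, rto_hours=72),
]


def recovery_order(inventory):
    """Reboot or redeploy the most critical, tightest-RTO systems first."""
    return sorted(inventory, key=lambda a: (a.criticality, a.rto_hours))


if __name__ == "__main__":
    for asset in recovery_order(INVENTORY):
        print(f"tier {asset.criticality}: {asset.name} (RTO {asset.rto_hours}h)")
```

Even a spreadsheet with the same columns serves the purpose; what matters is that the ordering exists before the outage, not during it.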

10:14
Thank you.

10:15
Absolutely.

10:16
And in the context of vendor evaluation, and reflecting on your own role as a vendor, how did this change your view, if it did at all, of your own practices for evaluating your vendors?

10:29
Well, what our vendors have to meet are the same standards that we have to meet, which are defined by the covered entity: the hospital, the provider, the insurance company, etcetera.

10:42
So in terms of the vendor situation, nobody could have predicted this.

10:48
CrowdStrike was a very well-respected company, and still is. It was a very unfortunate incident that happened.

10:57
They’ve learned a lot.

10:59
I bet you’re going to see staged rollouts of changes and tighter change management processes.

11:05
But a vendor has to be interrogated, if you will, before it can ever touch something that’s going to impact a patient’s life or patient safety, or even a system that could have downstream effects.

11:23
And so again, we know that vendors cause probably 83 percent or so of the misconfigurations, the outages, and the breaches that happen as well.

11:36
And we just have to continue to work with our vendors, make sure that we do those regular assessments, kick the tires.

11:45
How are they doing?

11:46
Are they still doing what they say they’re doing?

11:49
How is that going?

11:50
And so vendor diligence is a very important part of the equation here.
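One lightweight way to picture those regular assessments is a sketch like the following, which flags vendors whose periodic review is overdue and puts PHI-handling vendors at the top. The vendor names, review cadence, and fields are hypothetical assumptions, not ClearDATA’s actual program.

```python
# Minimal sketch of tracking recurring vendor assessments on a schedule.
# Vendor names, cadence, and fields are hypothetical examples.
from datetime import date, timedelta

REVIEW_CADENCE = timedelta(days=365)  # assumed annual reassessment

VENDORS = [
    {"name": "endpoint-security-vendor", "last_assessed": date(2023, 6, 1), "touches_phi": True},
    {"name": "billing-clearinghouse", "last_assessed": date(2024, 2, 15), "touches_phi": True},
    {"name": "office-supplies", "last_assessed": date(2022, 9, 30), "touches_phi": False},
]


def assessments_due(vendors, today=None):
    """Return vendors whose periodic assessment is overdue, PHI-handling vendors first."""
    today = today or date.today()
    overdue = [v for v in vendors if today - v["last_assessed"] > REVIEW_CADENCE]
    return sorted(overdue, key=lambda v: not v["touches_phi"])


if __name__ == "__main__":
    for v in assessments_due(VENDORS):
        print(f"Reassess {v['name']} (last assessed {v['last_assessed']})")
```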

11:54
And you also participated in a webinar on mitigating third-party risks in vendor evaluation.

12:01
So we’ll go ahead and link that in the resources below the video for your reference.

12:05
Since it is just such a huge and important part of business resilience.

12:10
So I’d like to ask you a quick question about regulators: what role do they play here, and what role should they play in response to this recent outage?

12:22
Well, given healthcare’s challenges of keeping systems up and keeping the bad guys out, I think regulators should start to focus on resiliency.

12:35
I know that there is an effort afoot to make everything better in healthcare from an IT perspective, but I don’t see a lot of focus on doing your disaster recovery or business continuity tests.

12:52
You know, folks like the HITRUST Alliance will say, hey, this is part of the control set that you have to have in place to be certified under HITRUST.

13:03
But you know, there’s a loosey-goosey nature to the regulations around healthcare, some of which are just crazy voluntary measures that will never happen.

13:18
You know, we’ve got to be more serious about resiliency.

13:22
And if we’re not going to do the DR/BC tests on a regular basis, shame on us.

13:31
We’re not practicing the drills, we’re not practicing for the game.

13:36
And then when we get to go time, people don’t know what to do.

13:41
And so we just have to focus more on resiliency.

13:43
I think it’s one of the Achilles’ heels of healthcare.

13:48
Absolutely.

13:49
And we’ve talked about those voluntary measures; you know, what is your business going to focus on if something is only voluntary?

13:54
And so it comes down to setting those initiatives promptly and clearly.

13:58
So.

13:59
OK, so we can go ahead and wrap up here soon.

14:02
Thank you so much for all of your insights so far.

14:05
But just to wrap up with some advice-centered questions.

14:09
What key lessons or advice would you give to other organizations to enhance their own resilience strategies?

14:17
You’ve mentioned a few of them, you know, PHI inventories, staged rollouts, all that kind of stuff, but I’d love to consolidate some of your thoughts right now.

14:25
Yeah, if you’re a vendor, you need to think about your legal implications, your consequences.

14:31
If you have an event like this, you need to understand the figures: what is it, 5.4 billion? Those are the numbers being floated around in terms of how much this is going to cost the world.

14:43
You know, the insurance companies, for those of us going through that cycle of insurance renewals,

14:51
we’ve already seen the onslaught of emails saying, hey, do you remember the CrowdStrike event?

14:57
You need to think about this in your renewal.

15:01
Of course we know that we need to think about that in our renewals, but understand your cyber coverage.

15:06
What if an incident like this happens to you?

15:09
How are you protected?

15:11
Have you offloaded some of that responsibility?

15:13
That liability?

15:15
You need to define your claims.

15:17
You need to gather evidence as you go.

15:19
A lot of folks just respond to the incident and forget about the evidence, or they accidentally destroy the evidence, those kinds of things.

15:29
Again, we talked about the criticality matrix.

15:31
Do your security and business risk analysis after an event like this.

15:37
And there’s some more that we could point you to.

15:40
We’re going to put them on screen, and I’ll link to them as well.

15:45
We’ll put up some more of these things as well to make sure that you can just take a look at the playbook and go execute on it.

15:52
But we need to do that.

15:54
Well, Chris, thank you as always for your insights and your thoughts here.

15:57
Always appreciate it.

15:58
Join us next time, everybody, for episode #24 of Cracking the Code and have a great day.

16:05
Thanks a lot.

Don’t wait for cloud unknowns to become cloud nightmares.

Schedule Your Cloud Risk Checkup Today.

Request Checkup