Southwest Airlines Saved From Global IT Outage Thanks To 32-Year-Old Microsoft System

I wouldn't call someone's decision "great" or "bad" if it succeeds or fails just by chance.

You can't prevent every screw-up; anyone, new or old, can make a mistake. The main problem today is more "why are they rolling out everything to every machine at the same time?" Typically a large-scale rollout is done a few machines at a time, and the rest are rolled out gradually, only after the issues found along the way have been solved. It is also wrong to assume you never need to prepare for the worst; whether you use Win 3.1, Win 11, OS X, Linux, or a mechanical system does not matter.
 
I cannot speak for all Linux distributions, but Debian gets pinged and disliked because its rollouts and updates are "old". Debian is usually two years behind on rollouts, and the long-term releases are very long. Security updates come rather quickly but were, until recently, extensively vetted. Being open source has its advantages. There are, literally, thousands of development testers out there.

Although capable, Debian does not do automatic updates. Users are notified of security vulnerabilities and patches, but it's up to the user to apply them. I recently, against Debian's recommendation, configured my machines to do automatic updates (I've gotten lazy!). The only problems I've had with Linux distros were with rolling-release versions.
 
I got this email from Delta today. I thought it was well-written.

PERSPECTIVE | DELTA NEWS HUB
An Update to Delta Customers from CEO Ed Bastian

Like many companies worldwide, Delta was impacted on Friday morning by an outside vendor technology issue, which prompted us to pause flying while our systems were offline.



The pause in our operation resulted in more than 3,500 Delta and Delta Connection flights cancelled through Saturday. Cancellations continue on Sunday as Delta’s teams work to recover our systems and restore our operation. Canceling a flight is always a last resort, and something we don't take lightly.

The technology issue occurred on the busiest travel weekend of the summer, with our booked loads exceeding 90%, limiting our reaccommodation capabilities. I want to apologize to every one of you who have been impacted by these events. Delta is in the business of connecting the world, and we understand how difficult it can be when your travels are disrupted.



Please know that Delta’s entire team of the best professionals in the business have been working around the clock to safely get you where you need to go, and restore the reliable, on-time experience you've come to expect when you fly with us.



Specifically, the issue impacted the Microsoft Windows operating system. Delta has a significant number of applications that use that system, and in particular one of our crew tracking-related tools was affected and unable to effectively process the unprecedented number of changes triggered by the system shutdown. Our teams have been working around the clock to recover and restore full functionality.

We have issued a travel waiver for you to make a change to your itinerary at no charge. I encourage you to take advantage of that flexibility if possible. In addition, for those whose flights have been impacted, we continue to offer meal vouchers, hotel accommodations and transportation where available. And as a gesture of apology, we’re also providing impacted customers with Delta SkyMiles and travel vouchers. We will continue to keep you informed via delta.com and the Fly Delta app for the latest information on your itinerary.



I want to thank Delta’s employees, who have been working tirelessly across our system to restore our best-in-class operation and take care of you during a very challenging situation.



Thank you for your patience as we work through these issues, restore our operation and return to the reliability you expect from Delta.


©2024 Delta Air Lines, Inc. All rights reserved.
Delta Blvd. P.O. Box 20706, Atlanta, GA, 30320
 
Southwest might still be working on replacing the system; two years seems a relatively short time to completely replace a system at an enterprise of that scale.

Having flown on the three majors lately - flying in general is IMO a hot mess. I avoid it if possible.

My take -

American - best app and best entertainment/WiFi, nice equipment. Too bad DFW is the hub; it's a train wreck.

Delta - next-to-best app, next-to-best entertainment (you have to watch on the in-seat screens vs. your own device)

United - App apparently broken: impossible to access WiFi or entertainment, can't find reservations or save boarding passes to wallet... a real dumpster fire in my opinion. Passable equipment. The FA kept telling me to raise my seatback, which was already raised. So either the seat was broken or the FA was.

Since no one can seem to manage an on-time departure and arrival, I'll take American for the free movies if I have to. Overall, if I can drive it in 13 hours, I am not flying. And it's not all the airlines. The passengers, GEEZ: what's unclear about one carry-on and one personal item? (Though I suppose the airlines can be partially blamed for that, because they don't enforce it: they charge $35 to check a bag but will gate-check it for free if you get caught.) And flights are constantly oversold; "it's a very full flight today" is every flight.

Bonus points if anyone can explain why the entire planeload jumps up the second the plane stops. THE DOOR IS NOT EVEN OPEN...
 
I disagree with just about everything you’ve said.

I’m flying on United and using the app right now. Works great. That’s how I am posting this.

We left six minutes early on our flight to Boston. Smooth ride. Should land about 20 minutes early.

I posted the letter from Ed Bastian, while on a flight. I’m a Delta customer, as well.

I think I’ll sit back, relax, and watch a movie.

 
I feel like a new thing is to overstate the flight time, gate times, etc. and then beat them by 30 minutes. Happens often on Frontier, Spirit, United and Delta...

Unfortunately, you then tend to have to wait for a gate...
 
unattended-upgrades is a very popular package in Debian and is installed and activated by default in Ubuntu Server. It is configured out of the box to only grab security updates, which is where most people leave it. With that said, it is still a best practice and common sense to manually grab any update and test it prior to rolling out to production!
 
I started with DOS in 1983, so I know a thing or three about how DOS and Windoze evolved. Windows 98 was a very stable OS but was asked to run on the very paltry and buggy hardware of the time. It actually had very, very few exploitable vulnerabilities. Win 2K was "for business", but Microsoft panicked when Napster and file sharing took off and rushed to replace 98/ME with XP. 98 didn't require remote activation, and neither did 2K or ME, but Micro$haft built remote activation into XP. XP wasn't ready for general deployment as a consumer OS; it was from day 1 a trojan-hosting platform and was the reason spam was able to become an industry (direct-to-MX SMTP spamming through infected / trojanized XP systems). By 2014, after some 13 years, XP was finally locked down sufficiently to be about as safe to connect to the internet as 98.

I still run Win-98 on some systems, bolstered by KernelEx API enhancements created by a dedicated 98 fan base. I can run relatively modern browsers on it, for example, and also have up to 4 GB of usable RAM.
I was lucky back then (XP days). My brother worked IT for a university, and they had their own activation server.
He gave me a copy of their XP build, which was licensed to the university.
A couple of friends and I used that version until 7 came out.

And a company that he more recently worked for, a well-known manufacturer, had one of their critical systems (the one their dealers used to order and inventory parts, etc.) still running on, I believe, Server 2000(?). There was one guy who had a massive collection of documents, notes, etc. explaining how he managed to band-aid it enough to keep it working, but he'd never let anyone else see his work. He built it and was the only one who could maintain it.
 
What a horrible article.

They mix up Microsoft and CrowdStrike multiple times and quote the Microsoft CEO as the CrowdStrike CEO.

This issue really had nothing to do with Microsoft. Pretty much all anti-malware or EDR software runs as a kernel mode driver… it has to, in order to work effectively.

This is true on all platforms. Apparently, a similar issue existed with CrowdStrike on Linux recently; it just happened to be a less serious bug. And this has happened with other anti-malware software in the past. McAfee is a notable one, especially because the CEO of CrowdStrike was apparently the CTO of McAfee when it happened back then.

Microsoft has issues, Windows has issues, yes, but this same thing could happen to any vendor on any platform.

The only thing preventing such issues is quality control and good management. Things should be thoroughly tested, and then such updates should be rolled out in phases. For example: first update 1% of systems; if no reports of issues come back, update 10%; then, if there are still no issues after a while, do the rest.
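
To make the phased idea concrete, here is a minimal Python sketch of a 1% / 10% / 100% rollout that halts as soon as a wave reports a failure. It is only an illustration under stated assumptions: `staged_rollout`, `apply_update`, and the host names are hypothetical and do not describe how CrowdStrike or any other vendor actually ships updates.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], bool],
    stages_percent: Sequence[int] = (1, 10, 100),
) -> Tuple[int, List[str]]:
    """Update hosts in growing waves and stop after the first bad wave.

    apply_update(host) is a hypothetical callback that returns True when
    the host is healthy after the update. Returns (number of hosts that
    received the update, list of hosts that failed).
    """
    touched = 0
    for percent in stages_percent:
        # Cumulative number of hosts that should be updated after this wave.
        target = max(1, len(hosts) * percent // 100)
        wave = hosts[touched:target]
        failed = [host for host in wave if not apply_update(host)]
        touched += len(wave)
        if failed:
            # Halt here: the remaining hosts never receive the bad update.
            return touched, failed
    return touched, []


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(1000)]
    # A completely broken update only ever reaches the first 1% wave.
    print(staged_rollout(fleet, lambda host: False)[0])
```

With a fleet of 1,000 machines and an update that breaks every host, only the first wave of 10 machines is affected instead of all 1,000. In practice you would also want a waiting period between waves so that problems have time to surface.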

The very nature of a managed solution like CrowdStrike means it’s out of the hands of IT people, sysadmins, etc. at companies that use these products. It’s one of the main selling points. But then companies rely on the vendor to do a good job. In this case, they screwed up.

Fixing this issue was especially messy because of how complicated recovery was if you use BitLocker on the server or PC. BitLocker is a useful, important security feature that’s otherwise very beneficial. But in this case, that security feature complicated recovery massively.

None of these things are really Microsoft’s fault. Although I do think it would be cool if Windows, after detecting, let’s say, 5 failed boots caused by a certain kernel-level driver, tried to boot up without it. Not a perfect solution, but it would have made this outage massively less impactful and expensive. With all the telemetry and diagnostics built into Windows, it would theoretically be possible. Of course, that would have its own downsides and issues.
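
The "boot without the bad driver after 5 failures" idea can be sketched as a toy simulation in Python. Everything here is hypothetical (the `BootGuard` class, the driver names, the threshold of 5 taken from the paragraph above); it models the counting logic only and says nothing about how Windows actually handles boot failures.

```python
from collections import Counter
from typing import Iterable, List, Optional, Set

MAX_FAILED_BOOTS = 5  # threshold from the post; purely illustrative


class BootGuard:
    """Toy model: skip a driver after it crashes too many boots in a row."""

    def __init__(self, threshold: int = MAX_FAILED_BOOTS) -> None:
        self.threshold = threshold
        self.failures: Counter = Counter()
        self.disabled: Set[str] = set()

    def drivers_to_load(self, installed: Iterable[str]) -> List[str]:
        # Load everything except drivers that were disabled earlier.
        return [d for d in installed if d not in self.disabled]

    def record_boot(self, crashed_driver: Optional[str]) -> None:
        if crashed_driver is None:
            self.failures.clear()  # a clean boot resets all counters
            return
        self.failures[crashed_driver] += 1
        if self.failures[crashed_driver] >= self.threshold:
            self.disabled.add(crashed_driver)


def boots_until_recovery(
    installed: List[str], faulty: str, max_boots: int = 10
) -> Optional[int]:
    """Return the number of the first successful boot, or None if none."""
    guard = BootGuard()
    for boot in range(1, max_boots + 1):
        loaded = guard.drivers_to_load(installed)
        crashed = faulty if faulty in loaded else None
        guard.record_boot(crashed)
        if crashed is None:
            return boot
    return None


if __name__ == "__main__":
    drivers = ["disk", "network", "edr_sensor"]  # hypothetical names
    print(boots_until_recovery(drivers, faulty="edr_sensor"))
```

In this model the machine crashes on boots 1 through 5 and comes up on boot 6 with the faulty driver skipped. The obvious downside, as noted above, is that the machine then runs without its security sensor, so a real design would need to alert someone rather than silently carry on.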
 
Unfortunately, despite how well-written the letter might have been, Delta has failed to recover.

It’s been ugly. Still canceling hundreds of flights per day.

Prompting another apology to customers.


Since the CrowdStrike outage late last week, Delta’s team of the best professionals in the business has been working around the clock to restore the reliable, on-time operation you’ve come to know and expect when you fly with us.

While our initial efforts to stabilize the operations were difficult and frustratingly slow and complex, we have made good progress this week and the worst impacts of the CrowdStrike-caused outage are clearly behind us. Delays and cancellations were down 50% Tuesday compared to Monday, and we anticipate cancellations Wednesday to be minimal. Thursday is expected to be a normal day, with the airline fully recovered and operating at a traditional level of reliability.

I know the last few days have been difficult. To our customers who were impacted, I want to thank you for your patience and apologize again for the disruption to your travel.

We understand how important travel is in your lives, and we remain committed to taking care of those whose flights may still be impacted, with meals, hotel accommodations and ground transportation offered through vouchers and reimbursements. We’re also providing impacted customers with Delta SkyMiles and travel vouchers as a further gesture of apology.

I also want to extend my thanks and gratitude to Delta’s amazing team of 100,000 aviation professionals, who have been working tirelessly to take care of our customers and ensure their safety in a challenging operating environment.

We will continue to keep you informed via delta.com and the Fly Delta app for the latest information on your itinerary.

I’ve received emails from many of you who are understandably frustrated with the pace of progress and the difficulty in getting the service you deserve. I’ve also received many notes of encouragement and support commenting on the heroic efforts of our people, who are working under trying and stressful conditions. Thank you for your feedback, as well as your patience and understanding.
 
In the meantime, my daughter flew UAL yesterday morning. ORF-IAD-BOS.

Full flights. A few minutes early with each arrival.

A pleasant experience with an airline that is running well.
 
They all have their meltdowns; I'm sure you remember United's from June 2023. American is long overdue for one. I think they're the only major carrier that hasn't had one post-pandemic.
 
Well, now they've found out that even a security patch can be dangerous if deployed wrong. The thing really is, you either "don't put all your eggs in one basket" and accept that some eggs are more likely to get broken, or you focus on guarding all the eggs in one basket and take the chance of an "all broken" incident.

Maybe in the future each commercial PC should be redundant, so if things fail you can just boot into the other one and continue. It will cost more (double?) and it may not sit well with the accountants, but this incident is a wake-up call about the non-redundant nature of our systems.
 
Agreed. Even Debian, just a few months ago, issued a (security) kernel update that had an ext4 bug in it. By the time I'd seen the notice NOT to install the update, my dutiful unattended-upgrades had installed it and rebooted, as per my config. Luckily, by that time another minor kernel bump was available, so another apt-get upgrade and reboot was all that was required, and no data was lost. For a short while I was beginning to prepare a disaster response to restore my most recent system snapshot!

After that, I resolved to pay closer attention to the Debian-security mailing list and maybe use the Blue/Green system for more than major version OS upgrades.
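
For anyone unfamiliar with the term, here is a minimal Python sketch of the Blue/Green idea: keep two copies of the system, apply the update to the idle one, and only switch over if it passes a health check. The `BlueGreen` class, the `healthy` callback, and the version labels are hypothetical placeholders, not a description of any particular Debian tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Slot:
    name: str
    version: str


class BlueGreen:
    """Toy model of two system copies, only one of which is live."""

    def __init__(self, version: str) -> None:
        self.slots: Dict[str, Slot] = {
            "blue": Slot("blue", version),
            "green": Slot("green", version),
        }
        self.active = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def upgrade(self, new_version: str, healthy: Callable[[Slot], bool]) -> bool:
        """Upgrade the idle slot and cut over only if it checks out."""
        candidate = self.slots[self.idle]
        candidate.version = new_version  # stand-in for installing packages
        if healthy(candidate):
            self.active = candidate.name  # old slot remains as a rollback
            return True
        return False  # the live slot was never touched


if __name__ == "__main__":
    env = BlueGreen("kernel-A")  # hypothetical version labels
    env.upgrade("kernel-B", healthy=lambda slot: False)
    print(env.active, env.slots[env.active].version)
```

The point of the design is that a bad update only ever lands on the copy that is not in use, so there is nothing to restore from a snapshot; the cost is keeping a second copy of the system around.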
 
I think the problem is "groupthink." Why are all these companies using the exact same software? Did they all go to the same school? Seminar? Club? Doesn't CrowdStrike have any competition?
 
I do not see a way to justify paying twice as much for hardware, licenses, and maintenance to cover rare events. I bet the backup computers would probably be running CrowdStrike as well.

Palo Alto, SentinelOne, Bitdefender: there are lots of XDR products, but there are only a few that large SOCs use. Normally, grouping by industry is a good thing, because if one company gets hit, the XDR vendor will already have that experience when another, similar company gets hit.
 