5 DevOps Horror Stories That Will Scare Your Pants Off

5 spooky tales of tech gone wrong

31 Oct

As the month winds to a close, tricks and treats are in the air. AI and machine learning have grown by leaps and bounds in the past few years, but such advances bring unexpected consequences. Technology and connected devices are present in our homes, our workplaces, our cars, and our hospitals. They’re everywhere. This makes the job of QA professionals and DevOps teams more challenging than ever. With this in mind, many teams make it their mission to build a robust, sustainable product without becoming a casualty of poor processes.

In the spirit of the upcoming holiday, I have put together five spooky tales of tech gone wrong. Read up on my best practices and tips so you don’t become the next horror story!

#1: Bankrupting a company due to a failed deployment

In 2012, a company called Knight Capital lost $440 million in 45 minutes and disrupted the stock market, all due to a failed deployment. Because of technical debt (lingering code over 8 years old that was no longer in use) and a botched delivery procedure, new automated trading code made it onto only 7 of 8 SMARS servers. The eighth server still had the old, obsolete code. When a repurposed configuration value activated that old code on the eighth server, it began making automated trades at lightning speed, disrupting the prices of hundreds of stocks and moving millions of shares. In addition, emails about the erroneous trades were sent to Knight staff, but they weren’t marked as urgent system alerts, allowing the problem to fester.

What can we learn from this fiasco? Releasing software should be a repeatable, reliable process that is as free from human error as possible.

Along with using robust continuous deployment processes, having adequate test coverage can help prevent catastrophes like this one. The benefits of unit testing usually far outweigh the initial time investment.
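
One way to catch a partial rollout like this is to confirm that every server reports the expected build before switching on any new behavior. The sketch below is only an illustration: the server names, version strings, and the functions `find_stale_servers` and `safe_to_enable` are hypothetical, not details of Knight's systems.

```python
def find_stale_servers(deployed_versions, expected_version):
    """Return the sorted names of servers not running expected_version."""
    return sorted(
        name
        for name, version in deployed_versions.items()
        if version != expected_version
    )


def safe_to_enable(deployed_versions, expected_version):
    """Allow the new feature flag only when every server is up to date."""
    return not find_stale_servers(deployed_versions, expected_version)


# Hypothetical fleet: seven servers updated, one missed by the rollout.
fleet = {f"server-{i}": "2.0" for i in range(1, 8)}
fleet["server-8"] = "1.0"

print(find_stale_servers(fleet, "2.0"))  # lists the server that was missed
print(safe_to_enable(fleet, "2.0"))      # the flag should stay off
```

A check like this belongs in the automated release pipeline, so that it runs on every deployment instead of relying on someone remembering to do it.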

#2: Making a typo and bringing down thousands of sites

Amazon’s AWS powers a huge number of the largest sites on the internet, so when something goes awry with AWS, it’s very obvious. On February 28th, 2017, while an engineer was trying to take a small number of S3 servers offline to debug a billing issue, an 'incorrectly entered command' removed a much larger set of servers than intended. That’s right: a typo.

In order to fix the error, Amazon needed to restart the affected services. However, they had not been restarted in 'many years' and took longer than expected to restart and perform necessary checks, prolonging the outage.

Instead of blaming the developer, Amazon took action to review their deployment processes and opened the postmortem to comments for anyone within the company. When disaster strikes, proactive communication and working to prevent future snafus are often the best remedy.
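
Operational tools can also be built to reject obviously oversized requests, so a mistyped number fails loudly instead of taking capacity offline. Here is a minimal sketch of such a guard rail. The function `check_removal` and the 5% default limit are assumptions made for illustration, not Amazon's actual tooling or thresholds.

```python
class CapacityGuardError(Exception):
    """Raised when a command would remove too many servers at once."""


def check_removal(requested, fleet_size, max_percent=5):
    """Return `requested` if it is within the allowed share of the fleet.

    max_percent is a hypothetical policy value chosen for this example.
    """
    if fleet_size <= 0 or requested < 0:
        raise ValueError("fleet_size must be positive and requested non-negative")
    limit = fleet_size * max_percent // 100
    if requested > limit:
        raise CapacityGuardError(
            f"refusing to remove {requested} of {fleet_size} servers "
            f"(limit is {limit}); re-run with a smaller batch"
        )
    return requested


print(check_removal(10, 1000))  # a small batch is within the limit
try:
    check_removal(500, 1000)    # a mistyped count is rejected
except CapacityGuardError as err:
    print(err)
```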

#3: Not testing the backup/restore process until it’s too late

Earlier this year, GitLab accidentally deleted production data while performing routine database maintenance. Just restore from a backup, right? Well, it turns out that their previous backup procedures had not executed correctly. And they hadn’t been testing them, simply trusting that the backups went off without a hitch. Needless to say, it was a huge wake-up call, not only for GitLab but for the tech community at large.

Backups only work if you routinely test your snapshots and restore processes before it’s too late. Create automated test pipelines for your backup and restore procedures, and run them every day, at least!
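
As a rough illustration of what an automated restore check can look like, the sketch below backs up a database, opens the backup, and confirms it is intact and holds the same number of rows as the source. It uses SQLite from the Python standard library purely as a stand-in (GitLab runs PostgreSQL), and the table name and helper functions are hypothetical.

```python
import os
import sqlite3
import tempfile


def backup_database(source, backup_path):
    """Copy the source database into a new file at backup_path."""
    dest = sqlite3.connect(backup_path)
    try:
        source.backup(dest)
    finally:
        dest.close()


def verify_backup(source, backup_path, table):
    """Open the backup and compare it with the source.

    `table` must be a trusted identifier; it is interpolated into SQL.
    """
    restored = sqlite3.connect(backup_path)
    try:
        intact = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        query = f"SELECT COUNT(*) FROM {table}"
        same_rows = (
            source.execute(query).fetchone() == restored.execute(query).fetchone()
        )
        return intact and same_rows
    except sqlite3.DatabaseError:
        # A missing table or corrupt file means the backup is unusable.
        return False
    finally:
        restored.close()


# Hypothetical data set used only for this demonstration.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE projects (name TEXT)")
source.executemany("INSERT INTO projects VALUES (?)", [("alpha",), ("beta",)])
source.commit()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "backup.db")
    backup_database(source, path)
    backup_ok = verify_backup(source, path, "projects")
    empty_ok = verify_backup(source, os.path.join(tmp, "empty.db"), "projects")

print(backup_ok, empty_ok)
```

A row count is a weak check by itself. A real pipeline would restore into a scratch environment and run application-level queries against it, but even this much would flag a backup that silently produced an empty file.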

#4: Deleting a production database the first day on the job

A story that gained much media attention earlier this year was the tale of woe of a junior software engineer who accidentally deleted a production database on their first day. While this is a nightmare for any engineer, the incident raised other questions about the robustness of the infrastructure and deployment processes at the company. If the developer was working in a local test environment, why did they have access to the production data in the first place? What other security holes may exist?
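
The real fix is to keep production credentials out of onboarding documents and developer machines altogether. As a last line of defense, destructive helpers can also refuse to run against production. The following is a minimal sketch of that idea. The environment names and the `reset_database` helper are hypothetical, and it assumes the caller passes in the environment the script is pointed at.

```python
class ProductionAccessError(Exception):
    """Raised when a destructive action targets production."""


PRODUCTION_NAMES = {"prod", "production"}


def require_non_production(environment):
    """Stop immediately if the target environment looks like production."""
    if environment.strip().lower() in PRODUCTION_NAMES:
        raise ProductionAccessError(
            f"destructive operation blocked in '{environment}'"
        )


def reset_database(environment, drop_all_tables):
    """Run the destructive callback only outside production."""
    require_non_production(environment)
    drop_all_tables()


dropped = []
reset_database("development", lambda: dropped.append("dev tables"))
print(dropped)

try:
    reset_database("Production", lambda: dropped.append("prod tables"))
except ProductionAccessError as err:
    print(err)
```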

#5: Pushing code with incomplete or nonexistent tests (especially when it breaks the build)

We’ve all been guilty of this at one time or another, but it doesn’t make life any easier for your friendly neighborhood QA team. We all know that writing tests improves code quality and maintainability, but for a lot of people it’s like eating vegetables or exercising regularly. Code with faulty tests or poor coverage makes it harder to find errors and even harder to fix them. Untested changes can cause the classic 'but it works on my machine' bug and break builds for the entire team. And no one wants that.

Testing doesn’t have to be a painful process if you use automated unit testing tools.
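
To show how little ceremony a unit test needs, here is a small sketch using Python's built-in `unittest` module. The `apply_discount` function is a made-up example, not code from any of the incidents above.

```python
import unittest


def apply_discount(price, percent):
    """Return price reduced by percent, rounded to two decimal places."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (100 - percent) / 100, 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_zero_discount_keeps_price(self):
        self.assertEqual(apply_discount(80.0, 0), 80.0)

    def test_rejects_invalid_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(80.0, 120)


# Run the tests in-process; a CI job would normally use `python -m unittest`.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Running a suite like this on every push, and blocking the merge when it fails, keeps a broken build from reaching the rest of the team.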

Conclusion

Everyone loves a good Halloween scare, but no developer wants to be caught off guard by brittle deployment processes, security holes, or failed backups. With testing services and tools, you will be one step closer to a less scary build process.

Eli Lopian is the founder and CEO of Typemock, the leading solution for unit testing and the first mocking framework for legacy code. He founded Typemock in 2005. With over 20 years of R&D experience at companies such as AMDOCS (NYSE:DOX) and Digital Equipment Corporation (DEC), he has established himself as a thought leader and expert in Clean Code, Unit Testing, Agile and Management.