Instead of Dreading a System Crash, Schedule One and Learn to Avoid Them

The best defense against outages is to rehearse for the worst and accept real incidents as an opportunity to improve.

By Nisha Ahluwalia Edited by Dan Bova

Opinions expressed by Entrepreneur contributors are their own.

According to a survey by CA Technologies, companies in North America and Europe lost more than $26.5 billion in revenue due to downtime, and that's from 2010!

There are various ways to calculate the monetary cost of system outages, but the damage to a company's reputation is immeasurable. When Microsoft's Azure cloud-computing service experienced a major outage recently, experts speculated that it could be a serious blow to the software giant's attempt to compete against rivals Google and Amazon.

Good CEOs and CIOs refuse to accept excuses for even small levels of downtime, but it's not easy to hit five nines (99.999 percent) of reliability. Nonetheless, no matter how complex a company's systems and business, there are always ways to engineer and deliver higher reliability and quality of service. Below are the actions that CEOs need to take to boost their company's reliability:

1. Stop waiting for an outage. Create one.

If you wait for a customer to do something that causes a failure, you're too late. For example, Netflix has tackled unexpected outages using its "Simian Army," a set of automated tools that test applications for failure resilience. However, for most companies, the best way to handle this is to keep it simple.

Encourage your ops and dev teams to schedule a recurring meeting and create outages manually. Injecting failure reveals implementation issues that reduce resiliency while proactively uncovering deficiencies that would otherwise be the root cause of an outage.

Scheduled outages build a strong collaborative culture simply by bringing teams together on a regular basis. Working together to fix artificial failures will combat the idea that an actual failure can be ignored or justified with explanations.
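For teams that want to rehearse in code before touching real infrastructure, the manual failure injection described above can be sketched roughly as follows. This is a minimal illustration, not a description of Netflix's tools: `with_failure_injection`, `fetch_price` and `price_with_fallback` are hypothetical names invented for this example, and the "dependency" is a stand-in dictionary lookup.

```python
import random


class InjectedFailure(Exception):
    """Raised when the drill deliberately fails a call."""


def with_failure_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFailure."""
    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0), so a rate of 1.0
        # always fails and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFailure(f"drill: forced failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    # Hypothetical stand-in for a real downstream dependency.
    return {"widget": 9.99}.get(item, 0.0)


def price_with_fallback(fetch, item, default=0.0):
    """The behavior under test: degrade gracefully instead of crashing."""
    try:
        return fetch(item)
    except InjectedFailure:
        return default


# During a drill, every call to the dependency is forced to fail.
flaky_fetch = with_failure_injection(fetch_price, 1.0, random.Random(0))
drill_result = price_with_fallback(flaky_fetch, "widget", default=-1.0)
```

If the caller has no fallback, the drill surfaces that gap in a scheduled meeting rather than in front of a customer.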

2. Create (and protect) time for learning

No good engineer fixes the same problems without learning in the process. Make sure the teams responsible for resolving incidents have time to work through comprehensive postmortems.

Empower your teams to analyze what worked and what didn't, without forcing them to determine a root cause. All too often, human error is the focus of these conversations, but that just isn't healthy. Blameless retrospectives allow teams to uncover the real issues and make proactive adjustments.

Businesses want to move fast, but resist the temptation to move on to other issues as soon as systems resume running or everyone agrees on a "root cause." Invest the time needed to understand how your systems and teams work. See it as an opportunity for the contextual learning needed to make real-time decisions that will improve your company's mean-time-to-resolution.
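Mean-time-to-resolution is easy to track once incidents are logged with timestamps. The sketch below shows one common way to compute it, assuming each incident is recorded as a (detected, resolved) pair; the function name and the two sample incidents are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) across incidents, as a timedelta."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


# Made-up incident log: (detected, resolved) pairs.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),      # 30 minutes
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 15, 30)),  # 90 minutes
]
mttr = mean_time_to_resolution(incidents)  # one hour for this sample
```

Comparing this number before and after a round of postmortems is one rough way to see whether the learning time is paying off.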

3. Treat your ops and dev teams like sales and marketing. They drive revenue.

If you didn't support your sales teams with tools, training and incentives to hit their goals, people would think you were nuts. Despite their critical role in ensuring your customers are getting value from your company, ops and dev teams often get less attention than their customer-facing counterparts.

Give these employees the infrastructure and tools to achieve peak performance. That includes the latest operations management tools, time and resources for training, and goals with incentives to meet them. If you don't provide them with the necessary support and recognition, how can you expect them to deliver a high-value product with high availability?

4. Set a high bar for uptime

Even short periods of downtime have a material impact on your bottom line and market perception, but once you're committed to supporting your engineering teams, you're in a much better position to set a higher bar for uptime. Build, buy or partner to get the technology and skill sets you need.
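To make an uptime target concrete, it helps to translate it into a yearly downtime budget. The short calculation below is a sketch under simple assumptions (a 365-day year, availability measured over the whole year); the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


three_nines = allowed_downtime_minutes(99.9)    # about 525.6 minutes (under 9 hours)
five_nines = allowed_downtime_minutes(99.999)   # about 5.26 minutes
```

At five nines, the entire annual budget is smaller than the time it takes many teams to notice and acknowledge a single incident, which is why the target demands investment in tooling and practice.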

Unfortunately, many companies still use homegrown operations management systems without redundancy, and still use disparate tools and manual processes to meander through the incident lifecycle. A focus on reducing ops team costs instead of setting the right culture from the start simply doesn't make sense. The time spent on fixes alone will quickly become a greater cost for your company. Your product and services will suffer as a result.

CEOs who understand the importance of reliability in today's always-on world don't wait until there's an outage to improve operations. They don't ignore the rich learning that comes from resolving incidents. They don't treat operations and development teams like the "back office." The CEOs of highly reliable companies invest in their operations infrastructure, processes and people because they care about the growth of their business and the loyalty of their customers.

Nisha Ahluwalia

Vice President of Marketing at PagerDuty

Nisha is vice president of marketing, responsible for all things marketing including generating demand, building the PagerDuty brand and our community activities. She comes to PagerDuty with strong software-as-a-service experience, having built and managed several marketing functions at RingCentral and Cisco WebEx. Before she got into marketing, Nisha earned her bachelor of science in computer science from San Jose State University.
