How SRE Creates a Blameless Culture

People in tech love to praise failure as essential to innovation. But at far too many organizations, when failure happens, it is punished. When an IT system goes down, an individual or a team is immediately ID’d, blamed and shamed. This is counterproductive. Blame damages companies by creating a play-it-safe atmosphere and stifling innovation; even worse, when mistakes are made, they are often hidden.

Fortunately, more and more organizations are recognizing the damage that blame can do and embracing a blameless culture. This is a big shift, yes, but companies can make the move to blameless—and make it stick—if they bring site reliability engineering (SRE) best practices to their technology teams.

Why Blame Is Bad

I’ve worked at several fast-growth companies that suffered under a culture of blame. At one, we were scaling quite rapidly. Change was constantly being introduced to the infrastructure, which meant that it broke all the time.

Because we had customers running their core business on our systems, our reliability requirements were very high—so high that for a while we stopped feature development altogether, in fear that new features would bring our systems down.

This was difficult enough, but the culture of blame made it worse. There was no psychological safety net and fingers were pointed constantly. There was an us-versus-them mentality—engineers versus non-engineers, developers versus operations, developers versus developers. Developers dedicated way too much mindshare to protecting themselves and their domain.

In a blame-rich environment, developers who should be delivering concrete results spend too much of their time in meetings trying to justify their work. There is a culture of retribution for mistakes. This leaves little room to learn and grow from failure, because developers who make mistakes are harassed by management or fired.

Your best people—those empowered to make decisions—see their responsibility undermined and go elsewhere. Creativity and innovation slow to a crawl. Meanwhile, competitors run ahead.

Why Blameless Is Better

In a blameless culture, everyone feels safe and no one is afraid to make mistakes. It’s a psychologically secure environment, where true DevOps can happen. Developers feel confident enough to express their ideas and take chances. Development folks and operations folks collaborate well, and everyone is aligned on the problem.

The work environment is positive and workers are assumed to be doing their best. When mistakes happen, people aren’t blamed. Instead, any error is viewed as a manifestation of an underlying vulnerability in the systems. Attention is focused on fixing those vulnerabilities and, as a result, systems are constantly improving.

How to Get to Blameless

Let’s say you just had an outage and your database crashed. Don’t start by placing blame on a team or individual. Rather, start by conducting a postmortem. Ask questions and find the answers: Why did the database crash? Because Alex pressed the wrong button. Well, why? Because we didn’t have a system in place to enable automatic checking or a review process. So why did we not have that?

Now you’re getting to blameless and you’re reaping the benefits. You’re having a productive conversation and you’re focused on fixing your systems, not affixing blame. It’s not about Alex pressing the wrong button; it’s about how your organization is not set up to ensure success. It’s about this vital opportunity to learn from a mistake and get better. Don’t fire Alex. Put a solution in place so the problem doesn’t happen again.
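The chain of "why" questions above can be written down as a small record, so that the postmortem ends on a systemic cause and a concrete action item rather than on a name. What follows is only a minimal sketch; the class names, fields, and the rule that a chain counts as blameless when its last answer points at a system gap are all illustrative assumptions, not a standard SRE tool or format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WhyStep:
    question: str
    answer: str
    # Hypothetical flag: True when the answer points at a process or
    # tooling gap rather than at a person.
    about_system: bool


@dataclass
class Postmortem:
    incident: str
    steps: List[WhyStep] = field(default_factory=list)

    def ask(self, question: str, answer: str, about_system: bool) -> "Postmortem":
        """Record one question/answer pair and return self for chaining."""
        self.steps.append(WhyStep(question, answer, about_system))
        return self

    def action_items(self) -> List[str]:
        """Turn every systemic answer into something that can be fixed."""
        return [f"Address: {s.answer}" for s in self.steps if s.about_system]

    def is_blameless(self) -> bool:
        """Assumed rule: the chain must end on a systemic cause."""
        return bool(self.steps) and self.steps[-1].about_system


# The database example from the text, with the person left out of the answers.
pm = Postmortem("Database crashed")
pm.ask("Why did the database crash?", "The wrong button was pressed", False)
pm.ask("Why was that possible?", "No automatic check or review process was in place", True)

print(pm.is_blameless())
print(pm.action_items())
```

If the chain stops at "the wrong button was pressed", is_blameless() returns False, which is a cue to keep asking why.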

How SRE Can Help

To get to a truly blameless environment, you need to implement SRE best practices. Why? Because SRE creates ultra-scalable and highly reliable software systems, by building in automation and tooling—moving the human element out of the picture and focusing on the systems.

One of the founding principles of the SRE movement is blamelessness. By implementing an SRE team and establishing SRE practices, your organization can move from the traditional operational model of break-fix to an environment that applies a development approach to IT operations, with the goal of improving reliability via automation, continuous integration and delivery.

For this to happen, you will need an SRE champion to quarterback your SRE effort. You will also need organizational buy-in to blamelessness. It has to be practiced top-down and have the full support of all leaders in the organization.

Tooling is critical, too. Look for tools and dashboards that speed up postmortems and surface the action items that need attention, and that identify the most commonly impacted services, products and customers, along with the contributing factors that most often lead to incidents. The right tools will help you resolve incidents faster by providing a clear overview of metrics related to incident identification and the actions that need to be taken.
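To make the kind of overview described above concrete, here is a rough sketch of how a dashboard might rank impacted services and contributing factors from a list of incident records. The record layout, the sample data, and the function names are hypothetical and chosen only for illustration; a real incident-management tool will have its own schema.

```python
from collections import Counter

# Hypothetical incident records; the field names are assumptions.
incidents = [
    {"service": "database", "factors": ["no review process", "manual change"]},
    {"service": "database", "factors": ["no review process"]},
    {"service": "checkout", "factors": ["no review process", "missing alert"]},
]


def top_services(records, n=3):
    """Return the n services that appear in the most incidents."""
    return Counter(r["service"] for r in records).most_common(n)


def top_factors(records, n=3):
    """Return the n contributing factors cited most often across incidents."""
    return Counter(f for r in records for f in r["factors"]).most_common(n)


print(top_services(incidents, n=1))
print(top_factors(incidents, n=1))
```

Note that the records count services and systemic factors, not individuals, which keeps the resulting report pointed at what to fix rather than whom to blame.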

You want to move fast and be innovative. But can you do it reliably? With a blameless culture, you can. You’ll have a safe environment that attracts and retains the best talent. You’ll move quicker, because you’ll be focused on fixing systems that break and finding out why—not the blame game. By delivering a higher degree of reliability, you’ll build greater trust with your customers.

You’ve heard the saying: Failure is not an option. In a blameless environment, failure is always an option—because it means that systems are always improving and innovation is always happening.

Ref: https://devops.com/how-sre-creates-a-blameless-culture/


Handbook: How to Develop and Support Managers