How Threat Stack Does DevOps (Part IV): Making Engineers Accountable
This was originally posted on the Threat Stack blog - added here for continuity.
Early on at Threat Stack, we focused on giving engineers the tools and ownership over their applications that would empower them to deploy and manage their applications in a safe way without causing customer downtime or other issues. For a small but rapidly growing company, this is necessary for survival. For most of the last four years, Threat Stack has only had a two- to three-person operations team. With such a small team, we understand that we can’t have our hands on everything that happens in production. It just doesn’t scale, especially given how difficult it can be to hire engineers in this competitive market.
In this post, we’ll take a look at how you can better scale your organization by employing the DevOps best practice of giving engineers fundamental responsibility for their code.
Teach Them to Fish
At Threat Stack, we create tools that make it easy for engineers to get their code to production. We also make it simple to track the performance of their applications over time. It only makes sense that we give them total ownership over their applications.
But this isn’t always how things are done, especially at organizations that don’t employ DevOps best practices. Often companies will restrict their engineering teams from accessing production. They’ll cite “separation of duties” as though it is inherently more secure to keep engineers out of production. There are, of course, always exceptions to the rule, and some highly regulated industries do, in fact, require a true separation of duties. However, if your goal is to keep engineers out of production because “they’ll break stuff,” recognize that you are likely running their code blind anyway. It would be much better to teach them how to take ownership of the health and security of their code.
“If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.” — Carla Geisser, Google SRE
The part of this quote that I find the most interesting is the phrase “normal operations.” As Carla points out, your definition of normal will change as your systems scale and your company grows.
When we give our engineers access to production, we don’t give them a blank check to do whatever they want. But we do empower them to make decisions that ensure the health and security of their code. At Threat Stack, we know that giving engineers access to the systems required for this can be done safely.
How to Enable Responsibility Without Increasing Risk
We use role-based access controls to solve part of the security problem. For example, front end engineers aren’t given access to some of the back end distributed databases. Similarly, back end engineers won’t have access to other parts of the infrastructure that are not core to their jobs. This prevents mistakes and mitigates risk while still empowering engineers to take responsibility for their code.
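To make the idea concrete, here is a minimal sketch of a role-based access check. The role names, resource names, and the `can_access` helper are hypothetical and chosen for illustration only; they do not describe Threat Stack's actual configuration or tooling.

```python
from dataclasses import dataclass

# Hypothetical mapping of roles to the resources they may touch.
# Anything not listed for a role is denied by default.
ROLE_PERMISSIONS: dict[str, frozenset[str]] = {
    "frontend": frozenset({"web-app", "cdn"}),
    "backend": frozenset({"api-service", "distributed-db"}),
}


@dataclass(frozen=True)
class Engineer:
    name: str
    role: str


def can_access(engineer: Engineer, resource: str) -> bool:
    """Return True only if the engineer's role explicitly lists the resource."""
    allowed = ROLE_PERMISSIONS.get(engineer.role, frozenset())
    return resource in allowed


alice = Engineer(name="alice", role="frontend")
print(can_access(alice, "web-app"))         # frontend role includes web-app
print(can_access(alice, "distributed-db"))  # not listed for frontend, so denied
```

The deny-by-default lookup is the important design choice in this sketch: an unknown role or an unlisted resource results in no access, rather than falling through to a permissive default.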
Of course, that’s not to say that we don’t need to pay attention to what engineers are doing with production systems. One of the great perks of working for a security company is that we are able to use Threat Stack in a separated environment to protect our platform and verify user actions. Threat Stack gives us real-time visibility into user behaviors. This ensures that engineers are not doing anything that could affect the security or stability of the platform or our customer data without us knowing about it.
Many site reliability engineering and operations thought leaders have argued that the goal should be to restrict all access to production systems. I agree with this in theory, but it’s often not realistic. Eventually, tools such as Docker, CoreOS, and Kubernetes will move the industry towards a point where fewer and fewer users actually need “shell” access. These tools allow engineers to manage deployment of their applications without requiring elevated permissions on a host. In my opinion, however, stateful, distributed systems do not yet have the observability required to remove login access, so today you may still need to access the host running those systems to investigate problems or performance issues. In the near term at least, I don’t believe it’s possible to restrict all users from logging into servers (although it’s a great goal to aim for).
Improvements to Threat Stack’s role-based access control systems will eventually simplify the process of approving access. Our goal is to create a tool that will enable people to request access to a specific host or service. That process will notify the appropriate party via Slack, and the user can be granted time-based access. This will allow us to restrict access to times when engineers actually need it to debug or investigate a problem (similar to how companies like Netflix have tools like BLESS).
Additionally, we’ll have full logging to see who requested the access, and we’ll use Threat Stack to understand what they did on the host. Today, we already employ a similar process with API access to AWS. Engineers request access via a tool we’ve created and are given short, time-based credentials to access AWS and make changes.
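The request-and-expire flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool Threat Stack built and not how BLESS works internally: the `AccessGrant` class, the `grant_access` function, and the in-memory `audit_log` are all hypothetical, and the Slack notification and approval step are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AccessGrant:
    """A hypothetical time-bound grant for one user on one host or service."""
    user: str
    target: str
    reason: str
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        # Access is valid strictly before the expiry time.
        return now < self.expires_at


# Stand-in for real audit logging: who asked for what, when, and why.
audit_log: list[str] = []


def grant_access(user: str, target: str, reason: str,
                 minutes: int, now: datetime) -> AccessGrant:
    """Record the request and return a grant that expires after `minutes`."""
    grant = AccessGrant(user, target, reason, now + timedelta(minutes=minutes))
    audit_log.append(
        f"{now.isoformat()} {user} granted {target} "
        f"until {grant.expires_at.isoformat()}: {reason}"
    )
    return grant


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
grant = grant_access("alice", "db-host-1", "investigate slow queries", 30, start)
print(grant.is_active(start + timedelta(minutes=10)))  # inside the window
print(grant.is_active(start + timedelta(minutes=45)))  # after expiry
```

Passing `now` in explicitly, rather than reading the clock inside the functions, keeps the expiry logic easy to test. A real system would also need an approval step and enforcement on the host or credential itself.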
Good Operations = Good Security
We believe that good operations makes for good security. Reducing the scope of engineers’ access to systems reduces the noise if we ever have to investigate malicious activity.
Additionally, by having our engineers request access and by only granting it for a short period of time, we reduce potential operator error and make the platform more stable. As a final bonus, our security team is also happy about reducing the threat of stolen credentials.
The Team Culture and Mantra We Live By
In the early days, Threat Stack’s engineering team and leaders built a culture around how engineering and operations team members should work together. In a nutshell, we believe that anyone should be willing to lend a hand to their coworkers when they need help. Even if you are not a subject matter expert. Even if you didn’t write the code that is having problems.
Even today, everyone on the engineering team works under this contract, because they know that at some point in the future, they too will need help. It sounds simple, yet somehow this is the first company I’ve worked for that has managed to instill this level of responsibility and teamwork.
Like many companies that have grown over extended periods of time, Threat Stack often finds that the engineers who wrote a certain piece of code are not always the ones who must later support it. For example, I am not an expert on Node.js or Scala, but when someone runs into a problem with these services, I will review logs, dive into the source code, and check performance graphs to see if I can identify anything unusual.
Part of what has helped Threat Stack be successful is the fact that our teams share the belief that we are all working together to achieve the same goals.