How Threat Stack Does DevOps (Part II): Engineering for Rapid Change
This was originally posted on the Threat Stack blog - added here for continuity.
Many organizations struggle with how and when to deploy software. I’ve worked at companies that had a “deploy week”: at least a week, and sometimes longer, devoted entirely to deploying a huge batch of software. The changes were so large and complex that deploying them caused massive amounts of pain and suffering. It took hours every night for a week to push everything out, and it was too difficult to test all the changes one by one, so engineering and operations teams (not to mention customers) had to deal with broken updates until we could fix each one.
Additionally, because of the sheer volume of changes being deployed, the code was difficult to test. Systems would break in unforeseen ways, pulling engineering teams away from their work to fix the issues. Imagine losing your entire engineering organization for a week every time you push out new software and updates! If that happens once a month, every month, it becomes unsustainable fast.
Because I’d experienced this pain firsthand, I wanted Threat Stack to be different when it came to how and when we deploy code. That’s why we worked hard to embed DevOps best practices in our organization from the very beginning, starting with engineering for rapid change. In this post, I’ll walk you through what this means and why it is essential to doing DevOps well.
How Often Should You Deploy Code?
You often hear tech luminaries talk about “deploys per day” as if all companies should be aiming to deploy code dozens, hundreds, or thousands of times a day. Instead of focusing on volume, we believe that code should be (and can be) deployed when it’s ready.
Ready means different things to different organizations. Your company culture, business type, and security requirements should drive what “ready” means for you. For us, “ready” means that code has been:
- Reviewed by other engineers
- Put through a series of unit, integration, and functional tests
- Reviewed to ensure that it meets relevant business or security requirements (a rough sketch of automating these checks as a release gate follows this list)
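The post doesn’t prescribe a particular implementation, but a readiness definition like this can be encoded as an automated gate in a deploy pipeline. Here is a minimal sketch in Python; the class, field names, and hard-coded flags are illustrative assumptions, not Threat Stack’s actual tooling:

```python
# Hypothetical readiness gate: names and data sources are illustrative,
# not Threat Stack's actual implementation.
from dataclasses import dataclass


@dataclass
class ReadinessReport:
    peer_reviewed: bool            # approved by other engineers
    tests_passed: bool             # unit, integration, and functional suites
    requirements_signed_off: bool  # business/security review complete

    def is_ready(self) -> bool:
        return (self.peer_reviewed
                and self.tests_passed
                and self.requirements_signed_off)


def gate_release(report: ReadinessReport) -> None:
    """Fail the pipeline if the change doesn't meet the definition of 'ready'."""
    if not report.is_ready():
        raise SystemExit("Release blocked: readiness criteria not met.")
    print("Readiness checks passed; deploy may proceed.")


if __name__ == "__main__":
    # In a real pipeline these flags would come from your review tool,
    # CI results, and ticketing system rather than being hard-coded.
    gate_release(ReadinessReport(peer_reviewed=True,
                                 tests_passed=True,
                                 requirements_signed_off=True))
```

The point of a gate like this isn’t the code itself; it’s that “ready” becomes something the pipeline can check mechanically instead of something each engineer remembers to verify by hand.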
One of our early (as in pre-company launch) projects at Threat Stack was to ensure that code traveled through a continuous deployment platform, including providing engineers with an automated way to deploy code. We were (and still are) building a large, distributed application that processes massive amounts of data in real time. We are also a security company, so we naturally have to spend time thinking about how to improve our security posture. For those reasons, we needed to build a simple way for engineers to deploy their code when it was ready to go. It had to give engineers the ability to quickly and safely get their code to the right place at the right time.
What we built was a simple pipeline integrated into our existing build tool, Jenkins. It gave engineers an easy button that deployed their code to production when it was ready to go. The pipeline abstracted away all the complexity: it took their code, built the package, signed it, and deployed it to our asset repository, allowing us to move faster and more securely than ever before.
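The post describes the pipeline only at a high level (build the package, sign it, publish it to an asset repository, then roll it out). The sketch below shows what those stages might look like if scripted in Python; every command, path, and URL is a placeholder assumption, not the actual Threat Stack pipeline:

```python
# Illustrative "easy button" deploy steps; commands, paths, and the
# repository URL are placeholders, not the actual Threat Stack pipeline.
import subprocess


def run(cmd: list[str]) -> None:
    """Run a shell command and stop the pipeline on any failure."""
    print(f"$ {' '.join(cmd)}")
    subprocess.run(cmd, check=True)


def deploy(service: str, version: str) -> None:
    package = f"{service}_{version}.deb"

    # 1. Build the package from the checked-out source.
    run(["make", "package", f"VERSION={version}"])

    # 2. Sign the artifact so hosts can verify its origin before installing.
    run(["gpg", "--detach-sign", "--armor", package])

    # 3. Publish the package and its signature to the asset repository.
    run(["curl", "--fail", "-T", package,
         f"https://assets.example.internal/{service}/{package}"])
    run(["curl", "--fail", "-T", f"{package}.asc",
         f"https://assets.example.internal/{service}/{package}.asc"])

    # 4. Hand off to whatever rolls the new version out to hosts
    #    (configuration management, orchestration, etc.).
    run(["./roll-out.sh", service, version])


if __name__ == "__main__":
    deploy("example-service", "1.2.3")
```

Wrapping these steps behind a single entry point is what makes the “easy button” possible: the engineer supplies a service and a version, and everything else happens the same way every time.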
Giving Engineers Responsibility for Readiness
As a bonus, this approach to releases also let our engineers take responsibility for the act of deploying the software when — and only when — it met our readiness criteria. Etsy, Netflix, and countless other software companies have been deploying code in this manner for a long time. We were not breaking new ground, except in the sense that we made sure that this was the first problem we solved.
Because of my previous experiences, I knew that if we didn’t make time during the pre-launch stage to optimize software deployment, we likely never would. This proved to be true: shortly after launch, we were able to focus much of our time on scaling to keep up with our customers, rather than figuring out how to speed up deployment.
Additionally, from an availability perspective, we are interested in increasing developers’ knowledge of the system, so that it’s obvious when a release goes poorly. To enable this level of knowledge, we have focused on making it simple for anyone to add metrics to their applications (both internal and external). For this reason, metrics and observability have always played a major role in platform updates for Threat Stack. (We’ll cover both metrics and engineering responsibility in depth in later posts in this series.)
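The post doesn’t name a metrics stack, but “simple for anyone to add metrics” might look something like the sketch below, which uses the open-source `statsd` Python client as a stand-in; the client choice, metric names, and prefix are assumptions for illustration:

```python
# Minimal sketch of low-friction application metrics; the statsd client and
# metric names here are illustrative choices, not Threat Stack's stack.
import time

import statsd  # pip install statsd

metrics = statsd.StatsClient("localhost", 8125, prefix="example_service")


def handle_event(event: dict) -> None:
    start = time.monotonic()
    try:
        process(event)                       # the actual business logic
        metrics.incr("events.processed")     # count successes
    except Exception:
        metrics.incr("events.failed")        # count failures so a bad release is obvious
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        metrics.timing("events.latency_ms", elapsed_ms)


def process(event: dict) -> None:
    pass  # placeholder for real work
```

With counters and timings like these emitted continuously, a bad release shows up as a visible change in the failure or latency curves right after a deploy, which is exactly the kind of system knowledge we want engineers to have.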
Tooling for Optimal Deployment
Our early ideas about how best to deploy software have held up well as Threat Stack has grown. Our engineers still follow the same process to deploy their applications that we implemented in the early days. However, some of the tools behind the scenes have changed.
For example, in May 2017, we moved from Jenkins to GitLab, and have enjoyed how tightly its source code management and build systems integrate. This has allowed us to add valuable elements to our build/release process, such as more unit testing, more integration testing, and continuous QA testing. We have even built tools that integrate with GitLab and Jira to ensure that tickets are in the proper state before release. Finally, we have automated the process of ensuring SOC 2 compliance as well, using a tool we call sockembot. All of this helps engineers ensure that code meets our definition of “ready” before it is released.
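Sockembot itself isn’t public, but the kind of pre-release check described here (is the ticket in the proper state?) can be sketched against the standard Jira REST API. In the example below, the Jira URL, credential environment variables, and the set of accepted statuses are all assumptions:

```python
# Illustrative pre-release check that a Jira ticket is in an allowed state.
# The Jira URL, credentials, and required statuses are placeholders.
import os
import sys

import requests  # pip install requests

JIRA_BASE = "https://jira.example.internal"
ALLOWED_STATUSES = {"Ready for Release", "Done"}


def ticket_status(key: str) -> str:
    """Fetch the current workflow status of a Jira issue."""
    resp = requests.get(
        f"{JIRA_BASE}/rest/api/2/issue/{key}",
        params={"fields": "status"},
        auth=(os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"]),
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["fields"]["status"]["name"]


def main() -> None:
    key = sys.argv[1]  # e.g. "PROJ-123", passed in by the CI job
    status = ticket_status(key)
    if status not in ALLOWED_STATUSES:
        sys.exit(f"{key} is '{status}', not one of {sorted(ALLOWED_STATUSES)}; blocking release.")
    print(f"{key} is '{status}'; release may proceed.")


if __name__ == "__main__":
    main()
```

Run as a CI job before the deploy stage, a check like this turns “the ticket should be in the right state” from a convention into a hard requirement, the same way the rest of the readiness criteria are enforced.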
How Engineering for Rapid Change Enables DevOps Success
You can’t say that you are successfully “doing DevOps” at your organization unless you are able to deploy code rapidly, securely, and automatically. That’s why engineering for rapid change is an excellent place to start when it comes to making DevOps a reality at your organization.
In the next post, we’ll take a look at how you can measure and optimize for system health to further reap the benefits of DevOps at your organization.