How Threat Stack Does DevOps (Part III): Measuring and Optimizing System Health
This was originally posted on the Threat Stack blog - added here for continuity.
One of the most important things that any company can do to benefit from DevOps is define and implement useful, actionable metrics for visibility into business operations.
This is already standard practice in most areas of the average organization. KPIs drive sales and marketing teams, finance groups, and even HR. Yet, at many companies, having metrics for the application that brings in the money is an afterthought — or is not prioritized at all.
In this post, we’ll take an in-depth look at why application and infrastructure metrics should be baked into your engineering organization as early as possible, how to do it, and what tools can enable your success around this key area of DevOps.
Why You Should Prioritize Metrics
Many years ago, at a previous company, I ran a large infrastructure team supporting thousands of customers, and we had zero telemetry. Without any kind of instrumentation, we were flying blind. Based on that painful experience, I made sure from the earliest days that Threat Stack prioritized defining and collecting metrics. We didn’t have the time or people to build our own metrics platform in-house, so we adopted Librato. They made it very easy to get quick insights into the health of our platform.
A few months later, when we launched the Threat Stack platform, the metrics we spent time defining really started to come in handy, as they helped us scale our platform after launch.
Maybe you have heard the quote, “No battle plan survives first contact with the enemy.” Well, no product survives first contact with the customer, either.
In other words, for anyone who has not had the pleasure of working on a new product release, the only certainty is that something will fail. Failure is the reality of software and computers. With a brand-new system, you have no understanding of:
- Key load patterns
- Scaling points
- What will fail
- Real-life customer usage patterns
You can’t do much to prepare; your best bet is to identify problems quickly and fix them as soon as possible.
For this reason, in those early days, tools like Librato gave us invaluable insight into the health of the platform. As operations engineers, we invested time in teaching the rest of the engineering organization how to use those tools. Using a hosted metrics provider allowed us to focus our limited time on making the product experience better for our customers, rather than scaling our metrics collection.
When to Invest in Metrics Collection
As the years went on, the cost of a hosted metrics provider became prohibitive, because we needed to collect more granular metrics. Fortunately, by that time we had enough engineering expertise and time to invest in building our own metrics collection system.
We chose Graphite to address this challenge for a few reasons. For starters, Graphite has been around long enough to have a large community of people with experience scaling it, and it has active developers. I also had experience scaling Graphite systems. Knowledge and comfort shouldn’t be the only reasons to pick a technology, but if it also meets your goals, they can be a bonus.
We also started using tools like collectd and statsd to make it easy for engineers to add telemetry to their applications. We provided simple tools so they could get their metrics to Graphite as easily as possible. We also conduct regular training sessions to teach people how to use tools like Grafana to create graphs and dashboards with their data. Ensuring that the engineering team knows how to get value from their metrics is a win for operational visibility.
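To make this concrete, here is a minimal sketch of what that kind of instrumentation can look like, using the community Python statsd client. The host, port, and metric names are illustrative assumptions rather than our actual configuration:

```python
# A minimal sketch of emitting application telemetry via statsd.
# Host, port, and metric names are illustrative assumptions.
import time

import statsd

# statsd speaks UDP, so emitting metrics is cheap and fire-and-forget.
client = statsd.StatsClient(host="localhost", port=8125, prefix="myapp")

def handle_request(payload):
    client.incr("requests.received")          # simple counter
    start = time.time()
    try:
        process(payload)                      # your application logic
        client.incr("requests.succeeded")
    except Exception:
        client.incr("requests.failed")
        raise
    finally:
        # millisecond timing that statsd aggregates into percentiles
        client.timing("requests.duration_ms", (time.time() - start) * 1000)

def process(payload):
    # placeholder for real work
    time.sleep(0.01)
```

A few lines like these, plus a shared Grafana dashboard, are usually all an engineer needs to see how a new feature behaves in production.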
Now, when engineers ship new projects or features to production, they produce their own metrics and dashboards. This lets them customize their view of the application and understand how it responds when they deploy new features or other improvements. It also gives everyone in the business a place to go to understand a scaling event, so you don’t have to be an expert in any particular part of the application to start investigating a problem.
Today, our metrics infrastructure collects approximately 55,000 metrics per second. One interesting side effect of making such a large investment in time series metrics is that we don’t depend heavily on tools like Sumo Logic, Splunk, or the ELK stack for environment visibility. When engineers want to know more about what their application is doing, they add some statsd counters and can quickly visualize the answer.
How to Gain Deeper Insights
Of course, not all useful data fits into a neat, numerical metric. For example, log data can also provide helpful insight into the platform. Last year, we implemented Graylog to collect, monitor, and alert on those logs in a more structured manner. In certain cases, we’ve created “streams” that allow product and support teams to view important log data. This is an example of how we have built our technical operations team as a self-service organization. Our TechOps team builds the tools that enable the rest of the teams, from engineering to QA, to be effective in their roles.
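As a rough illustration, here is what shipping a structured log event to Graylog over GELF might look like with the graypy client library; the hostname, port, and field names are assumptions made for the sake of the example:

```python
# A minimal sketch of sending structured log events to Graylog over GELF,
# using the graypy client library. Host, port, and field names are
# illustrative assumptions.
import logging

import graypy

logger = logging.getLogger("billing-service")
logger.setLevel(logging.INFO)

# GELF over UDP; Graylog's default GELF UDP input listens on port 12201.
logger.addHandler(graypy.GELFUDPHandler("graylog.internal", 12201))

# Extra fields become searchable attributes in Graylog, which is what
# makes stream rules (e.g. routing billing events to a product-facing
# stream) possible.
logger.info(
    "invoice generated",
    extra={"customer_id": "c-1234", "invoice_total_cents": 4599},
)
```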
More recently, we’ve investigated new application tracing tools like Zipkin, hoping to obtain even deeper insight into requests between applications. Tools like Zipkin and other tracing software offer insight into an entire request as it navigates through our platform. This helps us see when customers make API calls into our infrastructure and monitor for performance issues.
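As a sketch of what that instrumentation can look like, here is a root Zipkin span created with the py_zipkin library; the service name, span name, collector URL, and sample rate are illustrative assumptions, not our production setup:

```python
# A minimal sketch of wrapping a request handler in a Zipkin span using
# the py_zipkin library. Service name, span name, collector URL, and
# sample rate are illustrative assumptions.
import requests
from py_zipkin.zipkin import zipkin_span

ZIPKIN_URL = "http://zipkin.internal:9411/api/v1/spans"

def http_transport(encoded_span):
    # py_zipkin hands us an encoded span; forward it to the collector.
    requests.post(
        ZIPKIN_URL,
        data=encoded_span,
        headers={"Content-Type": "application/x-thrift"},
    )

def handle_api_call(payload):
    # Every request through this handler produces a span, so a slow
    # customer API call can be followed across downstream services.
    with zipkin_span(
        service_name="api-gateway",
        span_name="handle_api_call",
        transport_handler=http_transport,
        sample_rate=10.0,  # trace roughly 10% of requests
    ):
        return do_work(payload)

def do_work(payload):
    # placeholder for real application logic
    return {"ok": True}
```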
Metrics: The Sooner the Better
As with most cultural and operational changes, implementing strong tools and processes early on is ideal. What you choose to optimize for in the early days often affects your engineering culture as you grow. Temporary fixes often live longer than expected.
Of course, this advice is little help to organizations that find themselves playing catch-up. My best advice for those organizations is to use new projects to drive the way you want your infrastructure to look going forward, and then use those new patterns to modernize your old infrastructure. Trying to add telemetry and tracing to legacy projects will usually end badly. Having a well-defined and understood future state will help you go back and modernize legacy applications. Creating a standard around new projects lets you test out technologies in a safe way. Then you can better understand the health of your applications so you can stay ahead of performance issues, customer issues, and scale.
If you want to learn more about how to succeed with monitoring and telemetry, check out the talk I gave at Monitorama a few years ago.
Finally, remember that, if you want your Devs to do more Ops, you need to build consumable services for your developers. You need to make sure they have appropriate ownership and accountability for the software they deliver to customers. The bottom line: Engineers build better software when they are responsible for its health. We’ll cover this topic in depth in the next post in our series.