The Business of DevOps
DevOps is a unique discipline.
It is hard to define.
If you ask ten DevOps engineers how to define it, you will get ten different answers.
There are DevOps engineers who live and breathe Infrastructure as Code. But then there are DevOps engineers who hardly touch it.
Some of us know data science very well. Some of us know how to do application development.
“Jack of all trades” is a term thrown around a lot in DevOps. As a result, many people don’t know what a DevOps engineer does because, frankly, we do almost anything.
However, there is one intersection in DevOps that I think is overlooked quite often, and I feel like it is the most important one. It is the intersection of business and technology.
Look at the DORA metrics: deployment frequency, lead time for changes, change failure rate, and mean time to recover. If you look closely enough, each one is actually a key performance indicator (KPI) for the business.
Let’s dive into each.
Deployment Frequency
Deployment Frequency measures how often the business delivers new functions or capabilities to end users. The business makes an “investment” by hiring developers to build applications that serve customers.
Each feature that a developer completes has a business value attached to it. That value could be revenue generating, revenue retaining, or cost reducing.
The faster a feature is shipped into production, the sooner the business gets a return on that investment.
To contextualize this, think about the effects of compound interest. A $10,000 investment at a 6% annual rate of return earns about $1.64 per day during its first year. After 30 years of compounding, it earns about $9.44 per day, nearly a 6x increase.
This illustrates the importance of capturing a return on investment as quickly as possible. The sooner it starts generating money, the faster the money will grow.
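The compound interest arithmetic is easy to check for yourself. Here is a minimal sketch, assuming annual compounding and a 365-day year; the function name `daily_return` is only illustrative.

```python
def daily_return(principal: float, annual_rate: float, years_elapsed: int) -> float:
    """Average daily return in the year after `years_elapsed` years of annual compounding."""
    # Balance after the elapsed years, with all returns reinvested.
    balance = principal * (1 + annual_rate) ** years_elapsed
    # One year's return on that balance, spread evenly over 365 days.
    return balance * annual_rate / 365


first_year = daily_return(10_000, 0.06, 0)
after_30_years = daily_return(10_000, 0.06, 30)

print(f"First year:     ${first_year:.2f} per day")
print(f"After 30 years: ${after_30_years:.2f} per day")
print(f"Growth:         {after_30_years / first_year:.1f}x")
```

Try changing `years_elapsed` to see how much a delayed start costs. That gap is what a late feature gives up.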
When a business ships a new feature, the goal is to
- gain new users
- keep existing users
- upsell existing users
The more money the business makes, the more it can reinvest into its products to deliver more value to customers, which generates more money.
Speed is key.
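If you have never measured deployment frequency, it can be as simple as counting production deployments per week. The sketch below assumes you can export a list of deployment dates from your pipeline; the dates here are made up for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical production deployment dates, for illustration only.
deployments = [
    date(2024, 1, 2), date(2024, 1, 4), date(2024, 1, 5),
    date(2024, 1, 9), date(2024, 1, 11),
    date(2024, 1, 16),
]

# Group deployments by ISO (year, week) and count each group.
per_week = Counter(d.isocalendar()[:2] for d in deployments)

# Simplification: weeks with no deployments at all are not counted here.
average_per_week = len(deployments) / len(per_week)

for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} deployments")
print(f"Average: {average_per_week:.1f} deployments per week")
```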
Lead Time for Changes
Lead Time for Changes is another speed metric, but a different kind.
DORA formally measures it from code commit to production. Taking a wider business view, it reflects how long it takes for the business to
- identify the need for a new feature
- articulate the new feature in technical terms
- queue the feature for development
- develop the feature
- ship the feature to production
From a business perspective, this measures how well the internal processes are working to capture, define, develop, and ship features.
But a change is not a feature. A feature could consist of dozens of changes. The Lead Time for Changes metric tells the business the average time it takes to ship each change.
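As a rough illustration, the average can be computed from two timestamps per change: when work on the change started (for example, its first commit) and when it reached production. The timestamps below are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Hypothetical (started, deployed) timestamps for three changes.
changes = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 2, 9, 0)),    # 24 hours
    (datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 16, 0)),  # 6 hours
    (datetime(2024, 3, 5, 8, 0), datetime(2024, 3, 7, 14, 0)),   # 54 hours
]

# Lead time of each change, in hours.
lead_times_hours = [
    (deployed - started).total_seconds() / 3600
    for started, deployed in changes
]

average_lead_time_hours = mean(lead_times_hours)
print(f"Average lead time: {average_lead_time_hours:.1f} hours")
```

The spread between the shortest and longest change matters as much as the average, since a wide spread makes estimates hard to trust.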
Just as with deployment frequency, the speed at which the business can complete this process directly affects the top line.
But lead time for changes also impacts the bottom line, or cost.
Think about it this way: each feature has a cost associated with it. Every product owner, developer, manager, and executive stakeholder has to participate in the feature realization process, from conception to birth. Every one of those meetings, emails, documents, false starts, external dependencies, and issue tracker updates costs the business money.
The longer it takes to develop a feature, the more costly it is to the business.
As an example, imagine it costs the business $10,000 to develop a feature it believes will attract more users. The feature must generate more than $10,000 before the business sees a return on the investment. Now imagine that the feature is stuck in analysis paralysis. Several meetings take place to discuss the feature. Multiple developers are consulted. Multiple product owners provide feedback. All of a sudden, the cost of the feature has ballooned to $20,000. At the same rate of revenue, it will now take the business twice as long to break even.
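To put numbers on that, here is a small sketch. The $2,000 of monthly revenue is an assumption made up for this example, as is the function name.

```python
def months_to_break_even(feature_cost: float, monthly_revenue: float) -> float:
    """Months of revenue needed to pay back the cost of building the feature."""
    return feature_cost / monthly_revenue


MONTHLY_REVENUE = 2_000  # assumed revenue the feature brings in each month

planned = months_to_break_even(10_000, MONTHLY_REVENUE)
ballooned = months_to_break_even(20_000, MONTHLY_REVENUE)

print(f"Planned cost:   break even after {planned:.0f} months")
print(f"Ballooned cost: break even after {ballooned:.0f} months")
```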
But let’s talk about an even more realistic example.
Jimmy The Salesman is talking to a potential customer. They like the product, but it’s missing a key feature that they need in order to sign up. Jimmy goes back to the engineering team and asks how long it would take to implement the requested feature. They give him a 10-week estimate, which the customer accepts as reasonable. The customer then signs a contract for $5 million, contingent on the timely delivery of the feature.
Lead time for changes just became much more important to the business!
This metric helps hold engineering teams accountable to the estimates they provide when developing a feature. The more consistent the lead time for changes is, the more accurate effort estimates become.
Change Failure Rate
Next up is the Change Failure Rate metric.
We all know downtime is bad, but why is a failed change bad?
You just roll things back, right? And if you have advanced deployment methodologies like blue/green deployments and the entire thing is automated with no downtime, it’s not that big of a deal, right?
If you take only one thing away from this article, let it be this: software companies make money by shipping new features.
These new features acquire new customers, upsell existing customers, or retain existing customers. When the business fails to ship new features, it misses new revenue opportunities and puts existing revenue streams at risk!
Change failures happen for all kinds of reasons, but all of those reasons create additional cost for the business. Triage must take place. Developers have to debug and develop bug fixes. New tests have to be written. All of this work must be coordinated and carried out by people, who draw salaries.
This is why change failure rate is such an important business metric.
Not only does it decrease potential revenue, it also increases cost. Double whammy.
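Both halves of that double whammy can be estimated. The sketch below uses a hypothetical deployment log, and the hours per failure and hourly rate are assumptions you would replace with your own figures.

```python
# Hypothetical log of ten deployments: True means the change failed in production.
deployment_failed = [False, False, True, False, False,
                     False, True, False, False, False]

failures = sum(deployment_failed)
change_failure_rate = failures / len(deployment_failed)

# Assumed rework cost per failed change (triage, debugging, fix, new tests).
ENGINEER_HOURS_PER_FAILURE = 6
HOURLY_RATE = 75  # assumed fully loaded cost per engineer hour, in dollars

rework_cost = failures * ENGINEER_HOURS_PER_FAILURE * HOURLY_RATE

print(f"Change failure rate: {change_failure_rate:.0%}")
print(f"Estimated rework cost: ${rework_cost:,}")
```

The rework cost is only the visible part. The revenue delayed while the fix is in progress does not show up in this number.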
Mean Time to Recover
The last metric is Mean Time To Recover (MTTR).
As the name suggests, this metric measures how long it takes the business to
- identify a problem
- diagnose the problem
- restore service
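Measured end to end, those three steps run from the moment a failure starts to the moment service is restored. Here is a minimal sketch of the calculation, using hypothetical incident timestamps.

```python
from datetime import datetime, timedelta

# Hypothetical incidents as (failure started, service restored) pairs.
incidents = [
    (datetime(2024, 5, 1, 12, 0), datetime(2024, 5, 1, 12, 30)),    # 30 minutes
    (datetime(2024, 5, 9, 3, 15), datetime(2024, 5, 9, 5, 15)),     # 120 minutes
    (datetime(2024, 5, 20, 18, 0), datetime(2024, 5, 20, 18, 45)),  # 45 minutes
]

# Add up the downtime of every incident, then divide by the incident count.
total_downtime = sum(
    (restored - started for started, restored in incidents),
    timedelta(),
)
mttr = total_downtime / len(incidents)

print(f"Total downtime: {total_downtime}")
print(f"MTTR: {mttr}")
```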
DevOps engineers are tasked with building highly resilient systems that rarely fail.
But what happens when they inevitably do fail?
The first term most engineers learn in operations is Service Level Agreement (SLA). This is a contractual agreement between a business and its customers that defines a level of service the business commits to, most commonly the percentage of time the service is up and available.
There are many types of SLAs, covering everything from UI response latency to transfer speeds to transactions per second. But violating any of them has the same result: lost profit.
Customers receive credit on their accounts when an SLA violation takes place. That credit is a financial liability and can only be recovered by generating new revenue.
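To see how quickly downtime turns into money, consider a simplified example. Everything in this sketch is an assumption for illustration: a 99.9% monthly availability target, a flat 10% credit on the monthly fee for any breach, and a 30-day month. Real contracts usually define their own tiers and measurement rules.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def availability(downtime_minutes: float) -> float:
    """Fraction of the month the service was available."""
    return (MINUTES_PER_MONTH - downtime_minutes) / MINUTES_PER_MONTH


def sla_credit(downtime_minutes: float, monthly_fee: float,
               target: float = 0.999, credit_rate: float = 0.10) -> float:
    """Credit owed under a hypothetical flat-rate SLA; zero if the target was met."""
    if availability(downtime_minutes) >= target:
        return 0.0
    return monthly_fee * credit_rate


credit_within_target = sla_credit(30, monthly_fee=50_000)
credit_after_breach = sla_credit(65, monthly_fee=50_000)

print(f"30 minutes down: ${credit_within_target:,.2f} credit")
print(f"65 minutes down: ${credit_after_breach:,.2f} credit")
```

Under these assumptions, the error budget for the month is about 43 minutes. A single slow recovery can use all of it.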
Unlike with a failed change that can be recovered later, SLA payouts are total losses.
To make matters worse, depending on how a payout is classified under local tax rules, it may not be deductible the way ordinary business expenses are. Another potential double whammy!
If you want to get on a CFO’s bad side, be responsible for causing an outage that generates a large SLA payout.
In conclusion, take your engineer hat off every once in a while and remember to wear your business cap. You may find that it comes more naturally than you realize.
Once you feel comfortable thinking in these terms, you should begin speaking like this with your peers and your manager. I have found that it brings more meaning to my job when I know the underlying business fundamentals of the decision-making process.
When you can frame questions about what you are doing and why you are doing it in business terminology, you will be surprised to hear the answers you receive.
This kind of mindset is what sets a principal developer apart from a senior developer.