Measuring the Productivity of Developers
McKinsey: Yes, you can measure software developer productivity
McKinsey adds five metrics to the DORA and SPACE metrics (see also the book Accelerate by Nicole Forsgren et al.). It claims that these additional metrics are much better suited to measuring the productivity and performance of organisations, teams and developers. McKinsey targets non-technical C-level executives like CEOs and CFOs. The goal is to make the performance of software engineers as easily measurable as the performance of salespeople, recruiters and factory workers.
Gergely Orosz and Kent Beck teamed up to write an insightful critique of the McKinsey article in their newsletters The Pragmatic Engineer and Software Design: Tidy First? Before I summarise and comment on Orosz’s and Beck’s responses, I’ll try to give you a quick overview of the McKinsey metrics.
Inner/outer loop time spent. McKinsey defines an inner loop for development activities (code, build, test) and an outer loop for non-development activities (deploy at scale, security and compliance, integrate, meetings). They measure the time developers spend in each loop. Top-performing companies should ensure that developers spend more than 70% of their time in the inner loop.
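McKinsey doesn’t publish a reference implementation for this metric, so here is a minimal sketch of how the inner-loop share could be computed, assuming time-tracking entries tagged with the loop activities (the labels and numbers are invented):

```python
# Hypothetical time-tracking categories; McKinsey publishes no reference
# implementation, so the labels and numbers below are illustrative only.
INNER_LOOP = {"code", "build", "test"}
OUTER_LOOP = {"deploy", "security", "compliance", "integrate", "meetings"}

def inner_loop_share(entries: list[tuple[str, float]]) -> float:
    """Fraction of tracked time spent on inner-loop activities."""
    inner = sum(hours for activity, hours in entries if activity in INNER_LOOP)
    total = sum(hours for activity, hours in entries
                if activity in INNER_LOOP | OUTER_LOOP)
    return inner / total if total else 0.0

week = [("code", 18), ("test", 6), ("build", 2), ("meetings", 8), ("deploy", 3)]
print(f"{inner_loop_share(week):.0%}")  # 70% - right at McKinsey's threshold
```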
Developer velocity index benchmark (DVI). The DVI measures the maturity of an organisation with respect to technology, working practices and organisational enablement. Based on the DVI, a company can compare itself with its peers. The DVI has seven categories from low to high developer velocity, that is, from worst to best performing.
The capabilities yielding the most return on investment are Empowering developers with world-class tools, Creating a culture that fosters psychological safety, Creating a comprehensive product-management function and Focusing talent management on the developer experience (see Exhibit 3 in the article Developer Velocity: How software excellence fuels business performance).

Contribution analysis measures how many code changes, tests, documentation changes, backlog items and other easily measurable things an individual or team contributes. It can highlight performance problems.
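How those contributions get counted is left open by McKinsey; a rough sketch of what such counting could look like from git history - the path conventions (tests/, docs/) are my assumptions:

```python
# A sketch of contribution counting from git history. McKinsey doesn't
# specify an implementation; the path conventions below are assumptions.
import subprocess
from collections import Counter

def changed_files(author: str, since: str = "1.month") -> list[str]:
    """File paths touched by the author's commits in the given period."""
    out = subprocess.run(
        ["git", "log", f"--author={author}", f"--since={since}",
         "--name-only", "--format="],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]

def contributions(author: str) -> Counter:
    """Bucket the author's changed files into tests, docs and code."""
    counts: Counter = Counter()
    for path in changed_files(author):
        if path.startswith("tests/"):
            counts["tests"] += 1
        elif path.startswith("docs/"):
            counts["documentation"] += 1
        else:
            counts["code changes"] += 1
    return counts
```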
Talent capability score. This “score is a summary of the individual knowledge, skills, and abilities of a specific organisation. Ideally, organisations should aspire to a ‘diamond’ distribution of proficiency, with the majority of developers in the middle range of competency.”
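What a “diamond” distribution means in practice is that the middle band of competency holds the absolute majority. A toy check, with made-up band sizes:

```python
# A toy check of the "diamond" distribution: the middle competency band
# should hold the absolute majority. The band sizes are made up.
def is_diamond(junior: int, middle: int, senior: int) -> bool:
    return middle > junior + senior

print(is_diamond(20, 55, 25))  # True: 55 of 100 developers are mid-range
```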
Retention is the percentage of people staying in an organisation per year. Attrition is the opposite: the percentage leaving.
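Retention and attrition are complementary ratios; a trivial sketch with invented numbers:

```python
# Retention and attrition as complementary ratios; numbers are invented.
def retention(headcount_start: int, still_employed: int) -> float:
    return still_employed / headcount_start

r = retention(120, 102)
print(f"retention {r:.0%}, attrition {1 - r:.0%}")  # retention 85%, attrition 15%
```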
Part 1: Response by Gergely Orosz and Kent Beck
Orosz and Beck respond with a mental model of the software engineering cycle: effort produces output, output leads to outcomes, and outcomes create impact.
Effort includes planning, designing, coding, testing and shipping.
Output is the code and tests written, the user and developer documentation, and the feature in the product.
When customers behave differently in response to an output, you observe an outcome. For example, users can do their jobs faster or with fewer errors because of a new feature.
The impact is the “value flowing back to [the software development organisation] like feedback, revenue, referrals” caused by an outcome.
I was a bit surprised by the definition of impact. I would have expected it to refer to the value for the customer, not to the value for one’s own organisation. After all, XP and Scrum strive to maximise customer value. After some pondering, it clicked: Orosz and Beck write from their experience at companies like Uber and Meta, which develop software for their own end users. There, the value for one’s own organisation is the same as the customer value.
For consultancies, these two values differ. Consultancies develop software for a customer, who releases the software - with some additions - to its end users. Having an intermediary between the consultancy and the end users makes measuring productivity even more difficult. Now I fully understand why effort-based hourly rates are so easy and why value-based pricing - best based on outcome and impact metrics - is so hard.
The authors apply the model to the DORA metrics:
Deployment frequency, lead time for changes and mean time to recover measure outcomes.
Change failure rate measures impact.
Compare this to the McKinsey metrics:
Inner/outer loop time spent, developer velocity index benchmark and talent capability score measure effort.
Contribution analysis measures output.
Retention measures impact.
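To keep both lists straight, here is Orosz’s and Beck’s classification restated as a small lookup table (nothing added, just the mapping above in code form):

```python
from enum import Enum

Category = Enum("Category", "EFFORT OUTPUT OUTCOME IMPACT")

# Orosz's and Beck's classification: DORA metrics first, then McKinsey's.
METRICS = {
    "deployment frequency":        Category.OUTCOME,
    "lead time for changes":       Category.OUTCOME,
    "mean time to recover":        Category.OUTCOME,
    "change failure rate":         Category.IMPACT,
    "inner/outer loop time spent": Category.EFFORT,
    "developer velocity index":    Category.EFFORT,
    "talent capability score":     Category.EFFORT,
    "contribution analysis":       Category.OUTPUT,
    "retention":                   Category.IMPACT,
}
```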
One critique of McKinsey’s system we have is that nearly every one of its custom metrics […] measure effort or output.
What’s wrong with this approach? First, the only folks who care about these metrics are the people collecting them. Customers don’t care. Executives don’t care. Investors don’t care. Second, and most crucially, collecting & evaluating these metrics hinders the measures downstream folks actually do care about, like profitability.

Gergely Orosz and Kent Beck, Measuring developer productivity (Part 1)
Effort and output metrics are easy to measure but also easy for developers to game. If you count the number of tests written, as prescribed by contribution analysis, you will get a lot of alibi tests. It takes another developer to spot the alibi tests. Then the two developers will argue in front of a less technically minded manager about whether the tests are useful. More often than not, the second developer will lose this argument and give up in desperation after several tries.
The metric inner/outer loop time spent is also easy to game: just don’t care about integration, security or deployment, because they are someone else’s responsibility. These outer-loop activities are then executed sloppily or outsourced to expensive specialists. That may be fine for the individual developer’s numbers, but it is short-sighted for the organisation.
Outcome and impact metrics are harder to measure and harder to attribute to individuals than effort and output metrics. In other words, they are better suited for measuring teams and organisations. The DORA metrics are a good example.
We urge engineering leaders to look to outcome and impact measurements, to identify what to measure. It’s true that it is tempting to measure effort. But there’s a reason why sales and recruitment teams are not judged by their performance in being in the office at 9am sharp, or by the number of emails sent – which are both effort or output.
Gergely Orosz and Kent Beck, Measuring developer productivity (Part 1)
The most important thing about metrics is this: when bosses start using productivity metrics for performance evaluation - as McKinsey suggests - the metrics will be gamed and become useless. Use productivity metrics only to identify and remove impediments.
If you - as a developer - want to measure and improve your productivity, the authors have two tips for you.
Aim to only have one red test at a time when using test driven development (TDD). This approach measures both effort and output. Get to the point where you can confidently, deliberately and consistently have one red test, when you expect one red test.
Set a goal to merge a pull request every day, and track this goal over a week. This measure includes both effort and output. This goal forces you to do smaller commits, which are easier to review and get signed off quicker. It also pushes you to write code that’s correct and follows team standards.
Gergely Orosz and Kent Beck, Measuring developer productivity (Part 1)
The first approach ensures that your code conforms to the single-responsibility principle. The second approach shortens the lead time for changes and increases the deployment frequency - two of the DORA metrics. It also forces you to stick to one of the principles of Continuous Delivery: work in small steps.
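If you want to track the pull-request goal yourself, the local git history is enough. A minimal sketch, assuming merged pull requests show up as merge commits on your main branch (GitHub’s default merge strategy; squash merges would need a different query):

```python
# Count merged pull requests per day from the local git history. Assumes
# merged PRs appear as merge commits on the main branch (GitHub's default
# merge strategy; squash merges would need a different query).
import subprocess
from collections import Counter

def merges_per_day(since: str = "1.week") -> Counter:
    out = subprocess.run(
        ["git", "log", "--merges", f"--since={since}", "--format=%as"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(out.split())

for day, count in sorted(merges_per_day().items()):
    print(day, count)
```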
Part 2: Response by Gergely Orosz and Kent Beck
Team performance always beats individual performance. For example, salespeople tend to optimise their own bonuses. They may overlook a much bigger sale for their company because they don’t work together with their colleagues from other regions or industries. What’s best for an individual is often not best for the organisation as a whole.
Individual performance does not directly predict team performance. […]
Team performance is easier to measure than individual performance. […]

Gergely Orosz and Kent Beck, Measuring developer productivity (Part 2)
This has been known in systems thinking for decades. Just consider a group of people - a team, organisation or company - as a system.
[5:14] A system is not the sum of the behaviour of its parts, but it’s the product of their interactions.
[5:27] If we have a system of improvement that’s directed at improving the parts taken separately, you can be absolutely sure that the performance of the whole will not be improved.
Russell L. Ackoff, If Russ Ackoff had given a TED Talk (video), 1994.
When the big bosses ask “How can we measure developer productivity?”, they are actually looking for an answer to a different question: “How much should we invest into engineering?” The answer:
Place small but inexpensive bets, and double down on the ones that show tangible promise.
Gergely Orosz and Kent Beck, Measuring developer productivity (Part 2)
This is similar to how I decide which recurring customer needs to turn into productised services. When trying out a productised service, I always keep the following in mind.
I probably have to develop a first version of the [productised service] on my own time and money. The first version must provide good value for customers with a predictable risk for me. It certainly shouldn’t get me into financial troubles. There should be a good chance to make a profit.
Episode 43 of my newsletter: 10 Years of Solo Consulting (Part 2)
Beck and Orosz summarise their responses in the final section “How do you answer the question?”. Each of them gives his own summary in his newsletter.
Summary by Kent Beck
While Beck sees good value in “self-measurement for self-improvement”, he regards measuring developer productivity as impossible.
Measure developer productivity? Not possible. There are too many confounding factors, network effects, variance, observation effects, and feedback loops to get sensible answers to questions like “How many times more profit did Joan create than Harry?” […]
Kent Beck, How do you answer the question?
As soon as organisations derive incentives from productivity metrics, people will game the metrics to their own advantage. The damage to the organisation will be grave.
The merit rating nourishes short-term performance, annihilates long-term planning, builds fear, demolishes teamwork, and nourishes rivalry and politics. It leaves people bitter, crushed, bruised, battered, desolate, despondent, dejected, feeling inferior, some even depressed, unfit for work for weeks after receipt of rating, unable to comprehend why they are inferior.
W. Edwards Deming, The Merit System: The Annual Appraisal: The Destroyer of People, 1986 (reprinted in the book The Essential Deming, p. 31)
Amen to that! I see these consequences in my customers’ organisations. I have experienced them myself as an employee. And avoiding them was a major motivation for working as a solo consultant. Profit is an excellent metric for how I am doing.
At the very end, Beck gives a simple but useful way to find out whether a team is doing good work. It sounds a lot like the deployment frequency from the DORA metrics - only with the emphasis on delivering value to the customer.
Weekly delivery of customer-appreciated value is the best accountability, the most aligned, the least distorting.
Kent Beck, How do you answer the question?
Summary by Gergely Orosz
[Orosz explains] what happens when you measure each of the areas:
Measure effort: create high-effort busywork of dubious value.
Measure output: increase the quantity of the output by what’s easiest to do. This might not help with outcomes or impact.
Measure outcomes: aim to beat targets, even if this means taking shortcuts.
Measure impact: get creative in reaching this with less effort and output.
Gergely Orosz, How do you answer the question?
The consequences may be good (desired) or bad (gamed). If developers are measured by the number of pull requests, they will push smaller changes more frequently. If each pull request adds value for the customer, the consequences are good: a higher deployment frequency and a shorter lead time for changes. If developers split pull requests into one per file changed, the consequences are bad.
This leads to the following advice:
When you measure effort or output, you change the behaviour of people. Just make sure that the change happens in the desired direction.
Prefer outcome and impact metrics over effort and output metrics, and team metrics over individual metrics.
When outcomes or impact are not as expected, use effort and output metrics to find the reasons.
Managers who understand - on a technical level - what their engineers are doing can make their teams much better than any fancy productivity metric.
Running a high-performing team is a hands-on approach. Looking to metrics to get numbers that show that a team is high-performing is wishful thinking.
Gergely Orosz, How do you answer the question?