Worthy metrics

This is part four - the final part - in a short series of articles on the nature of metrics, both in general and within product and especially software development.

In the previous article we outlined the general characteristics of good metrics. Here, we introduce a general framing and some of the specific metrics WORTH commonly employs in developing products and services.

There is never a universal solution to anything. Take these as typical or starting positions. We believe they will be valuable in the vast majority of cases but, as we outlined previously, context is everything and context change is inevitable.

The big questions

In delivery, we see three broad questions for the systems we act on. These are, in ascending order of importance:

Are we flowing efficiently?
Are we creating risk?
Are we creating value?

These three questions intentionally counterbalance each other. Flow without value is nothing. Value undermines itself if it generates risk. Undue risk avoidance that prevents flow is counterproductive.

We should also make clear that our context is invariably software development. When we speak of “the system” in that context we are referring to a combination of the assets we work on, the team itself, and the customers and users we are trying to create value for. Systems thinking is holistic like that.

Flow

By “flow” we mean the continuous and (approximately) stable output of the system. If we are not flowing effectively, output is unpredictable. That means costs escalate, dependency queues start to form, and we inhibit our ability to understand the cause and effect of our actions. In Cynefin terms we start to regress towards chaotic states.

Flow is by no means unimportant, but it is the least important of our questions because it is about optimisation. Flow is exactly and entirely about output, not outcomes. It is possible, indeed likely, that a team can create value without good flow states, and that is still a positive outcome. However, good flow that fails to create value is a negative outcome in all cases.

What we measure

Based on Little’s law, by far the most important possible flow metric is the mean cycle-time through the system. The second is the related mean lead-time (arguably a value metric, but important for a secondary understanding of end-to-end bottlenecks). The simplest way to explain these two metrics is a diagram. In words: lead-time runs from the request point to the delivery point, and cycle-time from the start point to the delivery point.

The beauty of cycle and lead time is that they are essentially incorruptible metrics. Short of redefining what the checkpoints mean, there’s no way to cycle faster without just cycling faster.
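To make the two metrics concrete, here is a minimal Python sketch, assuming each work item carries three timestamps for its request, start, and delivery points. The `WorkItem` type, its field names, and the sample dates are hypothetical and purely illustrative; in practice the timestamps would come from whatever tracking tool the team uses.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class WorkItem:
    requested: datetime  # request point: the item arrives in the team's backlog
    started: datetime    # start point: the team begins work on it
    delivered: datetime  # delivery point: the item is done


def days_between(earlier: datetime, later: datetime) -> float:
    """Elapsed time in days, including fractions of a day."""
    return (later - earlier).total_seconds() / 86400


def mean_cycle_time(items: list[WorkItem]) -> float:
    """Mean days from start to delivery."""
    return mean(days_between(i.started, i.delivered) for i in items)


def mean_lead_time(items: list[WorkItem]) -> float:
    """Mean days from request to delivery."""
    return mean(days_between(i.requested, i.delivered) for i in items)


# Made-up sample data, for illustration only.
items = [
    WorkItem(datetime(2020, 6, 1), datetime(2020, 6, 3), datetime(2020, 6, 4)),
    WorkItem(datetime(2020, 6, 1), datetime(2020, 6, 4), datetime(2020, 6, 6)),
    WorkItem(datetime(2020, 6, 2), datetime(2020, 6, 8), datetime(2020, 6, 11)),
]

print(f"mean cycle-time: {mean_cycle_time(items):.1f} days")
print(f"mean lead-time:  {mean_lead_time(items):.1f} days")
```

With these made-up items the mean cycle-time is 2 days and the mean lead-time is roughly 5.7 days; the gap between the two is the time work spends waiting before anyone starts it.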

This metric is just one of the four key metrics the excellent Accelerate book identifies. Deployment frequency is a useful secondary proxy, but isn’t always available or as-yet useful information in the transformation and greenfield contexts WORTH often engages in. Mean-time-to-resolution (MTTR) and change-failure rates are more purely operations-focused but have clear tactical and predictability benefits. We’ll often use them internally, but they won’t be a headline number early in product development.

Key points

The events you choose to define as the request, start, and delivery points don’t greatly matter for the purposes of measuring flow, so long as they remain consistent. In practice it works best to keep them at the extreme ends of what the team can control. If work sits with a PMO awaiting funding for months or years* before the team sees it, then you can only metric the team’s lead-time from the point it arrives in their backlog. Measure the PMO’s lead-times separately to improve them!
The trigger points we are expecting here are not values but rates of change. We anticipate most teams will gradually reduce their cycle-time down to an arbitrary ideal work-unit size (we like to use 1 day as an easy standard). That means we’re really looking for sudden changes and trends that start moving away from the ideal.
Usually degraded metrics mean either that the work units are too big and need to be more finely sliced, or that a bottleneck has developed.
This one metric allows us to eliminate estimate-based planning in favour of probabilistic data-based approaches using Monte-Carlo simulations, but that again is an article in itself.

*yes that absolutely happens
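As a rough illustration of the probabilistic approach mentioned in the key points above, the following Python sketch runs a simple Monte-Carlo forecast. It assumes we have a history of daily throughput (items delivered per day, which falls out of the same delivery timestamps used for cycle-time) and that the future will resemble that history. The function name `forecast_days`, its parameters, and the sample history are all hypothetical; this is one common way to set up such a simulation, not a description of any specific tool.

```python
import random
from typing import Optional


def forecast_days(daily_throughput: list[int], backlog_size: int,
                  runs: int = 10_000, percentile: float = 0.85,
                  seed: Optional[int] = None) -> int:
    """Days needed to finish `backlog_size` items at the given confidence.

    Each run replays a possible future by drawing random days from the
    history until the backlog is empty, then records how long that took.
    """
    if backlog_size <= 0:
        return 0
    if not any(n > 0 for n in daily_throughput):
        raise ValueError("history must contain at least one productive day")
    rng = random.Random(seed)
    outcomes = []
    for _ in range(runs):
        remaining, days = backlog_size, 0
        while remaining > 0:
            remaining -= rng.choice(daily_throughput)
            days += 1
        outcomes.append(days)
    outcomes.sort()
    # A day count that at least `percentile` of the simulated runs finished within.
    index = min(len(outcomes) - 1, int(percentile * len(outcomes)))
    return outcomes[index]


# Made-up history: items delivered on each of the last ten working days.
history = [0, 1, 2, 1, 0, 3, 1, 2, 1, 1]
p50 = forecast_days(history, backlog_size=20, percentile=0.50, seed=1)
p85 = forecast_days(history, backlog_size=20, percentile=0.85, seed=1)
print(f"50% of runs finished within {p50} days, 85% within {p85} days")
```

The output is a range of outcomes with a confidence attached rather than a single estimated date, which is the point of replacing estimates with data.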

What we don't measure

We recently interviewed someone who explained that in their current employment they had a development manager who defined how many story points each type of work should be, estimated all the work, and set a quarterly goal for all the teams (there were about five or six) to reach a velocity of 64 points each sprint. This was up from the previous quarter and would increase next quarter. Hence the interview.

We don’t measure velocity.

Risk

By “risk” we mean the negative future impact of our current work on the system. Risk is more important than Flow because by definition it represents an existential threat to the future of the organisation to a greater or lesser degree.

Thinking as we do of the widest definitions of the system, we see two main types of risk. The first and most severe risk is that of the team’s morale.

Simply put, culture is incredibly fragile. Humans are social creatures; all achievement ever has been through joint effort and shared ambition. Yet while a strong culture enables the spectacular, a poor culture cripples everything and is incredibly costly, sometimes impossible, to recover. It is therefore vital to understand early signs of decay, and we do this by trying to metric morale.

The second risk is that of degrading the technical assets we are working on, commonly expressed as technical debt. This is the recognition that the work we do now must not be done in such an unmaintainable way as to break the work that comes next. Software that can’t be changed is already dead.

Measuring morale

Morale is understandably quite a difficult thing to put numbers to. An individual’s subjective feeling is always by definition “true” here though. The approach we take is based on Daniel Pink’s amazing work on motivation in thought-work contexts. For the team we focus on purpose and autonomy (which we express as alignment and agency), leaving mastery to the individual’s career plan.

We draw this into a 2x2 grid (because we are after all a consultancy). Then the team members put their individual dots on the graph. Looks like this:

This is actual data from one of our teams. Doing well, but a couple of dots look like they have something to say...

Repeating this on a regular (e.g. sprint) basis allows us to look at movement of the cluster, outliers, and splits. We can do this anonymously (to remove anchoring) or in-person (for shared instant feedback) and it’s good to switch back and forth. There’s no wrong here and it’s allowed people an opportunity to clearly register discomfort. Actions will obviously be very contextual and sometimes personal but the trigger points themselves are now obvious.
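For teams that collect the dots digitally, movement of the cluster and outliers can be flagged with very little code. The Python sketch below is purely illustrative: it assumes each dot is an (alignment, agency) pair on a 0-10 scale, and the outlier threshold (twice the mean distance from the centre) is an arbitrary assumption rather than anything we prescribe. The function names and sample dots are hypothetical, and it does not attempt to detect splits.

```python
from math import dist
from statistics import mean

Point = tuple[float, float]  # (alignment, agency), assumed 0-10 scale


def centroid(points: list[Point]) -> Point:
    """Centre of the cluster of dots."""
    return (mean(p[0] for p in points), mean(p[1] for p in points))


def outliers(points: list[Point], factor: float = 2.0) -> list[Point]:
    """Dots further from the centre than `factor` times the mean distance."""
    centre = centroid(points)
    distances = [dist(p, centre) for p in points]
    typical = mean(distances)
    return [p for p, d in zip(points, distances)
            if typical > 0 and d > factor * typical]


def cluster_shift(previous: list[Point], current: list[Point]) -> Point:
    """Movement of the cluster centre between two rounds."""
    (px, py), (cx, cy) = centroid(previous), centroid(current)
    return (cx - px, cy - py)


# Made-up dots from two consecutive sprints.
last_sprint = [(8, 8), (7, 8), (8, 7), (7, 7), (7, 6)]
this_sprint = [(8, 8), (7, 8), (8, 7), (7, 7), (2, 3)]

shift = cluster_shift(last_sprint, this_sprint)
print("cluster moved by:", shift)
print("dots worth a conversation:", outliers(this_sprint))
```

Numbers like these only point at where to look. The conversation with the person behind the dot is still where the action comes from.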

Measuring technical debt

Technical debt is thankfully a much more objective and (generally) neutral subject. It’s also largely a solved problem through the use of SonarQube. Yes, there’s some back-and-forth engineers will always have about what constitutes good and bad (non-engineers would be shocked at how we can be). By-and-large though, SonarQube has just made this a commodity. Professional developers can leave their personal preferences in GitHub and just get on with the job.

Actual data from one of the assets we’re developing. Maybe our TDD skills need a brush up.

SonarQube both headlines a single objective debt score and breaks down several aspects (duplication, coverage, code smells, etc.), so there’s clarity and depth. We’ll typically augment this with deeper security/vulnerability tools as well. SonarQube is free and you can have it running within a day, so it’s a go-to solution for us.

The trigger points here are clear and objective. You can and should (and we do) even automate them into your build pipeline. Doing so creates a system that simply doesn’t allow itself to degrade without deliberate manipulation. We’ve even been able to apply this to legacy systems by introducing gate conditions that demand marginal incremental improvements. The “boy scout” rule: leave it better than you found it.
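The idea of a gate that refuses to let things get worse can be sketched independently of any particular tool. The Python below is a generic illustration, not SonarQube’s API or configuration: the metric names, the `gate_failures` function, and the sample numbers are all hypothetical, and we assume the pipeline can supply a stored baseline and the current measurements as plain dictionaries.

```python
# Hypothetical metric names; True means a higher value is better.
HIGHER_IS_BETTER = {
    "coverage_pct": True,
    "duplication_pct": False,
    "code_smells": False,
}


def gate_failures(baseline: dict[str, float],
                  current: dict[str, float]) -> list[str]:
    """Describe every metric that is worse than its baseline."""
    failures = []
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        before, now = baseline[name], current[name]
        worse = now < before if higher_is_better else now > before
        if worse:
            failures.append(f"{name}: {before} -> {now}")
    return failures


# Made-up measurements: coverage slipped, the other two improved.
baseline = {"coverage_pct": 71.0, "duplication_pct": 4.2, "code_smells": 130}
current = {"coverage_pct": 69.5, "duplication_pct": 4.0, "code_smells": 128}

failures = gate_failures(baseline, current)
for failure in failures:
    print("GATE FAILED", failure)
# A pipeline step would typically exit with a non-zero status here
# whenever `failures` is not empty, which fails the build.
```

When the gate passes, the current numbers become the new baseline, so the bar can only move in one direction. A stricter variant for legacy systems could require a small improvement on each change instead of merely no regression.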

What we don't measure (yet)

The concept of technical debt has suggested the notion of flow debt and design debt (aka product or feature debt). There's some elegant symmetry here but they're emergent ideas which need thought and field experience to develop further.

What is fairly clear at this point is that a mature team that is aware of these debt concepts and can effectively metric them is in a position to manipulate them in its favour, making tactical trade-offs to achieve better, faster outcomes. A promising future.

Value

By “value” we mean customer-value, business-value, or impact; we won’t split hairs on terminology, as these are all equivalent enough. That is to say, value is the magic quality that makes your activities actually worth doing: classically, the activities that create revenue, avoid costs, protect market share, etc.

Value is the most important metric because it too has an existential impact on the organisation, but its effect is both immediate and subtle. Without creating value it doesn’t matter how efficient your flow or well managed your risks are.

As value is always entirely specific to the client’s context there can be no go-to metric here. It’s something we have to identify at the start of every engagement through research and strategy, and it represents - for WORTH - how we demonstrate our own impact to the client as well as ourselves.

Product strategy in general is a subject that we’ll be writing on in future, but for our purposes here we strongly favour “product bet” approaches. One such is Amplitude’s North Star framework which is explicit in creating structured control metrics as a product discovery and development mechanism.

Hypothetical North Star metrics. Source: Amplitude

What we are learning to measure

We started this series with the example of COVID-19, which will sadly persist for some time. As I write this we’re in the midst of the largest civil rights action in history. A protest against another set of stark metrics. And there are still carbon problems ahead.

It’s very clear to us that the time of being able to remain ignorant or even inactive about the social and environmental impacts we have on the world is at an end.

In 2015 the UN defined seventeen Sustainable Development Goals (SDGs) as a platform for driving private and public capital into creating better futures for us all by 2030. We believe we and our clients can use this as a basis to be sure that not only are we doing no harm, but that we can pivot towards the good. Sustainability metrics will form part of the decision making process, but also let us celebrate and champion that progress is possible.

We’re still working out how we make this work in practice, looking to learn from various leaders in the field, and we are optimistic.

Last thoughts

Bad metrics are harmful and they are everywhere, but they needn’t be either.

However - and we cannot stress this enough - absolutely no customer in the world is impressed by how good your development metrics are. Nobody buys a product because you have great cycle time. Nobody extends a contract because your technical debt is near zero. You aren’t selling your metrics.

The mic-drop is that good metrics can help guide you (and poor ones will mislead you) but they don’t fundamentally matter. Too much concern for metrics, no matter how well developed, is just falling into McNamara’s fallacy once again. Get qualitative data too. Talk to your customers. Talk to your staff. Engage.

As an industry, digital product and service development hasn’t historically done a very good job of educating itself on why these things matter and how to get this right. That’s slowly starting to change and WORTH believes we can contribute to improving the world with these humble steps; through education, through engagement, and by championing what matters.

We measure to learn, not to control. The latter is an illusion, the former is how all things progress.
