Why Metrics Fail

This is part two of a short series on the nature of metrics in general and within software development in particular. In the previous article we outlined two recent examples of the real danger of bad metrics. In this article we outline some of the theory behind why bad metrics happen.


Goodhart’s Problem

The stated policy goal of those UK and US authorities cited previously is to “reopen the economy”. It is therefore somewhat ironic that economists have several well-known and well-evidenced laws, dating back to at least 1969, that describe and caution against exactly this behaviour (economist language ahead, stay with me):

“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” Goodhart’s Law, 1975
“Given that the structure of an econometric model consists of optimal decision rules of economic agents, and that optimal decision rules vary systematically with changes in the structure of series relevant to the decision maker, it follows that any change in policy will systematically alter the structure of econometric models.” The Lucas Critique, 1976
“The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.” Campbell’s Law, 1979

Catchy stuff. Perhaps the dry language of economics has inhibited public awareness somewhat. In plainer language:

“When a measure becomes a target, it ceases to be a good measure and the target no longer means what you think it does.” Marilyn Strathern, appended by Doc Norton

This is the metric-induced root cause of negative yet widespread system behaviours: teaching to the test, casual corporate violence, and billions of dollars of fines and litigation. All of those examples were created by gamed metrics.

You absolutely get the behaviour you incentivise, usually in unintended ways. Sometimes with shocking boldness.
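The mechanism can be made concrete with a toy model. This is a minimal sketch rather than a description of any real system: it assumes an agent with a fixed amount of effort who can either do real work or game the number, and that gaming moves the metric three times as cheaply as real work. The names `outcome`, `chosen_gaming_share`, and `GAMING_MULTIPLIER`, and all the numbers, are hypothetical.

```python
# Hypothetical model: all names and numbers are illustrative assumptions.
GAMING_MULTIPLIER = 3.0  # assumed: an hour of gaming moves the metric 3x as much as real work


def outcome(capacity: float, gaming_share: float) -> tuple[float, float]:
    """Return (reported_metric, real_value) for a given split of effort."""
    real_effort = capacity * (1.0 - gaming_share)
    gaming_effort = capacity * gaming_share
    real_value = real_effort  # only real work creates value
    reported_metric = real_effort + GAMING_MULTIPLIER * gaming_effort
    return reported_metric, real_value


def chosen_gaming_share(capacity: float, metric_is_target: bool) -> float:
    """Pick the effort split an agent rewarded on the metric would choose."""
    if not metric_is_target:
        return 0.0  # nothing to gain from gaming a measure nobody rewards
    shares = [i / 10 for i in range(11)]  # candidate splits: 0.0, 0.1, ..., 1.0
    return max(shares, key=lambda s: outcome(capacity, s)[0])


for is_target in (False, True):
    share = chosen_gaming_share(10.0, is_target)
    metric, value = outcome(10.0, share)
    print(f"target={is_target}: gaming={share:.0%} metric={metric:.1f} value={value:.1f}")
```

While the measure is only observed, it tracks real value exactly. Once it becomes a target, under these assumptions the reported number triples while the value delivered falls to zero. The multiplier is invented, but any value above one produces the same direction of travel.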

Drucker’s Problem

The forces that push for both creation and corruption of metrics are deep-running in our systems, our culture, and even our biology.

Our existing digital era is built on top of the systems of the past: the atomic and industrial ages. Adam Smith, Frederick Taylor, and Henry Ford still define the foundations of management and enterprise. These are men whose theories assume that the workforce is essentially lazy and cannot be trusted, and that maintaining productivity requires the continuous assessment of workers who must be set specific, limited activities with quantified output targets. Hardly surprising given foundations in racism and slavery, and entirely unfit for this century.

The McNamara fallacy is the belief that one can make effective decisions solely through quantitative metrics. It's named after the US Secretary of Defence who ineffectually presided over the early Vietnam War. Unable and unwilling to comprehend the grand failure his metrics were telling him wasn’t happening, McNamara’s approach led to bizarre and lethal policy consequences worth hearing. Sadly McNamara isn’t even the only Secretary of Defence to fall for this:

“Rumsfeld was obsessed with achieving positive ‘metrics’ that could be wielded to demonstrate progress in the Global War on Terror.” Jon Krakauer, “Where Men Win Glory”

This is the difference between being data-driven and data-informed. The fallacy is at the core of the oft-quoted “if you can’t measure it, you can’t manage it”, usually attributed to Peter Drucker, a man described as “the founder of modern management”. He was an absolute visionary, but he never said it (and neither did Deming). Indeed, what he actually said was pretty much the opposite:

“What gets measured gets managed — even when it’s pointless to measure and manage it, and even if it harms the purpose of the organisation to do so.” Peter Drucker

Correlation may not be causation, but inference is fast and easy.

The Context Problem

However, Taylorism wouldn’t be the shaping force it is if it didn’t fundamentally work. Past tense mostly, because the industrial context it arose in was one of almost entirely linear manual effort. Assembly line manufacturing. The work of recreating a past state repeatedly. These deterministic ordered domains lend themselves very well to simple equations: productivity directly leads to output and output to profit. And that coincidentally fits very well with Western psychological/cultural needs about justice and free will and so on. It's still inhuman, but the maths works at least.

Unfortunately, this immediately breaks down in non-linear contexts with uncertain end-states, such as in the now dominant digital economy. These are the emergent (or unordered) domains of thought effort. Discovery. The work of obtaining an uncharted future state. Intensely productive periods of effort may yield nothing. Or everything. Or the wrong thing, or the right thing that you didn’t know you even wanted. You can’t make the future happen faster through sheer effort. The moon landing wouldn’t have happened earlier if only we had scienced harder.

This is an irreconcilable paradigm shift. Tools are context sensitive. What was effective in one domain is not in the other. Unfortunately, the awareness of context change is far from obvious and humans tend to apply the tools they have. In this case output-focused productivity metrics.

Some Systems Thinking

It is vital to understand that the system acts on itself. Contexts slowly change over time because of the activity of the participants. That means ourselves, our competitors, and the customers, but also external actors. That’s because contexts exist within other larger contexts, and themselves contain smaller contexts.

It's a beautiful mess really, but it causes a kind of whack-a-mole local optimisation problem. Improving a metric may well have the appearance of success from one perspective while actually causing damage from another. A classic example is contract selection on the basis of cost, which saves money in the annual budget. Great! Unfortunately the lower quality of delivery causes other areas, like customer retention, to suffer. Hmm.
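The contract example can be put into numbers. The sketch below is purely illustrative: the two vendors, their prices, the churn percentages, the customer count, and the revenue per customer are all invented. It only shows how the choice that looks best on one metric can lose once a second perspective is counted.

```python
# Hypothetical figures throughout: vendors, prices, and churn are invented for illustration.
CUSTOMERS = 1_000
REVENUE_PER_CUSTOMER = 500

vendors = {
    "cheap": {"contract_cost": 80_000, "churn_percent": 10},
    "quality": {"contract_cost": 100_000, "churn_percent": 4},
}


def lost_revenue(vendor: dict) -> int:
    """Revenue lost to customers who leave because of delivery quality."""
    churned_customers = CUSTOMERS * vendor["churn_percent"] // 100
    return churned_customers * REVENUE_PER_CUSTOMER


def total_cost(vendor: dict) -> int:
    """Cost seen by the whole system: the contract plus the lost revenue."""
    return vendor["contract_cost"] + lost_revenue(vendor)


# Local view: procurement is measured on contract cost alone.
local_choice = min(vendors, key=lambda name: vendors[name]["contract_cost"])

# System view: the same decision, with retention included.
system_choice = min(vendors, key=lambda name: total_cost(vendors[name]))

print(f"procurement metric picks: {local_choice}")
print(f"whole-system cost picks:  {system_choice}")
```

With these made-up numbers the cheaper contract saves 20,000 in the annual budget and costs 30,000 more in lost customers, so the procurement metric improves while the organisation as a whole is worse off.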

Signal / Noise

We can model any human endeavour as a kind of economy, so laws of economics tend to apply universally. Software development included.

In its formative stages, as per Drucker’s problem, the software industry developed some particular practices. They fit the prevailing Taylorism of business so well that they became, and remain, standard practice. Velocity, say/do, story points, lines-of-code counts, defect counts, code churn… all of these are vulnerable to Goodhart’s problem. They are unhelpful and misleading at best, abused and abusive at worst, and so ubiquitous that it's past time to leave them behind.

Technology in 2020 is also the realm of unimaginably vast data. Far larger than anything we have ever seen before. We've needed to invent new fields of information science just to handle the problem. By 2025 we’ll be creating almost half a zettabyte of data daily. For a sense of scale, in 2016 Cisco estimated the whole of the internet at about one zettabyte total.

All that readily accessible data in the hands of a cadre of extremely data-friendly engineers creates swathes of potential metrics. And so, inexorably, the traditional management pressures create new markets for operational infrastructure monitoring and business intelligence dashboards. Lots of numbers, lots of charts, lots of metrics.

If what gets measured gets managed, in a sea of data we are in danger of drowning in management.


Metrics are easily corrupted. We tend to think in ways that corrupt them, and we have a lot of opportunities to do so. It's a tough challenge, but by no means an impossible one. In the next article we’ll look at approaches to general solutions, and characterise what 'good' looks like.
