A Root Cause Analysis of a Failure of a Root Cause Analysis of the Failure of Root Cause Analysis

Written by Ieuan Wickham on January 14, 2014. Posted in Articles.

A colleague recently sent me a link to a blog post describing a failure of root cause analysis, which discusses root cause analysis in the context of software development. The author, Mike Cohn, describes a scenario of applying Five Whys to a flat tyre, and discovers that the flat tyre happened because it rains. In this case root cause analysis has turned out to be of little use, and Mike asserts that it’s also less useful with software development or product development. A whole bunch of things in the post bothered me. But Mike is no slouch; he’s a well-respected agile practitioner, major contributor to Scrum, and co-founder of the Scrum Alliance.

Exactly five whys – no more, no fewer?

The Five Whys is a reminder that just asking ‘why?’ once usually isn’t enough, and sometimes it takes asking ‘why’ as many as five times to get to the bottom of the problem. But you might get there in three – or it might take six. Unless there’s an element of the ‘why’ that’s in our control, seriously consider whether it’s worth continuing.

There can be only one…

… root cause? Not so! Almost all interesting effects or problems have multiple causes. Or, as my wife puts it (because she’s a doctor and likes to use big words even more than I do), they’re multifactorial. Why is Mike’s tyre flat? Because it got a screw in it. Why was there a screw in the tyre? Or we could ask – why did the screw cause the tyre to go flat?

Is it necessary? Is it sufficient?

Because there are often a multitude of root causes, it can also be useful to identify whether they are necessary or sufficient or both, either singly or in groups. Necessary conditions are those which must be met for the effect to occur. Sufficient conditions are those which will always cause the effect. If we cannot form the root causes we have found into at least one group that is both necessary and sufficient, then there will be other (perhaps more useful or informative) causes out there.
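The necessary/sufficient test can be made concrete with a small sketch. Everything in it is hypothetical: the factor names (`screw_on_road`, `worn_tread`, `rain`), the observations, and the helper functions are invented for illustration and do not come from Mike's article. It checks a group of candidate causes against a set of observed cases.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Observation:
    """One observed case: which factors were present, and whether the effect occurred."""
    factors: FrozenSet[str]
    effect: bool


def is_necessary(group: FrozenSet[str], cases: List[Observation]) -> bool:
    """Every case that showed the effect also had all factors in the group."""
    return all(group <= c.factors for c in cases if c.effect)


def is_sufficient(group: FrozenSet[str], cases: List[Observation]) -> bool:
    """Every case that had all factors in the group also showed the effect.

    Note: this is vacuously True if no case contains the whole group.
    """
    return all(c.effect for c in cases if group <= c.factors)


# Invented observations about flat tyres, for illustration only.
CASES = [
    Observation(frozenset({"screw_on_road", "worn_tread"}), True),
    Observation(frozenset({"screw_on_road", "worn_tread", "rain"}), True),
    Observation(frozenset({"screw_on_road"}), False),
    Observation(frozenset({"worn_tread", "rain"}), False),
    Observation(frozenset({"rain"}), False),
]

screw = frozenset({"screw_on_road"})
screw_and_tread = frozenset({"screw_on_road", "worn_tread"})

print(is_necessary(screw, CASES), is_sufficient(screw, CASES))
print(is_necessary(screw_and_tread, CASES), is_sufficient(screw_and_tread, CASES))
```

With this made-up data, the screw alone is necessary but not sufficient, while the screw together with worn tread is both. Rain is neither, which is a hint that it is not the cause worth chasing.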

Which is the bigger problem? The one you’re analysing, or the time spent doing the analysis?

The author has decided that he wants to prevent the wastage of the five minutes that it took him to fix the flat tyre. He hasn’t described how often he has a flat tyre, which is a critical factor when considering how much time to invest in fixing the root cause. In this case, spending more than a few minutes thinking about why it happens is quite possibly a waste of time. Or, as the author himself puts it later in a comment: “My point remains that some root causes are not worth fixing. The real issue is that the cost of fixing a root cause can sometimes outweigh the cost of living with the problem.” Fair enough, but sometimes all that means is that you’re looking at the wrong root cause. The point Mike’s actually made with his example is that the effect or problem is not worth analysing.
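The trade-off between the cost of the problem and the cost of the analysis is simple arithmetic, sketched below. Only the five minutes per flat tyre comes from the article; the frequency, the time horizon, and the analysis and fix costs are assumptions picked for illustration, and the function names are hypothetical.

```python
def minutes_saved(minutes_lost_per_occurrence: float,
                  occurrences_per_year: float,
                  horizon_years: float) -> float:
    """Total time a perfect fix would save over the chosen horizon."""
    return minutes_lost_per_occurrence * occurrences_per_year * horizon_years


def worth_analysing(saved: float, analysis_minutes: float, fix_minutes: float) -> bool:
    """True only if the saving exceeds the cost of analysing plus fixing."""
    return saved > analysis_minutes + fix_minutes


# Five minutes per flat is from the article; every other number is an assumption.
saved = minutes_saved(minutes_lost_per_occurrence=5,
                      occurrences_per_year=2,
                      horizon_years=5)
print(saved, worth_analysing(saved, analysis_minutes=30, fix_minutes=60))
```

Under these assumed numbers the fix would save 50 minutes and cost 90, so the analysis is not worth doing. Change the assumed frequency to one flat a week and the answer flips, which is why the missing frequency matters so much.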

Why the Five Whys?

I’m not convinced this is the best technique for root cause analysis. I find that the Five Whys technique on its own is most effective when investigating the motivations for a requirement, because it suits the more linear cause-and-effect model that I find prevails when people explain why they do the things they do. For example:

Why do we need to upload the file in this format? Because that’s what the suppliers give us.
Why do the suppliers give us that format? Because that’s what we tell them to use.
Why do we tell them to use that format? Because we think it’s going to be easy for them.
Why do we think it’s going to be easy for them? Because we find it easy to manually work with.
Why are they manually working with this file? We don’t know, we just assume that they are, because at the moment we’re manually working with the files.

There’s useful information not just at the end of the chain, but all the way through – and there are still opportunities to branch your questioning.

Is root cause analysis less useful for software development?

In the original article, the author says:

“My point? Sometimes root cause analysis isn’t worth it. As with any tool or technique, it can be misapplied. A great deal of software development is novel–the thing being done or thought about is being done for the first time. (At least by that individual or team.) Root cause analysis is most useful when situations are repeated–as in manufacturing. The technique is less useful with novel, complex work such as software development (or even product development more generally).”

I’m very sorry, Mike, but I’m afraid I have to disagree. Yes, you’ve made the case that sometimes root cause analysis shouldn’t be done, particularly when the impact of the problem is small compared to the effort involved in looking at it. Yes, it can be misapplied. I’d definitely back you if you were to argue that it’s overkill for most bugs, where once the problem has been identified it can be fixed and we all move on.

But Mike hasn’t made the case (particularly with that example) that root cause analysis isn’t helpful with novel and complex work. I find that root cause analysis is in fact particularly useful with complex work, because there are usually a multitude of factors. Sometimes, if you find the right one, it’s possible to make a small change to effect a bigger one. It’s also been my experience that major and serious issues in software development come back to a sadly familiar set of root causes. With a quick rebranding, the cause categories used in manufacturing apply easily: Technology, Process, Data, Mindpower, QA, and Environment.

A different example

Over a decade ago, I was part of a team that was in a grotesque hurry in the middle of a death march, somewhere at the beginning of what was to be six months of 80-hour weeks, and approximately three months out from a go-live date that was carved in stone (or at least written in legislation). I received a data summarisation process and report, delivered to the client for user acceptance testing, for which I’m pretty sure I had provided the requirements document. The data summarisation process was to run daily, to provide the summary information for the report that would run monthly, because the development team wasn’t happy with the report performance.

One test run told me that a) the output was wrong and b) the performance was appalling. A re-run using production hardware and database configuration told me that it would take approximately three and a half years every night. A superficial ‘five whys’ analysis might lead us to believe that the root cause was a bad design decision, or poor requirements documentation. Maybe we could fix that by shouting at the designer or business analyst for a bit. (I confess that shouting at the designer made me feel better at the time – sorry, Graham.) But a more in-depth analysis, maybe using Ishikawa’s techniques, would perhaps conclude that we had at least the following factors at play:

Technology
- The development team didn’t have access to production-quality hardware.
Process
- The release process didn’t include performance testing, even where potential performance issues had been identified.
Data
- The development team didn’t have access to a production-size data set.
Mindpower
- All minds involved were working long hours under pressure to deliver.
QA
- The testing that was performed did not test the actual requirements.
Environment
- The client and the development house were separated by thousands of kilometres and several time zones.

For this organisation, these were all causes that had resulted in delivery issues in the past, and (since this kind of analysis wasn’t actually performed) would continue to result in delivery issues.
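For teams that want to keep this kind of analysis somewhere more durable than a whiteboard, the factors listed above can be recorded as a simple mapping from category to causes. This is only a sketch: the category names and causes are taken from this article, while the `CAUSES` structure and the `uncovered` and `outline` helpers are hypothetical names made up for illustration.

```python
from typing import Dict, List

# The six categories as rebranded in this article.
CATEGORIES = ["Technology", "Process", "Data", "Mindpower", "QA", "Environment"]

# Causes from the example above, filed under their category.
CAUSES: Dict[str, List[str]] = {
    "Technology": ["No access to production-quality hardware"],
    "Process": ["Release process did not include performance testing"],
    "Data": ["No access to a production-size data set"],
    "Mindpower": ["Everyone working long hours under pressure to deliver"],
    "QA": ["Testing did not test the actual requirements"],
    "Environment": ["Client and developers thousands of kilometres apart"],
}


def uncovered(causes: Dict[str, List[str]], categories: List[str]) -> List[str]:
    """Categories with no cause recorded yet: a prompt for further questioning."""
    return [c for c in categories if not causes.get(c)]


def outline(causes: Dict[str, List[str]]) -> str:
    """Render the causes as a plain-text outline, one category per heading."""
    lines: List[str] = []
    for category, items in causes.items():
        lines.append(category)
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


print(uncovered(CAUSES, CATEGORIES))
print(outline(CAUSES))
```

The `uncovered` check is the useful part. An empty category does not prove there is no cause there, but it is a reminder to ask the question before settling on a single chain of whys.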

Root cause analysis can work for you

Mike argues at the end of his article that root cause analysis shouldn’t be automatically performed, but done on the right problems, and I agree completely. Furthermore, done properly on the right problems, it is more than just an exploration of a single chain of causality. It examines the multitude of causes that contribute to the result, and attempts to determine where best to make the changes that will have the greatest effect.

In the rare cases that software or IT organisations are self-confident enough to examine why things went wrong, and to commit to fixing the root causes, I believe they’ll find themselves well rewarded. To argue that software development is novel and complex, and that therefore root cause analysis is less useful is, I think, flat-out wrong.
