
Author: Ieuan Wickham

A Root Cause Analysis of a Failure of a Root Cause Analysis of the Failure of Root Cause Analysis

A colleague recently sent me a link to a blog post on the failure of root cause analysis, which discusses root cause analysis in the context of software development. The author, Mike Cohn, describes applying the Five Whys to a flat tyre, and discovers that the flat tyre happened because it rains. In this case root cause analysis turned out to be of little use, and Mike asserts that it’s also less useful for software development or product development. There were a whole bunch of things in the post that bothered me. But Mike is no slouch; he’s a well-respected agile practitioner, a major contributor to Scrum, and a co-founder of the Scrum Alliance.

Exactly five whys – no more, no fewer?

The Five Whys is a reminder that just asking ‘why?’ once usually isn’t enough, and that it can take asking ‘why’ as many as five times to get to the bottom of the problem. But you might get there in three – or it might take six. And unless there’s an element of the current ‘why’ that’s within our control, seriously consider whether it’s worth continuing.

There can be only one…

… root cause? Not so! Almost all interesting effects or problems have multiple causes. Or, as my wife puts it (because she’s a doctor and likes to use big words even more than I do), they’re multifactorial. Why is Mike’s tyre flat? Because it got a screw in it. Why was there a screw in the tyre? Or we could ask – why did the screw cause the tyre to go flat?

Is it necessary? Is it sufficient?

Because there are often a multitude of root causes, it can also be useful to identify whether they are necessary or sufficient or both, either singly or in groups. Necessary conditions are those which must be met for the effect to occur. Sufficient conditions are those which will always cause the effect. If we cannot form the root causes we have found into at least one group that is both necessary and sufficient, then there will be other (perhaps more useful or informative) causes out there.
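
As a concrete (and entirely hypothetical) illustration of the necessary/sufficient test, here is a minimal sketch; the incident data and cause names are made up for the example, not taken from the article.

```python
# A minimal sketch (my own illustration) of testing whether a candidate group of
# causes is necessary and/or sufficient, given incidents recorded as
# (causes_present, effect_occurred) pairs.

def is_necessary(group, incidents):
    # Necessary: the effect never occurs unless every cause in the group is present.
    return all(group <= causes for causes, effect in incidents if effect)

def is_sufficient(group, incidents):
    # Sufficient: whenever every cause in the group is present, the effect occurs.
    return all(effect for causes, effect in incidents if group <= causes)

# Hypothetical flat-tyre history: the screw alone wasn't enough, nor was worn tread.
incidents = [
    ({"screw on road", "worn tread"}, True),
    ({"screw on road"}, False),
    ({"worn tread"}, False),
]

group = {"screw on road", "worn tread"}
print(is_necessary(group, incidents), is_sufficient(group, incidents))  # True True
```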

Which is the bigger problem? The one you’re analysing, or the time spent doing the analysis?

The author has decided that he wants to prevent the wastage of the five minutes that it took him to fix the flat tyre. He hasn’t described how often he has a flat tyre, which is a critical factor when considering how much time to invest in fixing the root cause. In this case, spending more than a few minutes thinking about why it happens is quite possibly a waste of time. Or, as the author himself puts it later in a comment: “My point remains that some root causes are not worth fixing. The real issue is that the cost of fixing a root cause can sometimes outweigh the cost of living with the problem.” Fair enough, but sometimes all that means is that you’re looking at the wrong root cause. The point Mike’s actually made with his example is that the effect or problem is not worth analysing.
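
To make the ‘is it worth analysing?’ question concrete, a back-of-the-envelope comparison like the sketch below (with made-up numbers) is usually all that’s needed.

```python
# Back-of-the-envelope check with made-up numbers: is fixing the root cause
# cheaper than simply living with the problem?

occurrences_per_year = 1         # how often the flat tyre (or problem) recurs
minutes_per_occurrence = 5       # cost of living with one occurrence
horizon_years = 10               # period over which prevention would pay off
minutes_to_fix_root_cause = 120  # analysis plus whatever prevention costs

cost_of_living_with_it = occurrences_per_year * minutes_per_occurrence * horizon_years
print(cost_of_living_with_it, "vs", minutes_to_fix_root_cause)  # 50 vs 120
# 50 minutes lost over ten years versus 120 minutes of effort: not worth analysing.
```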

Why the Five Whys?

I’m not convinced this is the best technique for root cause analysis. I find that the Five Whys technique on its own is most effective when investigating motivations for a requirement, because it suits a more linear cause-effect model that I find prevails in why people say they do the things that they do. For example:

  • Why do we need to upload the file in this format? Because that’s what the suppliers give us.
  • Why do the suppliers give us that format? Because that’s what we tell them to use.
  • Why do we tell them to use that format? Because we think it’s going to be easy for them.
  • Why do we think it’s going to be easy for them? Because we find it easy to manually work with.
  • Why are they manually working with this file? We don’t know, we just assume that they are, because at the moment we’re manually working with the files.

There’s useful information not just at the end of the chain, but all the way through – and there are still opportunities to branch your questioning.
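
One way to keep that mid-chain information, and the branches, is to record the session as a tree rather than a flat list. A rough sketch of my own, using the example above:

```python
# A rough sketch (my own) of recording a Five Whys session as a tree, so that
# answers part-way down the chain and branches in the questioning are retained.

why_tree = {
    "question": "Why do we need to upload the file in this format?",
    "answer": "Because that's what the suppliers give us.",
    "whys": [
        {
            "question": "Why do the suppliers give us that format?",
            "answer": "Because that's what we tell them to use.",
            "whys": [],  # further questions and branches would hang off here
        }
    ],
}

def walk(node, depth=0):
    # Print every question/answer pair, not just the one at the end of the chain.
    indent = "  " * depth
    print(indent + node["question"])
    print(indent + "-> " + node["answer"])
    for child in node["whys"]:
        walk(child, depth + 1)

walk(why_tree)
```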

Is root cause analysis less useful for software development?

In the original article, the author says:

“My point? Sometimes root cause analysis isn’t worth it. As with any tool or technique, it can be misapplied. A great deal of software development is novel–the thing being done or thought about is being done for the first time. (At least by that individual or team.) Root cause analysis is most useful when situations are repeated–as in manufacturing. The technique is less useful with novel, complex work such as software development (or even product development more generally).”

I’m very sorry, Mike, but I’m afraid I have to disagree. Yes, you’ve made the case that sometimes root cause analysis shouldn’t be done, particularly when the impact of the problem is small compared to the effort involved in looking at it. Yes, it can be misapplied. I’d definitely back you if you were to argue that it’s overkill for most bugs, where once the problem has been identified it can be fixed and we all move on.

But Mike hasn’t made the case (particularly with that example) that root cause analysis isn’t helpful with novel and complex work. I find that root cause analysis is in fact particularly useful with complex work, because there are usually a multitude of factors. Sometimes, if you find the right one, a small change can have a much larger effect. It’s also been my experience that major and serious issues in software development come back to a sadly familiar set of root causes. With a quick rebranding, the classic manufacturing cause categories apply just as well – Technology, Process, Data, Mindpower, QA, or Environment.

A different example

Over a decade ago, I was part of a team in a grotesque hurry in the middle of a death march, somewhere near the beginning of what was to be six months of 80-hour weeks, and approximately three months out from a go-live date that was carved in stone (or at least written in legislation). I received, for delivery to the client for user acceptance testing, a data summarisation process and report for which I’m pretty sure I had provided the requirements document. The data summarisation process was to run daily to provide the summary information for the report, which would run monthly, because the development team wasn’t happy with the report’s performance.

One test run told me that a) the output was wrong and b) the performance was appalling. A re-run using production hardware and database configuration told me that the nightly run would take approximately three and a half years. A superficial ‘five whys’ analysis might lead us to believe that the root cause was a bad design decision, or poor requirements documentation. Maybe we could fix that by shouting at the designer or business analyst for a bit. (I confess that shouting at the designer made me feel better at the time – sorry, Graham.) But a more in-depth analysis, maybe using Ishikawa’s techniques, would perhaps conclude that we had at least the following factors at play:

  • Technology
    • The development team didn’t have access to production-quality hardware.
  • Process
    • The release process didn’t include performance testing, even where potential performance issues had been identified.
  • Data
    • The development team didn’t have access to a production-size data set.
  • Mindpower
    • All minds involved were working long hours under pressure to deliver.
  • QA
    • The testing that was performed did not test the actual requirements.
  • Environment
    • The client and the development house were separated by thousands of kilometres and several time zones.

For this organisation, these were all causes that had resulted in delivery issues in the past, and (since this kind of analysis wasn’t actually performed) would continue to result in delivery issues.
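
If you wanted to keep track of how often these categories recur across post-incident reviews, something as simple as the sketch below would do; the structure and names are my own illustration, not a standard tool or the author’s method.

```python
# A simple sketch (my own) of recording an Ishikawa-style analysis as
# category -> contributing causes, so repeat offenders can be counted
# across many post-incident reviews.

from collections import Counter

death_march_review = {
    "Technology": ["no access to production-quality hardware"],
    "Process": ["no performance testing before release"],
    "Data": ["no production-size data set"],
    "Mindpower": ["long hours and pressure to deliver"],
    "QA": ["testing did not cover the actual requirements"],
    "Environment": ["client and developers thousands of kilometres apart"],
}

reviews = [death_march_review]  # in practice, one entry per review over the years
category_counts = Counter(cat for review in reviews for cat in review)
print(category_counts.most_common())
```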

Root cause analysis can work for you

Mike argues at the end of his article that root cause analysis shouldn’t be automatically performed, but done on the right problems, and I agree completely. Furthermore, done properly on the right problems, it is more than just an exploration of a single chain of causality. It examines the multitude of causes that contribute to the result, and attempts to determine where best to make the changes that will have the greatest effect.

In the rare cases that software or IT organisations are self-confident enough to examine why things went wrong, and to commit to fixing the root causes, I believe they’ll find themselves well rewarded. To argue that software development is novel and complex, and that therefore root cause analysis is less useful is, I think, flat-out wrong.

Don’t forget to leave your comments below.

The Bermuda Triangle for Projects

We all run afoul of this from time to time – it’s known as the Project Management Triangle or the Iron Triangle. It encapsulates the tension between schedule, cost and scope that occurs on any project, and it appears in a number of forms. Over time, as the Wikipedia article demonstrates, people have explored how it’s not quite that simple. In constructing a house, for instance, you can still achieve the essence of the scope of the project (e.g. all the rooms) in fixed time and cost by sacrificing quality (e.g. a formica bench instead of granite, or cheaper carpet).

It is also a critical factor in how we as Business Analysts prioritise business, stakeholder, or solution requirements. For each requirement it should be possible, if you are so inclined, to measure the time, risk, and cost involved, and the value delivered by the scope of the requirement. In practice, prioritisation is typically done through a subjective stakeholder agreement session.
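
If you are so inclined, even a crude scoring model makes those factors explicit; a sketch with made-up weights, names, and numbers:

```python
# A crude scoring sketch (made-up weights and numbers) for making the
# time/risk/cost/value trade-off explicit per requirement.

requirements = [
    {"name": "Upload supplier file", "value": 8, "cost": 3, "risk": 2, "time": 2},
    {"name": "Monthly summary report", "value": 5, "cost": 5, "risk": 4, "time": 5},
]

def score(req):
    # Value counts for the requirement; cost, risk, and time count against it.
    return req["value"] - 0.5 * (req["cost"] + req["risk"] + req["time"])

for req in sorted(requirements, key=score, reverse=True):
    print(req["name"], round(score(req), 1))
```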

Understanding What’s Wrong

The Iron Triangle reaches its nadir in the phrase: ‘Fast, cheap, good – pick two.’ I have two issues with this phrase.

The first issue I have is that it encourages the idea that it is always possible to actually complete a large project on time and on budget. (That’s not what it says, but it is how it’s interpreted.) Sadly, when most Information Systems (IS) projects do come in on time and on budget, it is more due to good fortune than to good judgement or hard work. The second issue is that ‘good’ is interpreted as referring to quality rather than scope. A while back I worked on a project where the stated priorities were time, then cost, then quality – “but we’re not sacrificing quality at all”.

Here’s why I think having an Iron Triangle of time, cost, and quality is a bad idea on an IS project. Scope in this context is implicitly fixed, even if it’s not fully understood. But when we sacrifice quality, what we’re actually sacrificing is uncontrolled scope. This is because quality usually degrades across the whole scope of the project, although it may degrade more in specific areas. If a piece of software has poor quality, it means it doesn’t do the job that’s being asked of it to the standard required by the client. I don’t mean that the solution must instead be perfect; rather, that it should meet the agreed and accepted quality targets. If it doesn’t, then the client will not accept the solution, either actively by rejecting it or passively by not using it.

For the example below, imagine our scope as a list of 100 requirements, grouped so that requirements near to each other are part of the same scope area. As the project proceeds, analysis uncovers 10 additional requirements, which we insert in their appropriate places in the list. This causes the scope to expand.

Figure 1 – time, cost and quality.

Figure 1 demonstrates what happens if we sacrifice quality. At some point (usually late) during the project we find out that due to poor quality, the requirements will be incompletely met across all requirement groups. The solution doesn’t work, and it’s not going to work within our ‘hard’ constraints of time and cost. To deliver a complete solution we now have to choose whether to:

  • allow an increase to time and cost, i.e. overrun the project, in the name of delivering the expected and agreed scope;
  • make scope cuts to meet the budget by stopping work on some of the groups of requirements, against the backdrop of an expectation that the full scope will be delivered;
  • or, more likely, both.

What We Can Do Better

Figure 2 – time, cost and scope.

In contrast, Figure 2 shows the result of sacrificing scope. As analysis proceeds we determine early that we won’t meet the budget, because our understanding of the size of the scope has increased. We reduce the scope in agreed areas as early as we can. This allows us to manage the savings we need while continuing to provide the best value, instead of being driven by whatever quality the delivered functional areas happen to have. It also allows us to set expectations early about what scope will be delivered, and what workarounds or alternative solutions might need to be put in place.
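
To put rough numbers on the difference between the two figures (my numbers, not the article’s): spreading a budget shortfall across every group degrades everything, while cutting an agreed group early lets the rest be delivered completely.

```python
# A toy model (my own numbers) of the two responses to a budget shortfall:
# dilute quality across every requirement group, or drop a whole group early.

groups = {"A": 30, "B": 20, "C": 30, "D": 30}  # estimated effort per scope area
budget = 80                                    # effort we can actually afford

# Figure 1 style: keep all groups and let quality absorb the shortfall.
dilution = budget / sum(groups.values())
print({name: f"{dilution:.0%} complete" for name in groups})  # every area ~73% done

# Figure 2 style: agree early to drop group C and deliver the rest properly.
kept = {name: effort for name, effort in groups.items() if name != "C"}
print(list(kept), "fully delivered at cost", sum(kept.values()))  # cost 80, on budget
```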

When we get the opportunity to work with stakeholders to determine how we define project success, project sliders can be useful. I prefer to use sliders in such a way that no two sliders can be on the same setting. This means you need as many settings as you have sliders, but it prevents someone from setting time, cost, and quality to 5 and everything else to 0, for instance. This isn’t a problem that we fix with tools; it’s about education and setting expectations. And sometimes we’re just not in a position to influence these attitudes.
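
The ‘no two sliders on the same setting’ rule is easy to enforce mechanically, though as noted it’s really an education problem. A small sketch of the check (the names and values are mine):

```python
# A small sketch (my own) of the slider rule: with as many settings as sliders,
# no two sliders may share a setting, so everything can't be the top priority.

def validate_sliders(sliders):
    expected = list(range(1, len(sliders) + 1))
    if sorted(sliders.values()) != expected:
        raise ValueError(f"each slider needs its own setting from 1 to {len(sliders)}")
    return sliders

validate_sliders({"time": 4, "cost": 3, "scope": 2, "quality": 1})  # accepted
try:
    validate_sliders({"time": 4, "cost": 4, "scope": 4, "quality": 4})
except ValueError as err:
    print(err)  # rejected: everything set to the top priority
```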

When we’re working with stakeholders to prioritise requirements, we need to encourage them to think clearly about what their priorities are, and the implications of setting those priorities given the funding and schedule. We also need to make sure that the stakeholders can trace the requirements to their relevant benefits and understand the relevant costs, benefits, risks, compliance aspects, or whatever factors they have agreed to be used for prioritisation.

For example, using MoSCoW priorities:

  • If it’s a Must-Have, then if we can’t find a solution, we need to reconsider the project viability. (Are you sure it’s a Must-Have? Does the business value of the requirement justify this?)
  • If it’s a Should-Have, we’ll do it if we have time and budget left over. (What are the relative priorities of the Should-Haves?)
  • If it’s a Could-Have, then we’ll only do it if it’s effectively free.

We need to do this whenever we’re deciding which requirements to proceed to the next stage with. For a Scrum project, this might be at each sprint. For an iterative project, this might be at the end of the stakeholder requirements analysis and the end of the solution requirements analysis for each iteration.
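
As a rough illustration of those MoSCoW rules in action (hypothetical requirements and numbers, not a real project), selection against a fixed budget might look like this:

```python
# A rough sketch (hypothetical requirements and numbers) of the MoSCoW rules above:
# Must-Haves first, Should-Haves if budget remains, Could-Haves only if free.

requirements = [
    ("Calculate levy", "Must", 20),
    ("Audit trail", "Must", 10),
    ("Monthly summary report", "Should", 15),
    ("Export to spreadsheet", "Could", 5),
]
budget = 50

selected, spent = [], 0
for name, priority, cost in requirements:
    if priority == "Must":
        selected.append(name)
        spent += cost
if spent > budget:
    raise ValueError("Must-Haves alone exceed the budget - reconsider project viability")

for name, priority, cost in requirements:
    if priority == "Should" and spent + cost <= budget:
        selected.append(name)
        spent += cost

for name, priority, cost in requirements:
    if priority == "Could" and cost == 0:  # only if it's effectively free
        selected.append(name)

print(selected, spent)  # the Musts and the Should fit; the Could isn't free, so it's out
```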

If instead the stakeholders persist in talking about time, cost, and quality, it encourages them to think of scope as fixed and fosters the ‘everything’s a must-have’ mentality. It makes it harder for them to admit that they need to make the tough call early and either increase schedule or cost, or decrease the scope. When they finally do need to make the call, it’s more expensive and they have fewer choices.

That’s when the stakeholders need to admit that they didn’t have a priority of time, cost and then quality. What they actually needed was the ability to make transparent and informed choices about changes to scope, cost, and time. We as Business Analysts contribute to this by performing sufficient requirements analysis to identify those prioritisation factors, and by encouraging stakeholders to review their prioritisation at the right times.

Don’t forget to leave your comments below.