Tuesday, September 15, 2009

METRIC OF THE MONTH – ERROR RATES

So in the last couple weeks, Metrics have become a candidate for the 7th plague.

Good God, y’all.

It rarely bothers me when someone disagrees with me. I actually enjoy debating and talking about our field. Sometimes after a really great discussion, I find I was wrong. That’s part of what makes it all so much fun. But I have pretty strong feelings about doing what you need to do to make yourself, your team, and your company successful, and that does not include being stubborn about metrics. Overall, what I feel about the fear and loathing of metrics is…pity. I hate seeing extraordinarily talented people limit themselves. And avoiding metrics can be a career-limiting move, particularly if management is your goal.

So it’s up to you – willing to listen and maybe add a few tools to your tool belt? You don’t HAVE to use every tool you’ve got, just be comfortable enough to use them when you need them. Yes? Great - then let’s take a look at some metrics that can Do Good Things for you, your team, and your company.

There are a bizillion different types of metrics out there and I’m going to start with one of the most common – percentage of error. It can be the overall percentage of error, the error in a given environment (usually production), or the percentage of error for a given project/product effort. I’ll start out with the most basic concept and we’ll slice and dice it down into a few commonplace sets of numbers.

Most of the “useful” metrics I’ve seen are pulled to help answer a question or help diagnose a problem. They can also be used as a way to monitor progress; we’ll talk about that later.

When you’re interested in the percentage of error overall or in a given environment, generally it is because you have questions or problems such as:

1. There have been an unusually high number of problems reported by your clients
(whether those are internal or external) and you’re trying to figure out what
happened;

2. People are complaining that the test team is missing too many errors and you’re
not sure what is true or where to focus your efforts;

3. Either you or another group has been working hard on doing (something) better
and you’d like to see if it made a difference;

4. Sometimes the company is happy with the test team’s efforts and sometimes it
isn’t. You’d like to “take a temperature” after each effort to get ahead of the
game and help figure out how to provide more consistent service;

5. You always have issues with the same project team or product line in
production and you’d like to figure out if what you perceive is an issue is, in
fact, an issue. If you can see the same trends or problems numerically, you’d
like to present the information to the PM and executive managers in a way that
will encourage them to take action.

These are just a few examples of the kinds of things that can at least be partially addressed by pulling and analyzing some numbers.

***DANGER…DANGER, WILL ROBINSON….***

Only you or your team (management or other) can decide if a number is “good” or “bad”. Numbers have to be interpreted intelligently. In this case, “percentage of error” is, in and of itself, meaningless. A company’s tolerance for error varies widely between industries. And even corporate culture impacts what is considered “good” or “bad”. It’s for this reason that trying to compare your own company to an “industry standard” can be such a bag of worms. No two companies are identical, so you may be trying to compare yourself to an organization ten times your size, with ten times your budget, etc. Industry standards can be interesting and can even be used to help establish a goal (“something to shoot for”), but the enormous number of types and sheer volume of companies that feed the averages that make up any “standard” means it’s not really a standard FOR YOUR COMPANY. Or ANY OTHER COMPANY. It’s nothing but a ginormous average made up of data from a lot of companies.

For that reason, I believe it is far more valuable to you and to your company to determine and establish your own “standard”.

So let’s start figuring out where you stand, statistically speaking, right now.

There are several ways to determine the percentage of error, either overall or in a given environment.

The first method, which I prefer, requires that you know how many tests (or conditions or scripts or charters - you get the idea) were run and how many errors were found. We’ll start with the most basic number – the percentage of error found across all environments. By the way, if you’re counting all of the errors found throughout the process, including production, you’re going to have to figure out when to stop counting. In our case, I reviewed all of the errors found in production for 4 weeks and found the number of problems reported due to new software installs dropped off dramatically after two weeks. So my cutoff point, at this company, is 2 weeks after new/changed code is migrated to production.
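If it helps to see the cutoff idea in code, here is a minimal sketch. The dates and the function name are made up for illustration; only the two-week window comes from the paragraph above, and your own cutoff may well be different.

```python
from datetime import date, timedelta

# Assumption: a 14-day cutoff, as described above. Pick your own from your own data.
CUTOFF = timedelta(days=14)


def errors_within_cutoff(migration_date, report_dates, cutoff=CUTOFF):
    """Count reports made on or after the migration date and within the cutoff."""
    return sum(
        1
        for reported in report_dates
        if timedelta(0) <= reported - migration_date <= cutoff
    )


# Hypothetical migration date and production error report dates.
migration = date(2009, 9, 1)
reports = [date(2009, 9, 2), date(2009, 9, 10), date(2009, 9, 15), date(2009, 9, 20)]

counted = errors_within_cutoff(migration, reports)
print(counted)  # the report on Sept 20 falls outside the window
```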

The formula is:

(# errors found)/(#tests run)*100

For example, if 50 errors were found and 200 tests were run, you would have:

50/200*100 = 25%
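For anyone who would rather script this than punch it into a calculator, the formula above can be sketched as a small function. The function name is mine, not a standard, and the zero check is an added safeguard.

```python
def error_rate(errors_found, tests_run):
    """Percentage of error: (# errors found) / (# tests run) * 100."""
    if tests_run <= 0:
        raise ValueError("tests_run must be a positive number")
    # Multiply first so whole-number inputs stay free of float rounding noise.
    return 100 * errors_found / tests_run


overall = error_rate(50, 200)
print(f"{overall:.0f}%")  # 25%
```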

So your overall error rate is 25%. Is that good or bad? Neither. It’s a number. If your clients call to complain in droves, die, or your boss demands your head on a platter, I’d say it’s “bad”. If you get a board commendation, an invitation to play golf with the CEO, and your coworkers carry you around the building cheering and chanting your name, I’d say that’s “good”.

But you now have a baseline, which through analysis and interpretation, you’ve determined is good or bad for your own environment.

So let’s slice and dice this data further.

Say you’d like to know, of that overall percentage, what was found in the test environment by your testing team and how much was found by your clients in production.

The formula is the same, but you use the number of errors found in each environment.

For example, say 10 errors were found by your test team in the test environment. 40 errors were found by clients in production. The math looks like this:

10/200*100 = 5% (found by your test team)

40/200*100 = 20% (found by your clients)
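The same split can be sketched in a few lines, using the numbers from the example above. The dictionary keys are just labels I picked.

```python
TESTS_RUN = 200

# Errors found per environment, from the example above.
errors_by_environment = {"test": 10, "production": 40}

# Same formula as before, applied to each environment's error count.
rates = {
    env: 100 * count / TESTS_RUN
    for env, count in errors_by_environment.items()
}

for env, rate in rates.items():
    print(f"{env}: {rate:.0f}%")

# The per-environment rates add back up to the overall 25%.
print(f"overall: {sum(rates.values()):.0f}%")
```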

As a manager, I’d be interested in why our clients were finding more errors than my test team. And this is one of the big benefits of taking metrics. This number would raise questions in my mind and I would go ask them. Metrics are useful for pointing out something unusual and prompting questions that need to be asked. The above situation might not be an issue, or might not be a serious issue. Perhaps your clients are expert and vocal, and the 20% are primarily enhancements they’d like to request. And perhaps it IS a problem with your testing efforts. Regardless, finding answers to questions raised through even the simplest metrics can require a great deal of investigative analysis.

Let’s take these numbers further. Perhaps you’d like to know what the percentage of error is by function. You can do this two ways. If you use the same formula as above, it would be:

(# errors found in X function)/(total # tests run)*100

Notice there is some consistency here. If you’re going to pull metrics, I’d suggest deciding on one way to get the information you need and sticking to it consistently.

There are other ways to determine the error rate of a given function, but the above method tells you the percentage of error for a function when considered for the entire testing effort.

You could also use the following math:

(# errors found in X function)/(total # tests run for X function)*100

This doesn’t tell you how X function fared in comparison with other functions tested during the test effort, but if you’re uninterested in indicators that show you (for example) that you always find more error in function A than function B, then the above formula is fine.
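To make the difference between the two denominators concrete, here is a small sketch. The per-function counts are invented for illustration (they are not from any real project); they simply add up to the 50 errors and 200 tests used earlier.

```python
# Hypothetical counts per function: (errors found, tests run for that function).
functions = {
    "function A": (30, 120),
    "function B": (20, 80),
}

total_tests = sum(tests for _, tests in functions.values())  # 200

for name, (errors, tests) in functions.items():
    # Denominator 1: all tests run in the whole effort.
    share_of_effort = 100 * errors / total_tests
    # Denominator 2: only the tests run against this function.
    within_function = 100 * errors / tests
    print(f"{name}: {share_of_effort:.0f}% of the effort, "
          f"{within_function:.0f}% within the function")

rate_a_effort = 100 * functions["function A"][0] / total_tests
rate_b_effort = 100 * functions["function B"][0] / total_tests
rate_a_own = 100 * functions["function A"][0] / functions["function A"][1]
rate_b_own = 100 * functions["function B"][0] / functions["function B"][1]
```

With these made-up numbers, both functions fail at the same rate against their own tests (25%), yet function A accounts for 15% against the whole effort and function B for 10%. That is why it pays to pick one denominator and stick with it.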

If your questions involve curiosity as to where MOST of your errors were found, environment A, B, or C, you don’t even need to know how many tests were run. You can simply use the number of errors found in this way:

(# errors found in X environment)/(total # of errors across all environments)* 100
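Using the earlier example (10 errors found in test, 40 in production), that calculation can be sketched like so. Note that the number of tests run never appears.

```python
# Errors found per environment, from the earlier example.
errors_by_environment = {"test": 10, "production": 40}

total_errors = sum(errors_by_environment.values())  # 50

# Share of all errors that surfaced in each environment.
shares = {
    env: 100 * count / total_errors
    for env, count in errors_by_environment.items()
}

for env, share in shares.items():
    print(f"{env}: {share:.0f}% of all errors")
```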

Tired of generic math yet? Then let's move on. What can you DO with this stuff? (Now, now – be polite.)

I can only tell you how the companies for which I’ve worked have used them. When I started working in my current company, one of the problems they wanted to solve was “too much error in production”. The business users were complaining. The testing group was viewed as barely competent. The IT group was viewed as inadequate. Every migration was a debacle, with rollbacks, emergency patches, and the like.

So we took a “percentage of error” baseline. The results? Our error rate was over 49%. The business users found more error in production than we did.

So we dug down deeper, and found out why. I won’t go into the many issues that fed this problem, but we put together a game plan and made a proposal to management, using our error percentages as baselines and making suggestions specifically to address “bringing the error rate down”. I'd like to make the point here that one of the reasons our proposal was accepted and action taken was because our initial numbers were accepted, recognized as less than ideal (boy howdy), and were, as just numbers, a non-offensive way to get a point across. One of the reasons executive management likes numbers is the lack of emotionalism involved.

Our error rate today is between 3-4%. We are respected. The IT organization is respected. We haven’t had to roll back or interrupt service for emergency patches. My test team kicks butt.

None of that happened overnight. It took a lot of change, intelligently implemented, over time. We had the cooperation and support of executive management, development, operations, and architecture.

Metrics made those things possible and gave us a goal, “something to shoot for”, and a way to measure our progress. And it did that in a way that was not confrontational, emotional, or accusatory. Numbers are not emotional, confrontational, or accusatory. They’re just numbers. For every migration, we pulled the same metrics, in the same way, to determine if the changes we’d made seemed to make a difference. So we could see if our numbers were trending up or down. Without something like metrics, we'd just be going on our "feelings" and "opinions". There's nothing wrong with feelings and opinions. But in my experience, it's really tough to get funding based on either one.
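Pulling the same metric the same way for every migration might look something like the sketch below. The release names and counts are entirely made up for illustration; they are not our actual figures.

```python
def error_rate(errors_found, tests_run):
    """Percentage of error, same formula every time."""
    return 100 * errors_found / tests_run


# Hypothetical history: (migration label, errors found, tests run).
migrations = [
    ("release 1", 40, 200),
    ("release 2", 36, 240),
    ("release 3", 30, 300),
    ("release 4", 36, 300),
]

trend = []
previous = None
for name, errors, tests in migrations:
    rate = error_rate(errors, tests)
    if previous is None:
        direction = "baseline"
    elif rate < previous:
        direction = "down"
    elif rate > previous:
        direction = "up"
    else:
        direction = "flat"
    trend.append((name, rate, direction))
    previous = rate

for name, rate, direction in trend:
    print(f"{name}: {rate:.1f}% ({direction})")
```

An "up" like the last one is not a verdict. It is a prompt to go ask what changed.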

At this time, our regression test case base has become big enough that pulling the overall error rate is no longer useful for us. Consider that if you have 100,000 tests and a 10% error rate, that’s 10,000 errors. Is 10,000 errors “good”? Probably not. So I’d say that once your regression test case base becomes significant, you’d want to move to pulling metrics on an application or functional basis.

And that, to my mind, is another “rule of metrics”. If they aren’t doing anything useful for you, throw them away. Once you make progress like we did, you don’t need to prove, over and over again, that your error rate is low. I’d focus on an area that needs some improvement or investigation.

And I never get involved in Evil Metrics. From the above numbers, you can extrapolate that it would be easy to determine errors associated with a module or an individual’s work. If asked to do so, I gently refuse. I have refused. Many times. You cannot judge a developer by the number of errors that manifest in their code. They may have double the workload of the rest of the team or the most complex parts of the system to work on. The furthest level of “slicing and dicing” I do is down to a function level; one level above any chunk of code that can be attributed to a single developer.

If you’ve never done this before, why not try it one time? You don’t have to show anyone or use the information, but why not find out where you stand at the moment and get some practice in? Many people want to know what their numbers “mean”, and really, it depends on your company. There are several “industry standards” out there. I can only tell you the Linda Wilkinson industry standard. I’ve handled more than 250 projects, from retail, .com, to banking and aviation. What I’ve found is that if my error rate in production is over 20%, it’s likely that I have something messy on my hands that will cause my end users some pain. If I have less than 10%, the migration is going to “stick”. But those are based on personal experience. All it means is that if I see a number over 20%, I’m going to investigate the whys and the wherefores and see if there’s something we can do to make things better.
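If you want to try a rule of thumb like this on your own numbers, here is a sketch. The thresholds are my personal ones from the paragraph above, not an industry standard. I haven't given a rule for the band between 10% and 20%, so the sketch says so instead of guessing. The function name and return strings are just placeholders.

```python
def triage(production_error_rate):
    """Apply a personal rule of thumb to a production error rate (in percent)."""
    if production_error_rate > 20:
        # Over 20%: likely something messy; go find out why.
        return "investigate"
    if production_error_rate < 10:
        # Under 10%: the migration is likely to stick.
        return "likely to stick"
    # Nothing stated for this band; judge it in your own context.
    return "no rule of thumb"


for rate in (25, 15, 5):
    print(f"{rate}% -> {triage(rate)}")
```

Swap in your own thresholds once you have a baseline of your own.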

And again, I'm well aware that even one error in production, if it's a bad one, can be heinous. But that's NOT WHAT METRICS ARE FOR. Metrics do not really indicate exceptions or individual cases of either goodness or badness. They are merely averages and are useful as overall indicators. If you pull any one single instance of anything, averages do not apply. If you have an extremely small base of tests, like 2, averages will not be of benefit. If you don't have a clue as to how many tests you run, then obviously this set of metrics is useless to you. But that doesn't mean they're useless to everyone or useless in general. They just don't apply to you right now. Someday, they might come in handy, particularly if you need to figure out where you are, establish a baseline, set some goals, or make presentations to executive managers in a way they can understand and accept.

I’m going to talk about defect statistics in my next “Metric of the Month”. When I end this series, I’m going to post a copy of our metrics report, and probably wrap up with how to refute metrics that are bogus. Once you know how to run them, you also need to know how to unravel them. If your career follows the same path as mine, you’ll need both skills….

OK, everyone. Target practice is officially open!!!! Pull!!!!!

8 comments:

Marlena Compton said...

I agree with you that there seems to be a lot of "metrics phobia" floating around.

Your discussion of defects being discovered by clients vs. staff is interesting. This would be interesting data to try visualizing with Parallel Sets.

Rob Lambert said...

Linda,

Really great blog posting. It's always good to see some balance between theory and idealism versus reality.

I too have worked in places where metrics were needed and they helped me lots. Right now I'm not needing them, but as soon as I do, I'll start using them again.

I think the key point you make is that they are just a number. Nothing more, nothing less. How we interpret that number is based on our own needs, perceptions, ideas and aims.

Looking forward to the next blog.

Rob..

SandeepMaher said...

Thanks for the post. You never fail to entertain even while you educate. Let me tell you that I am a numbers man myself and I use them with both pragmatism and care. So relax and keep that gun and shield away.

One of the metrics we capture is the Phase Yield (across Development, Test and Acceptance Tests) which is a percentage of defects (duly weighted by severity) found in a particular phase against total defects across all phases.

Our goal is to find more defects in our tests as opposed to those in Acceptance which is expressed as a percentage of System Test defects against total of System + Acceptance tests defects. I call this 'Quality of Testing' metric.

We also keep a track of defects reported from Production. But I do not understand your suggested use of tests as a denominator since the tests would never be run by the customers.

Joe said...

"So your overall error rate is 25%. Is that good or bad? Neither. It’s a number. If your clients call to complain in droves, die, or your boss demands your head on a platter, I’d say it’s “bad”. If you get a board commendation, an invitation to play golf with the CEO, and your coworkers carry you around the building cheering and chanting your name, I’d say that’s “good”."

Here's where I find fault with this thinking. Trying to ascribe customer complaints, or lack thereof, to a metric is iffy.
My contention is that the "bad" or "good" responses here aren't actually related to the overall error rate. What you really want are "happy clients", and not "a good metric value".

For example, using your formula, if you increase the number of tests you run, your Error Rate could easily decrease, while the number of errors found could remain exactly the same. Would you expect a corresponding increase in golf invitations?

And if a platter is being prepared for your head, what should your response be? Should you tell your team to "find fewer bugs"? That will certainly decrease your error rate.

Thank you for starting this discussion, Linda. And thank you for never getting involved in Evil Metrics. While we may all disagree on which metrics (if any) are un-evil, I am still glad to hear it.

-joe

Linda Wilkinson said...

Joe,

I wasn't trying to ascribe customer complaints to metrics; I was trying to explain that once you have a metric, you have to determine whether that number is "good" or "bad" for your own company. You might have a 2% error rate and your customers might think that's terrible. Or have an 86% error rate, and clients that love you. "Good" and "bad" need analysis. That said, I have found that when our clients are unhappy and have a lot of complaints, our metrics usually indicate we had a problem as well. Metrics are great as generic indicators and bad as specific answers to questions or problems. We use them here as a "jumping off" point to ask questions.

If a platter were being prepared for my head, it wouldn't be because of metrics. It would be because of some problem that might have been highlighted by metrics and later found through analysis to be something I or my team missed. I've never, in my entire career, told my team to find either more or fewer bugs. In fact, I don't know how you extrapolated that kind of scenario from my post. That's really "tampering" with the numbers, and I put it in the same league as Evil Metrics. I believe metrics are good for generating questions, helping with analysis, setting goals, and highlighting or explaining successes or challenges to executive management, but they are non-specific averages and always need analysis.

- Linda

Linda Wilkinson said...

Sandeep, I understand your point about clients not running tests in production. The total tests run is used as a denominator because we are evaluating the efficacy of the testing process overall. We do run several metrics that are more specific, but I'm starting out with basics for this series. I think the formula I gave will help people get started and give them some useful information to kick off their analysis efforts. I'm going to talk about defects for the next "Metric of the Month"; we do something very similar to your process.

Thanks!
- Linda

Joe said...

Rather than diving into a discussion about "correlation versus causation" with regard to Error Rate and Customer Complaints, I'll just say - if it works for your shop, then it's great!

I wasn't suggesting that you or your staff tamper with the numbers. I was only pointing out the math involved in determining the Error Rate, and that I've seen and worked at shops that "play" with factors in order to manipulate their metrics (either intentionally, or unintentionally).

I'm not sure I know how to divide metrics into Evil versus Non-Evil, as I believe metrics aren't inherently evil. Like technology or weapons, all metrics can be used for good or bad.

This is a good discussion, Linda - thanks for tackling it.

I'll look forward to more Metrics of the Month. If I find one that I hadn't heard of and tried before, I'll give it a shot.

-joe