Tuesday, February 25, 2014

How to aggregate ground-truth metrics into a performance index

My remix of a painting by William Blake,
with the Meritology logo added. Get it?
He's shedding light on an impossible shape.
The general problem is this:
How can we measure aggregate performance on an interval or ratio scale index when we have a hodge-podge of ground-truth metrics that vary in precision, relevance, and reliability, and that are incommensurate with each other?
Here's a specific example from the Ten Dimensions:
How can we measure overall Quality of Protection & Controls if our ground-truth metrics include false positive percentages, false negative percentages, numbers of exceptions, various "high-medium-low" ratings, audit results, coverage percentages, and a bunch more?
I've been wrestling with this problem for a long time, both in information security and elsewhere.  So have a lot of other people.  A while back I had an insight that the solution may be to treat it as an inference problem, not a calculation problem (described in this post).  But I didn't work out the method at that time.  Now I have.

In this blog post, I'm introducing a new method.  At least I think it's new because, after much searching, I haven't been able to find any previously published papers. (If you know of any, please contact me or comment to this post.)

The new method is innovative, but I don't think it's much more complicated or mathematically sophisticated than the usual methods (weighted average, etc.).  It does, however, take a change in how you think about metrics, evidence, and aggregate performance.  Even though all the examples below are related to information security, the method is completely general.  It can apply to IT, manufacturing, marketing, R&D, governments, non-profits... any organizational setting where you need to estimate aggregate performance from a collection of disparate ground-truth metrics.

This post is a tutorial and is as non-technical as I can make it.  As such, it is on the long side, but I hope you find it useful.  A later post will take up the technicalities and theoretical issues. (See here for Creative Commons licensing terms.)

Your Goal: Measuring Performance using an Index

You are a manager.  You want to measure performance in some general function and you want the measure to have some analytic precision and credibility -- something better than "Meets Expectations" vs. "Exceeds Expectations".

You have a definition of performance, at least intuitively, and you know whether it is bounded (i.e. has a defined maximum and minimum) or is unbounded (i.e. no upper or lower limits), or a mix.

You want a performance index, which is a predefined numeric scale of performance.  It can be open-ended on one or both sides -- e.g. zero to infinity -- or closed-ended -- e.g. one to ten.  Most important, the intervals of the performance scale must be proportional to the degrees of difference in performance they are meant to represent.  If your scale is 1 = Worst to 10 = Best, then you'd want a change from 3 to 4 to mean the same degree of performance improvement as a change from 8 to 9.  Lastly, you'd want some degree of precision or granularity in your performance index -- just enough to make useful decisions, and no more.  If you can make good decisions using a 1 to 3 scale, or a 1 to 10 scale, then you don't need a 1 to 100 or 1 to 1,000 scale.

The method I'm describing will work for any performance index that fits these specifications.  You can even change your performance index in mid-stream and keep most of the logic.

Your Data: Ground Truth Metrics and Indicators

You have some metrics or indicators related to this general function.  They may be binary -- e.g. pass/fail results from an audit, or presence/absence of some control -- they may be percentages, counts, ratings, scores, or whatever.  They might even be descriptive categories. (An 'indicator' is just a metric that takes on two values, but only one value has significance, i.e. the presence of some condition or state of affairs.)

Right now, you may have no idea how to combine these metrics and indicators to measure aggregate performance. They have varying degrees of quality, precision, accuracy, reliability, relevance, and interdependence. To analyze and communicate these metrics, maybe the best or only tool you have today is some form of metrics dashboard.  (By the way, I think dashboards are dumb most of the time.)

If you don't have ground-truth metrics or indicators, the method below won't help you.  But you don't need a perfect list or complete list.  In fact, you can add or drop metrics and indicators as you learn more.  That's the whole point of this method.

The Usual Method: Weighted Average

Let's say you want to estimate a performance index on a five point scale, from 1 (worst) to 5 (best).  Let's say you have five ground-truth metrics, and assume that you think they are equal in importance relative to overall performance.  Assume that these five metrics are all on the same 1 to 5 scale (because they are ratings, perhaps).  In this very simple case, you can compute the performance index as the simple average of the five metrics, since they are all on the same scale and resolution as the performance index.  If the metric values are: 4, 2, 1, 5, and 3 then the simple average is 3.00.

More commonly, we don't attribute the same significance to all metrics -- some are more important than others.  The way to handle this is using the weighted average (or weighted sum) method.  Using the same metric values as above, here's how a weighted average might compare:
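As a minimal sketch of the arithmetic (the weights below are hypothetical stand-ins, not the demo's actual weights):

# Simple vs. weighted average of five metrics, all on a 1-to-5 scale.
metrics = [4, 2, 1, 5, 3]                 # the example metric values
weights = [0.35, 0.25, 0.20, 0.10, 0.10]  # hypothetical weights; must sum to 1.0

simple_average = sum(metrics) / len(metrics)                     # 3.00
weighted_average = sum(w * m for w, m in zip(weights, metrics))  # 2.90 with these weights

print(f"Simple average:   {simple_average:.2f}")
print(f"Weighted average: {weighted_average:.2f}")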


Now consider the more typical case where the metrics are not on the same scale as the performance index, and not on the same scale as each other.  They might be counts, ratings, boolean values, percentages, or categorical values.  Let's say our metrics have the following scales:
  1. Boolean
  2. Count (zero to infinity)
  3. Percentage
  4. Percentage
  5. Percentage
In this case, each metric needs to be pre-processed to convert its values to the scale of the performance index.  But we can't just create one pre-processing formula for each metric scale type.  Not all metrics measured in percentages mean the same thing relative to aggregate performance.  Therefore, this pre-processing formula needs to include some transformation that maps what is important on the metric scale to what is important in the aggregate.  Here's how this would look with our example metrics:


This pre-processing function is doing two things at once -- translation between scales, and transformation of values to preserve what's important on the given metric scale.  And you need one pre-processing function for every ground-truth metric.
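To make that concrete, here is a rough sketch of what such pre-processing functions could look like for the metric scales listed above.  Every mapping, threshold, and weight below is an illustrative assumption, not a recommendation:

# Illustrative pre-processing functions that map each ground-truth metric
# onto the 1-to-5 performance index scale.  All mappings and thresholds
# here are made up for the sake of the example.

def preprocess_boolean(passed: bool) -> float:
    # e.g. TRUE -> 4, FALSE -> 2 (see the discussion below of how this
    # choice changes the metric's influence on the aggregate)
    return 4.0 if passed else 2.0

def preprocess_count(count: int) -> float:
    # counts run from zero to infinity, so squash them onto 1-5;
    # here, more findings means a lower contribution
    if count == 0:
        return 5.0
    elif count <= 5:
        return 4.0
    elif count <= 20:
        return 3.0
    elif count <= 50:
        return 2.0
    return 1.0

def preprocess_percentage(pct: float) -> float:
    # linear rescaling of 0-100% onto 1-5; a real metric often needs a
    # nonlinear transformation to preserve what matters on its scale
    return 1.0 + 4.0 * (pct / 100.0)

weights = [0.35, 0.25, 0.20, 0.10, 0.10]   # same hypothetical weights as before
converted = [
    preprocess_boolean(True),
    preprocess_count(12),
    preprocess_percentage(80.0),
    preprocess_percentage(95.0),
    preprocess_percentage(60.0),
]
performance_index = sum(w * v for w, v in zip(weights, converted))
print(round(performance_index, 2))   # 3.81 with these illustrative inputs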

Less obvious: the pre-processing functions are interdependent!  They have to be defined using some common rules and conventions, because otherwise one metric would have too much or too little influence on the aggregate performance index, even if they have equal weights.  Take the first metric, which is Boolean.  The pre-processing formula assigns '4' to TRUE and '2' to FALSE.  But if 5 and 1 were chosen instead, then metric #1 would have a more extreme influence on the performance index, either way.  Likewise, you could pick TRUE = 3 and FALSE = 1 to slant the influence in the negative direction, and vice versa with 5 and 3.

Given these complications, and given that these pre-processing formulas are usually buried in spreadsheets and no one maintains common rules and conventions, it's easy to see why some people view these pre-processing functions as fudge factors, and why the resulting performance index can resemble "mystery meat" -- no one really knows what's in it.

When your pre-processing functions become incomprehensible and arbitrary fudge factors,
the resulting performance index can resemble "mystery meat" -- easy to consume but what's in it??

Complications That the Usual Method Can't Handle

The Usual Method (weighted average with pre-processing) has a hard time with many real-world complications in metrics and how they relate to aggregate performance.  In a later post, I'll examine this in more detail, but for now I'll just list some of the complications:
  • Measurement Error or Uncertainty -- uncertainty in the actual value vs. measured value of metric
  • Vagueness -- the metric only loosely measures the underlying phenomena
  • Reliability -- some metrics are more reliable signals than others.
  • Relevance -- the metric may not be equally relevant to the full range of the performance index
  • Logical Dependence -- the interpretation of a metric may depend on other metrics, or whether those other metrics are above or below some threshold
  • Path Dependence (i.e. history matters) -- current interpretation of metrics may depend on past history of that metric, or variability of past values.
  • Plural Interpretations -- a single metric may have several valid and important interpretations.  For example, you may be both concerned about the absolute value of a metric and also whether it is above a threshold (i.e. it functions as a signal or trigger).
  • Non-monotonic -- adding new metrics may force you to reconsider or re-evaluate your previous interpretations of metrics.
  • Bias -- some metrics are prone to bias that goes above and beyond what pre-processing can adjust for.
  • Missing Metrics -- since your weights have to sum to one, you can't leave room for metrics you don't yet have but might add in the future.  They'd mess up the sum because their contribution would be zero.
  • Missing Observations -- the Usual Method doesn't have a standard or reliable way to handle missing observations of existing metrics.
  • Changes in Measurement Method or Instrumentation
This isn't even a complete list, but even so, it's obvious that the Usual Method has trouble coping with any one of these complications, and it pretty much crumbles if you try to account for more than a few.

At this point, most people will give up.  Now, there's a new method.

Introducing the "Thomas Scoring System" (TSS)

Forgive me if I do a little self-promotion, but I'm giving it a formal name (why? see the Coda at the end):

If you use the methods described in this post, please include the logo above and also this:

Creative Commons License
Thomas Scoring System by Russell Thomas
is licensed under a
  Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at
http://exploringpossibilityspace.blogspot.com/2014/02/thomas-scoring-system.html

The Essence: Inference, Not Calculation

Here's the Thomas Scoring System (TSS) in a nutshell: treat each metric and its values as evidence that weighs for or against specific values in the performance index.  This is an inference process, not an arithmetic calculation as in the Usual Method.  The output of TSS is an estimate of the Weight of Evidence for all index values.  The Weight of Evidence can be expressed as a probability distribution, where all the evidence weights sum to one.  (There are other ways of representing it, but we'll leave that for later.)

The Linchpin: Your Performance Hypothesis

The inference rules that establish the relationship between the evidence (metrics) and the Weight of Evidence output are, collectively, called your performance hypothesis.  This is nothing more than a formal statement of everything you know about how these metrics depend on each other and how they come together to inform your estimate of the performance index.

Very important: the TSS does not provide you with a performance hypothesis the way the Usual Method does (implicitly, through its weights and pre-processing formulas).  You need to think about this on your own, and you need to test it continually.  However, as more and more people use the TSS, people will start sharing and learning from each other's performance hypotheses.  Some Body of Knowledge may develop in professional communities.  Furthermore, performance hypotheses are good targets for machine learning on Big Data -- both empirical and simulated metrics and results.  They could even be refined using "wisdom of the crowds" methods.  More on this in later posts.

Estimating Weight of Evidence

While the performance index itself is not the product of an arithmetic calculation, you can use logic and arithmetic to calculate the distribution of Weight of Evidence.  For most purposes, I think the following formula will apply:
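In one plausible notation (reconstructed from the definitions that follow), the Weight of Evidence for a score value s is:

    W(s) \;\propto\; \sum_{c \,\in\, \mathrm{Conditions}} \mathbf{1}\big[\mathrm{Condition}_c \text{ is TRUE}\big] \times \mathrm{Relevance}_c(s) \times \mathrm{Significance}_c(s)

with the raw totals then normalized so that the weights across all score values sum to one.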
The Logical Condition is any set of logical relations among metric conditions that, when TRUE, means that this condition provides some evidentiary support for that particular index value (a.k.a. score value).  Relevance is a number, or a function that returns a number, on some standard scale of relevance.  In the demo below, the relevance scale is -1 to +1, with '+1' meaning fully relevant with positive implications, '-1' meaning fully relevant with negative implications, and '0' meaning not relevant.  Significance is the conditional weighting factor, given that the logical condition is true and relevance is not zero.
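To make the moving parts concrete, here is a minimal sketch in code.  The metric names, the relevance and significance numbers, and the floor-at-zero normalization are all illustrative assumptions on my part; the demo spreadsheet below is the working reference.

# A minimal sketch of the Weight of Evidence calculation.  The data
# structures and the normalization step are illustrative assumptions,
# not a transcription of the demo spreadsheet.

SCORES = [1, 2, 3, 4, 5]   # the performance index values

# Each condition: a logical test over the metric values, plus per-score
# relevance (-1..+1) and significance (conditional weight) values.
conditions = [
    {
        "test": lambda m: m.get("pass_audit") == "yes",
        "relevance": {1: 0.0, 2: 1.0, 3: 0.8, 4: 0.5, 5: 0.3},
        "significance": {s: 0.2 for s in SCORES},
    },
    {
        "test": lambda m: m.get("pass_audit") == "no",
        "relevance": {1: 1.0, 2: 0.8, 3: 0.2, 4: -0.8, 5: -1.0},
        "significance": {s: 0.2 for s in SCORES},
    },
    # ... one entry per logical condition in the performance hypothesis
]

def weight_of_evidence(metrics: dict) -> dict:
    """Return a normalized Weight of Evidence distribution over SCORES."""
    raw = {}
    for s in SCORES:
        total = 0.0
        for c in conditions:
            if c["test"](metrics):                 # is the logical condition TRUE?
                total += c["relevance"][s] * c["significance"][s]
        raw[s] = max(total, 0.0)   # floor at zero (an assumption), so negative
                                   # relevance discounts but never goes below nothing
    z = sum(raw.values())
    if z == 0.0:                   # no evidence at all: spread the weight evenly
        return {s: 1.0 / len(SCORES) for s in SCORES}
    return {s: v / z for s, v in raw.items()}

print(weight_of_evidence({"pass_audit": "no"}))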

Before we go into a simple but practical example, let's compare the TSS to the Usual Method.

A Simplistic Example: Replicating the Usual Method

Going back to the example I started with in the section "The Usual Method: Weighted Average", above, I'll show how the Usual Method can be replicated using the TSS.  Note: I'm doing this just to help explain the TSS and to show how it is both similar to and different from the Usual Method.

Remember, all five metrics have the same scale (1 to 5) and the same interpretation, so I'll explain using just the first metric.  Relevance and significance are the same for all, so the focus is on the logical conditions.

Here's the key question: If Metric #1 has a value of '5', what evidence does this provide for the alternative values of the performance index: 1, 2, 3, 4, or 5?  Answer: It could provide support for any of those five index values, depending on what the other metric values are.

If Metric #1 = 5 and all the other metrics = 5 also, then this evidence strongly supports the performance index value of '5'.  But if all the other metrics = 4, there is still some support for the performance index value of '5' -- just not as much as support for '4'.

The evidence "Metric #1 = 5" can even support a performance index value of '1' if (somehow) all the other metrics = 0 -- i.e. maybe they are missing observations.

Extending this thinking, it is possible to set up a logic table that relates the metric values (i.e. "conditions") to each other, and to the relevant values in the performance index.  In this way, the TSS method can perfectly replicate the Usual Method.  (Of course, the Usual Method is simpler and easier in this case, so it would be preferred.)
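One way to see the replication (my own illustration, not a table from the demo): give each metric one condition per possible value, let each condition support only the matching index value, and set its significance to the metric's weight.  The weighted mean of the resulting Weight of Evidence distribution then equals the Usual Method's weighted average.

# Sketch: replicating the weighted average with a TSS-style logic table.
# Each condition "metric i equals v" supports only index value v, with
# significance equal to that metric's weight.
SCORES = [1, 2, 3, 4, 5]
metrics = [4, 2, 1, 5, 3]
weights = [0.2, 0.2, 0.2, 0.2, 0.2]     # equal weights -> simple average

woe = {s: 0.0 for s in SCORES}
for value, weight in zip(metrics, weights):
    woe[value] += weight                # evidence lands on the matching score value

weighted_mean = sum(s * w for s, w in woe.items())
print(woe)            # {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}
print(weighted_mean)  # 3.0 -- the same as the simple average of the metrics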

But, what matters most is that the TSS can go far beyond the Usual Method in handling many or most of the complications mentioned above, and can do so in a way that is much more manageable and understandable.

A Simple Practical Example

First, download this spreadsheet (No macros or VB. It should run fine on Windows, Mac, and Open Office.)  The screen shots below are from this demo spreadsheet, but you'll learn more by playing with it and studying the cell formulas.  The spreadsheet has instructions and also some explanations.

In this example, I'll walk you through a hypothetical case where we want to use these five metrics to estimate a performance index on a five-point scale:


I've purposely chosen metrics that don't obviously "fit together" quantitatively, and also that might have questionable value to estimate aggregate performance.  (InfoSec pros would certainly argue over some of these.) In other words, this is a messy metrics list that is fairly typical for most organizations getting started.

Demo Performance Hypothesis

The first step in defining your performance hypothesis is defining the logical conditions you care about for each metric.  You can have as many or as few conditions as you want, and there is no need for them to be consistent.  Here's a table listing all the conditions in the demo:

The first column lists the condition number, and the number in parentheses is the metric number it applies to.  (I could just as easily define conditions that apply to more than one metric, e.g. "if pass audit = no AND critical vulns > 0".)  Notice that I'm testing for two things: the "Evidence" associated with this condition -- is the logical condition TRUE or FALSE given the current metric values -- and "Absent", which is TRUE if the metric value is blank, and FALSE otherwise.  This gives us the ability to explicitly account for missing observations or missing metrics.
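As a sketch of those two tests in code (the metric names and the multi-metric example follow the narrative above; the encoding itself is my own illustration):

# Each condition carries two tests: "Evidence" (is the logical condition
# TRUE given the current metric values?) and "Absent" (is the underlying
# metric value blank?).

def condition_1(m):
    """Condition #1 (metric #1): pass audit? == 'yes'."""
    evidence = m.get("pass_audit") == "yes"
    absent = m.get("pass_audit") in (None, "")
    return evidence, absent

def condition_multi(m):
    """A condition spanning two metrics: failed audit AND open critical vulns."""
    evidence = m.get("pass_audit") == "no" and m.get("critical_vulns", 0) > 0
    absent = m.get("pass_audit") in (None, "") or m.get("critical_vulns") is None
    return evidence, absent

print(condition_1({"pass_audit": "yes"}))     # (True, False)
print(condition_multi({"pass_audit": "no"}))  # (False, True) -- the vuln count is missing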

The second step is to define the logical relations that apply to each performance index value.  In the demo spreadsheet, I do this with a table with conditions as columns and index (score) values as rows.  Every cell in the table defines a logical relation between a single condition and a single score value.  In the demo, I use very simple logic:  Is the condition TRUE for this score value, or is it FALSE?

I'll walk through two conditions to make this clear.  (The column headers are Condition Numbers, and row headers are score values, a.k.a. index values).

Condition #1 is TRUE only when the "pass audit?" metric is "yes".  In my performance hypothesis, I believe that passing an audit is not very informative if I'm above a minimal level of performance.  Therefore, I expect condition #1 to be FALSE when the performance index value = 1, and TRUE for all the other values.  My relevance scores decline as the performance index goes up, indicating that this condition is most informative for lower scores and least for higher scores.

Condition #2 is TRUE only when the "pass audit?" metric is "no".  In my performance hypothesis, this has information value different than condition #1 because failing an audit provides more information than passing it.  Important: My relevance scores are negative for score values '4' and '5', meaning that failing an audit should take away from or discount other evidence in support of score values '4' and '5'.

Relevance allows you to account for vagueness, uncertainty, or imprecision in any metric. The more vague or uncertain the metric, the more conditions that will apply to any given performance index value.

The third and final step of the performance hypothesis is the Significance weight assigned to each metric and all of its conditions, or alternatively to each individual condition.  As a set of weights for each score value, they need to sum to one.  Here are the "user-entered" weights that represent my performance hypothesis in the demo.
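Whatever weights you enter, the sum-to-one requirement is worth checking automatically.  A small sketch, using a placeholder condition count and the "even" scheme that the demo spreadsheet uses as its default:

# Sanity check: for every score value, the significance weights across
# all conditions should sum to one.  The condition count is a placeholder.
SCORES = [1, 2, 3, 4, 5]
N_CONDITIONS = 12   # hypothetical number of conditions in the hypothesis

# "even" weighting: every condition gets the same weight for every score.
even = {c: {s: 1.0 / N_CONDITIONS for s in SCORES} for c in range(N_CONDITIONS)}

def check_weights(significance):
    for s in SCORES:
        total = sum(per_condition[s] for per_condition in significance.values())
        assert abs(total - 1.0) < 1e-9, f"score {s}: weights sum to {total}"

check_weights(even)   # passes; a "user" weight table should pass the same check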



Now I'll show you a series of scenarios for metric values and how they yield informative estimates of the performance index.

No Data, No Clarity

First, consider what the Weight of Evidence looks like when we have no values for any metrics.  This is a state of ignorance where we have "Known Unknowns" but not much else. Here is the output:

No metrics, no clarity.

Notice that the Weight of Evidence is equally spread over all score values.  All the information for Weight of Evidence is contained in this graph.  However, it's sometimes useful to summarize or evaluate the distribution, and you'll see several statistics in the upper right.  These are optional statistics. Use them only if they help you and take them for what they are worth -- partial views on the distribution. The first two are self-explanatory, but the others merit explicit definitions.

  • Clarity is a measure of how much mass is under a single value (Clarity = 1.0) compared to mass distributed across all the other values.  If all the Weight of Evidence is equally distributed, then Clarity = 0.0.
  • Ambiguity is a measure of whether the mass is distributed to several widely separated values ("peaks").  If all the mass is under a single value, then Ambiguity = 0.0.  If the mass is equally divided between two values far apart (e.g. 1 and 5) with no mass on other values, then Ambiguity = 1.0.  (A rough sketch of one way to compute these two statistics follows this list.)
  • Most reasonable score is the result of a formula that tries to account for the effects of the other statistics: "Score with most weight", "Weighted mean score", "Clarity", and "Ambiguity".  It is one way of answering the question: "Given this evidence, what one score value is the most reasonable or justifiable?"  If both Clarity and Ambiguity are zero, then it returns the answer "none".
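For the curious, here is one plausible implementation of Clarity and Ambiguity, consistent with the endpoint behavior just described.  The spreadsheet's own formulas may well differ, and the "Most reasonable score" formula is not reproduced here.

# One plausible reading of the Clarity and Ambiguity statistics.  These
# formulas only promise to match the endpoints described above; they are
# not taken from the demo spreadsheet.
SCORES = [1, 2, 3, 4, 5]

def clarity(woe: dict) -> float:
    """1.0 when all the mass sits on one score, 0.0 when mass is uniform."""
    n = len(SCORES)
    peak = max(woe[s] for s in SCORES)
    return (peak - 1.0 / n) / (1.0 - 1.0 / n)

def ambiguity(woe: dict) -> float:
    """0.0 when all the mass sits on one score; 1.0 when mass is split
    evenly between the two extreme scores (e.g. 1 and 5)."""
    spread = sum(woe[a] * woe[b] * abs(a - b) for a in SCORES for b in SCORES)
    max_spread = (max(SCORES) - min(SCORES)) / 2.0   # attained by the 50/50 extreme split
    return spread / max_spread

uniform = {s: 0.2 for s in SCORES}
polarized = {1: 0.5, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.5}
print(clarity(uniform), ambiguity(polarized))   # 0.0 1.0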

Summary: if we have no evidence from metrics then we have no clarity.  Any performance index value is just as reasonable as any other.  (FYI, this is consistent with subjectivist probability theories of evidence, though, strictly speaking, I'm not using the math of subjectivist probability theory.)  This conclusion is strictly a sanity check for the method: in the absence of any evidence, it yields an output that honestly reflects that ignorance.

Best Case, Clearly

Now consider what the Weight of Evidence looks like if all the metrics have values that are maximally positive, given the conditions.

Best case, clearly.
Notice that, even with maximal metric values, not all the Weight of Evidence is on the highest score value of '5'.  Why? Because my performance hypothesis does not include conditions that specifically support the value '5' while specifically excluding the other values.  This is reflected in the Clarity statistic, which at 0.34 is higher than 0.0, but far from unequivocal.   Therefore, given what we can divine from the distribution of Weight of Evidence, the most prudent/reasonable score is about 4.1.

(This output uses the default Significance weights called "even".  In your copy of the spreadsheet, change this to "user" and see what happens.  Clarity increases to 0.43, with more weight on '5'.)

Also notice that this result informs us both about the metric values and also about the inference power of my performance hypothesis.  This is information that is not available in the Usual Method.  Here's the implication: if you don't like the inference power of your performance hypothesis, then reexamine your metrics, your logical conditions, and your relevance and significance scores.  You may have too many metrics that overlap or are too vague, or you may not understand enough about what conditions apply to various performance index values.

Worst Case, Clearly

Here is the opposite of the previous case.  All the metric values trigger the worst conditions.

Worst case, clearly.
Notice that Clarity = 0.53 (with Significance weights = "user"), which is higher than anything produced under best case conditions. This tells us that my performance hypothesis is better (stronger) at identifying low performing cases than high performing cases.  How do I feel about that?  How do you? What would you do to improve/revise the performance hypothesis? (food for thought)

Also notice that there are no combinations of metrics that yield a "Most reasonable score" of '5.0' or '1.0', even with rounding.  What this tells me is that I don't have enough evidence (i.e. conditions) in my performance hypothesis that clearly identify and support these extreme cases.  This points me in the direction of either expanding my portfolio of metrics, or pruning it (because of redundancy), or adding more logical conditions that apply to the extremes.

Regardless of the details, notice how this points you in the direction of what you need to learn next.  That's what's really valuable about this new method, a.k.a. the Thomas Scoring System (TSS).

Typical Case, Clearly

Now consider what the Weight of Evidence looks like when all the metrics are in their middle, expected values.
Typical case, clearly.
Notice that the Clarity is relatively high -- the same as in the Worst Case.  This tells us that the performance hypothesis can distinguish a middle range case just as well as a worst case, and better than a best case.

Now on to some weird cases.

Mixed Case, With Serious Muddiness

Consider the case where the metric values are opposing -- some very good and some very bad.  Compared to the first case where you have no information, you have a lot of information in this case.  But look at the results:

Mixed case, with muddy results. Notice the wiggly lines after the '3'.
The measures of central tendency (i.e. mean and mode) aren't very well defined.  It's a broad peak rather than a sharp peak.  The Clarity metric is 0.10, which isn't far above the value of 0.0 that signifies complete ignorance.

How should you interpret these results?  Two ways.  First, you should look hard at your security program, including people, processes, and technologies.  Look for contradicting forces -- things that pull in opposite directions.  For example, well designed controls that are poorly implemented, or the reverse.  Second, you should look hard at your metrics portfolio.  Maybe you aren't measuring the right things, in the sense of what will be most informative regarding aggregate performance.  You might find that you have a significant number of metrics that really aren't very informative and just muddy the water.  Get rid of them and replace them with metrics that have more meaning.

Conflicting Case... WTF?

Finally, consider the case where you have metrics which point strongly in opposite directions.

Conflicting case.  Whiskey Tango Foxtrot? Can a score be both less than 2 and greater than 4?
This won't be true for all performance hypotheses, but for many there will be cases where the Weight of Evidence is polarized into two (or more) peaks that are far apart.  Call these 'conflicting' or 'ambiguous' cases because they can't be interpreted with simple averages or 'typical values'.  In fact, averaging is about the worst summary operation you can perform in this circumstance, because it obliterates the most important information in the Weight of Evidence.  By analogy, if you have one arm in hot coals and the other arm embedded in an ice block, it would be stupid to say, "On average, my temperature feels pretty good."

How should you interpret these results?  Again, look hard at your metrics portfolio and the logical conditions you've defined.  Maybe you've gone overboard to emphasize the extremes at the expense of the middle values.  You should also look at your security program -- people, process, and technology -- for signs of pathological conflict and dysfunction.  If your organization is both great and terrible at the same time, then something is holding that tension, and that 'something' might break unexpectedly and catastrophically.

Summary and Next Steps

I hope you've found this tutorial useful and enlightening.  More than anything, I hope you find value in applying and extending the Thomas Scoring System (TSS) in your own organization.  I hope many people can now create performance scorecards where before they only had a loose grab-bag of metrics, dashboarded or not.  I think this is a major breakthrough, but only experience and time will tell.

Let me know what you think, and if you have some lessons from experience applying the TSS, please let me know.

Later posts will deal with the theoretical/conceptual/academic aspects of this. In those posts, I'll talk about antecedents and related methods and ideas.  Also, I'll talk about more programmatic ways of doing the TSS (including R and other spiffy tools).  I chose Excel for this introductory post because it is both adequate and it's the entry-level modeling tool for most people.

Last, if there are flaws in what I've presented -- major or minor -- please let me know.


Coda: About the Name

Yes, it's pretty cheeky of me to put my last name on this method.  Shameless self-promotion, you might think.  Yes it is.  Here's my justification.  I'm sharing everything about this method.  It's licensed to one and all with Creative Commons Attribution-ShareAlike.  Nothing will be held back as proprietary, trade secret, or protected by patent.  I don't expect any direct compensation.  But, by putting my name in the title and hoping you and others will use it and pass it along, I'm hoping to get a little boost in my personal brand value.  It's a common business model in this age: give stuff away for free and self-promote at the same time.  Since, in essence, we are all self-employed, we all need a personal business model.

Of course, this naming convention is voluntary and will only work if it becomes socially accepted.  Let me know if you are for it or if you are strongly against it.

(I was tempted to call it "Thomas Scoring Algorithm" but that three-letter abbreviation is already taken, ☺)
