Exploring Possibility Space: Dimension 9: Optimize Total Cost of Risk

Monday, July 8, 2013

Dimension 9: Optimize Total Cost of Risk

This is the ninth post defining each of the Ten Dimensions of Cyber Security Performance. Like Dimension 8 (Effective Agility & Learning), this dimension applies to the performance of the cyber security program as a whole, and therefore I've located it to the side of the other blocks, as if it were at a higher level.

The dimension of Optimize Total Cost of Risk includes all of the processes that assess and manage cyber security at an enterprise level, specifically in terms of resources (financial and non-financial), liability (potential and realized), risk mitigation (including insurance), and also balancing these against the upside of taking risks (i.e. the benefits of exposing and using information and systems). Essentially, this is the financial side of risk management at the enterprise level.

Most existing cyber security frameworks either omit this dimension or treat it merely as the aggregation of risk estimates for every possible attack on every asset, and every possible consequence.

In contrast, the approach I'm recommending for this performance dimension starts at the enterprise level and estimates costs associated with cyber security as a whole, both the costs of defenses and the costs of loss events. It adapts the loss distribution approach (LDA) for Operational Risk (OR) within Enterprise Risk Management (ERM) that has been pioneered and refined in the financial services industry. This is a fairly complex topic on its own, so I won't attempt to give a full exposition here. References with further detail are listed below.

In order to make rational resource allocation and prioritization decisions at the enterprise level, it's very useful to have a financial measure of cyber security performance and results. But it's also essential to balance these costs against the flow of benefits, also expressed in financial measures. Unfortunately, this is not a straightforward process using available accounting data, even in relatively small organizations. This is a nascent area of research with few examples of success in industry. Even so, the shape of a viable method is now coming into focus. I'll lay it out in tutorial fashion with the aim of getting across the main ideas.

Start by defining "total costs"

The starting place is some measure of total costs for any given period for a given organization. This is exactly analogous to the notion of "total cost of quality" that was promoted in the early days of Total Quality Management. Total costs of cyber security would include direct costs such as staff, equipment, and training and awareness, as well as indirect costs such as help desk load, compliance audits and reporting, and the costs of unintended consequences. But it would also need to include the costs of incidents and breaches, both direct and indirect.

Of course, it's a matter of art as much as science as to how to define total costs. Thankfully, we aren't aiming to achieve accounting accuracy. Instead, we are aiming to have a measure that effectively supports decision-making in a financial context. Therefore, the rules that any organization uses to define total costs should be judged on pragmatic grounds. For example, the cost categories that are the largest and most significant to results should get the most attention and sophistication. Smaller cost categories can be estimated more roughly. As long as the resulting measure is reasonably broad and inclusive (i.e. embracing all 10 performance dimensions), and also internally consistent, then the estimate of total costs is probably good enough.
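To make the idea concrete, here is a minimal sketch of a total-cost measure for one quarter. The category names follow the list above, but every dollar amount is invented for illustration, and the 15% share threshold for deciding which categories deserve a detailed estimate is my own assumption, not an established rule.

```python
# Illustrative only: category names follow the text, dollar amounts are invented.
quarterly_costs = {
    "staff and equipment": 900_000,
    "training and awareness": 60_000,
    "help desk load": 120_000,
    "compliance audits and reporting": 150_000,
    "incidents and breaches (direct)": 400_000,
    "incidents and breaches (indirect)": 250_000,
}


def total_cost(costs):
    """Total cost of cyber security for one period (sum of all categories)."""
    return sum(costs.values())


def needs_detailed_estimate(costs, share_threshold=0.15):
    """Categories whose share of the total meets a (hypothetical) threshold.

    These are the categories where extra estimating effort pays off;
    the rest can be estimated more roughly.
    """
    total = total_cost(costs)
    return [name for name, amount in costs.items()
            if amount / total >= share_threshold]


print(total_cost(quarterly_costs))
print(needs_detailed_estimate(quarterly_costs))
```

The point of the sketch is only that the measure is a simple, internally consistent sum, and that estimating effort can be steered toward the few categories that dominate it.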

Then estimate the probability distribution of total costs (easier said than done!)

Once you have a measure of total costs at an enterprise level, you can imagine how that measure would vary quarter to quarter or year to year. If in your imagination you ran the clock forward far enough (e.g. decades), you'd accumulate a frequency distribution of quarterly costs that would be a decent approximation of the probability distribution of total costs, assuming it was stable that whole time.
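As a small illustration of turning a cost history into a frequency distribution, the sketch below bins a list of quarterly totals and reports the fraction of quarters in each bin. The twelve quarterly figures (in $ millions) are made up, and the bin width is an arbitrary choice.

```python
from collections import Counter


def frequency_distribution(costs, bin_width):
    """Map each bin's lower edge to the fraction of periods that fall in it."""
    bins = Counter(int(cost // bin_width) * bin_width for cost in costs)
    n_periods = len(costs)
    return {edge: count / n_periods for edge, count in sorted(bins.items())}


# Hypothetical quarterly total costs, in $ millions (three years of history).
quarterly_costs = [1.2, 1.4, 1.3, 1.1, 1.6, 1.3, 2.4, 1.2, 1.5, 1.3, 4.8, 1.4]

dist = frequency_distribution(quarterly_costs, bin_width=0.5)
for lower_edge, fraction in dist.items():
    print(f"{lower_edge:>4.1f}+  {fraction:.2f}")
```

With only a dozen quarters the result is a crude approximation, which is exactly why the thought experiment above asks you to imagine decades of observations.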

To the naysayers who say we can never reliably estimate the cost of cyber security risk at an enterprise level, I offer the counterargument that the historical frequency distribution of cyber security costs (if measured) appears to be fairly stable, and we don't see any evidence of frequent "blow ups" where unanticipated costs of a catastrophic level destroy firms.

I also argue that estimating the probability distribution of costs at an enterprise level is more feasible, and the result more stable, than at the micro level. At the level of individual assets, individual controls, and individual attacks, the specific parameters of any estimate will depend on the particular state of both attackers and defenders, on the state of evolution of each, and on their tactical and strategic intents. But these factors tend to average out as you aggregate them to the level of an enterprise, where the probability distribution of costs will be driven by broad, general changes in the landscape of attack and defense.
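The averaging-out effect can be illustrated with a toy simulation. This is only a sketch under a strong assumption that is mine, not the post's: each micro-level cost is an independent draw from the same lognormal distribution. Under that assumption, the relative variability (coefficient of variation) of the aggregate shrinks as more components are summed.

```python
import random
import statistics


def coefficient_of_variation(samples):
    """Standard deviation relative to the mean."""
    return statistics.pstdev(samples) / statistics.fmean(samples)


def simulate_aggregate(n_components, n_periods, seed=0):
    """Per-period total of n_components independent lognormal costs."""
    rng = random.Random(seed)
    return [
        sum(rng.lognormvariate(0.0, 1.0) for _ in range(n_components))
        for _ in range(n_periods)
    ]


single_asset = simulate_aggregate(n_components=1, n_periods=2000)
enterprise = simulate_aggregate(n_components=100, n_periods=2000)

print(coefficient_of_variation(single_asset))
print(coefficient_of_variation(enterprise))
```

Shocks that hit many components at once would not average out this way, which is consistent with the claim that the enterprise-level distribution is driven mainly by broad changes in the landscape.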

Of course, we can't count on a stable probability distribution, so instead of estimating it purely from empirical frequencies it will be essential to estimate it through various forward-looking methods, including causal analysis, simulation, large scale data analysis, and so on.
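One common forward-looking method in the loss distribution approach is a frequency-and-severity Monte Carlo simulation. The sketch below is a minimal version under assumptions of my own: incidents per quarter follow a Poisson distribution, each incident's cost is lognormal, and the baseline cost and all parameter values are invented. A real model would calibrate these from data and expert judgment.

```python
import math
import random
import statistics


def sample_poisson(rng, lam):
    """Draw a Poisson count using Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_total_costs(n_quarters, seed=42, baseline=1.0,
                         incident_rate=2.0, severity_mu=-2.0,
                         severity_sigma=1.5):
    """Simulated quarterly total costs in $ millions (hypothetical parameters).

    Each quarter's cost is a fixed baseline plus the sum of a random
    number of lognormally distributed incident losses.
    """
    rng = random.Random(seed)
    costs = []
    for _ in range(n_quarters):
        n_incidents = sample_poisson(rng, incident_rate)
        losses = sum(rng.lognormvariate(severity_mu, severity_sigma)
                     for _ in range(n_incidents))
        costs.append(baseline + losses)
    return costs


costs = simulate_total_costs(10_000)
print("mean:  ", statistics.fmean(costs))
print("median:", statistics.median(costs))
print("max:   ", max(costs))
```

The mean sits above the median because a small number of simulated quarters contain very large losses.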

Then partition the cost distribution into three sections

Here's where we dip into some topics in finance theory. To keep it short, I'm skipping all the formalisms, foundations, justifications, and details of methods. Interested readers can get more from the resources listed below.

First, notice that this probability distribution is not a symmetric, bell-shaped Gaussian, but a skewed distribution with a long "tail" that expresses the small probability of large, very large, and very very large loss events. This makes it non-trivial to convert the distribution into a single financial metric, or even a small number of them. One method you may have heard of in financial services is called "Value at Risk". It has pros and cons, but it's not the best approach for our purposes.
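One commonly cited limitation of Value at Risk is that it is a single quantile and says nothing about how bad losses are beyond that quantile. The sketch below uses two invented loss samples that share the same 95% VaR but have very different tails. The function names and the simple empirical quantile rule are my own choices for illustration.

```python
import math


def value_at_risk(losses, percent):
    """Empirical VaR: the smallest loss that covers `percent` of outcomes."""
    ordered = sorted(losses)
    index = math.ceil(percent * len(ordered) / 100) - 1
    return ordered[index]


def mean_beyond_var(losses, percent):
    """Average of the losses above the VaR cutoff.

    Assumes percent < 100 so that the tail is non-empty.
    """
    ordered = sorted(losses)
    cutoff = math.ceil(percent * len(ordered) / 100)
    tail = ordered[cutoff:]
    return sum(tail) / len(tail)


# Two made-up samples of 100 outcomes each, in arbitrary cost units.
mild_tail = [1.0] * 95 + [10.0] * 5
heavy_tail = [1.0] * 95 + [1000.0] * 5

print(value_at_risk(mild_tail, 95), value_at_risk(heavy_tail, 95))
print(mean_beyond_var(mild_tail, 95), mean_beyond_var(heavy_tail, 95))
```

Both samples report the same VaR, yet the second carries a hundred times more cost in its tail, which is the part of the distribution we most need to price.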

What we need is a way to create a combined cost estimate that includes both the most frequent cost outcomes and a "discounted" cost estimate of the low probability/high impact events.

One method is to create a composite metric that is the sum of metrics for three sections (quantile ranges) of the distribution, which together cover the full range of possible outcomes (the "support" of the distribution).

Starting from the left, the first section is called "budgeted" because rational managers would set their quarterly budgets to pay for the most likely cost outcomes (measured by either the mean or mode of the distribution, plus some extra margin). I believe that nearly all organizations can reliably set quarterly budgets for total costs (as defined above), and this provides on-the-face evidence of the feasibility of reliably estimating the "budgeted" costs.

Moving to the right, the yellow region is a section called "self-insurance". These are the cost outcomes that are worse than what is budgeted normally, but less than catastrophic and not worst-case. I call it "self-insurance" because I believe that the best way to price this risk is to imagine setting up a self-insurance fund that is actuarially sound. This requires quite a bit of effort to estimate reliably and will benefit from triangulating using multiple methods and multiple sources of evidence.

The final region is called "catastrophic" because, if such a loss event were to occur, it might be catastrophic for the focal organization. It's open-ended because it may not be possible to bound the worst-case loss with a non-zero probability. This is the region of loss distributions that causes the most problems in risk valuation models, especially if you care most about systemic and cascading risk. However, we are most concerned with risk valuation (i.e. putting a price on it). So what is the best way to value this catastrophic risk? I suggest that it is best measured by the willingness of a firm to invest ex ante in business continuity. This could include backup resources, data stored in vaults, redundant and pre-provisioned resources, lines of credit, and gold buried in the back yard.

Therefore, the total cost of cyber security is the sum of budgeted costs, self-insurance costs, and business continuity costs.
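The composite can be sketched in a few lines. Several things here are assumptions of mine rather than part of the method as stated: the budget and the catastrophic threshold are supplied by the caller, the self-insurance cost is approximated as the average loss falling in the layer between those two thresholds (a pure premium with no loading), and the business continuity figure is simply what the firm has chosen to spend. The sample costs are invented.

```python
import statistics


def total_cost_of_risk(simulated_costs, budget, catastrophic_threshold,
                       continuity_spend):
    """Sum of budgeted, self-insurance, and business continuity costs.

    simulated_costs: per-period total costs (historical or simulated).
    budget: the budgeted amount per period (e.g. mean plus a margin).
    catastrophic_threshold: where the self-insurance layer ends.
    continuity_spend: ex ante business continuity investment per period.
    """
    layer_width = catastrophic_threshold - budget
    # Loss in the self-insurance layer: the excess over budget, capped
    # at the layer width so catastrophic amounts are not double counted.
    layer_losses = [min(max(cost - budget, 0.0), layer_width)
                    for cost in simulated_costs]
    self_insurance = statistics.fmean(layer_losses)
    return {
        "budgeted": budget,
        "self_insurance": self_insurance,
        "business_continuity": continuity_spend,
        "total": budget + self_insurance + continuity_spend,
    }


# Hypothetical per-quarter costs in $ millions, including one extreme outcome.
sample_costs = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 5.0, 10.0, 100.0]
result = total_cost_of_risk(sample_costs, budget=2.0,
                            catastrophic_threshold=10.0,
                            continuity_spend=0.5)
print(result)
```

Note that the extreme outcome of 100 contributes only the capped layer amount to the self-insurance figure. Its remaining cost is represented by the business continuity investment, not by an expected-value calculation.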

How to use it

There are two main uses of this measure. The first use case is to decompose it into the factors that drive risk, and then apply resources and attention to mitigate the largest risk drivers. The method of cost drivers is well established in Activity Based Costing. In many risk management frameworks there is a concept of risk drivers and the related idea of Key Risk Indicators (KRIs). The only difference here is that we are interested in risk drivers within this broad Ten Dimensional framework. By the way, I think this use case applies well even if an organization is only starting to define and measure total costs and has not yet estimated the probability distribution with any confidence.
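For the first use case, a simple starting point is a Pareto-style ranking of risk drivers by the cost attributed to each. The driver names, the attributed amounts, and the 80% coverage target below are all hypothetical. How to attribute costs to drivers in the first place is the hard part and is not shown.

```python
def top_risk_drivers(driver_costs, coverage_percent=80):
    """Largest drivers that together account for coverage_percent of cost."""
    total = sum(driver_costs.values())
    ranked = sorted(driver_costs.items(), key=lambda item: item[1],
                    reverse=True)
    selected, running = [], 0
    for name, cost in ranked:
        # Stop once the drivers selected so far reach the coverage target.
        if running * 100 >= coverage_percent * total:
            break
        selected.append(name)
        running += cost
    return selected


# Hypothetical attributed costs per risk driver, in $ thousands.
drivers = {
    "unpatched internet-facing systems": 500,
    "phishing susceptibility": 300,
    "third-party access": 120,
    "lost devices": 50,
    "misconfigured backups": 30,
}

print(top_risk_drivers(drivers))
```

Even a rough ranking like this can direct resources and attention to the largest drivers before the full probability distribution has been estimated.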

The second use case is to define and allocate risk budgets to organizational units and departments in much the same way that capital budgets are defined and allocated. The core idea here is that leaders of organizational units can make decisions that might reduce risk or increase it, and if they had signals or incentives in the form of a risk budget that they "spend" through their decisions, they would make better cost/benefit decisions.
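A risk budget can be pictured as a ledger that decisions draw down or replenish. The sketch below is a toy model under my own assumptions: each decision has a single estimated risk cost, risk-reducing decisions carry a negative cost, and a decision is rejected if it would exceed the unit's remaining budget. The class name, unit, decisions, and amounts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RiskBudget:
    unit: str
    allocated: float
    decisions: list = field(default_factory=list)

    @property
    def spent(self):
        """Net risk cost of all accepted decisions."""
        return sum(cost for _, cost in self.decisions)

    @property
    def remaining(self):
        return self.allocated - self.spent

    def propose(self, decision, risk_cost):
        """Accept the decision if it fits in the remaining budget.

        risk_cost is positive for decisions that add risk and negative
        for decisions that reduce it.
        """
        if risk_cost > self.remaining:
            return False
        self.decisions.append((decision, risk_cost))
        return True


budget = RiskBudget(unit="Online Sales", allocated=100.0)
first = budget.propose("expose order API to partners", 70.0)
second = budget.propose("allow unmanaged devices", 50.0)   # does not fit yet
third = budget.propose("require multi-factor login", -30.0)
fourth = budget.propose("allow unmanaged devices", 50.0)   # fits now
print(first, second, third, fourth, budget.remaining)
```

The unit leader can take on the second risk only after "buying back" budget with a risk-reducing decision, which is the kind of signal the risk budget is meant to provide.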