Sunday, September 8, 2013

Mr. Langner is wrong. Risk management isn't 'bound to fail'. But it does need improvement and innovation.

In "Bound to Fail: Why Cyber Security Risk Cannot Simply Be 'Managed' Away" (Feb 2013) and a recent white paper, Ralph Langer argues that risk management is a fundamentally flawed approach to cyber security, especially for critical infrastructure.

Langner's views have persuaded some people and received attention in the media.  He gained some fame in the course of the investigation of the Stuxnet worm's capabilities to exploit Siemens PLCs (programmable logic controllers). Specifically, Langner was the first to assert that the Stuxnet worm was a precision weapon aimed at sabotaging Iran's nuclear program. Langner also gains institutional credibility as a Nonresident Fellow at the Brookings Institution, which published the "Bound to Fail..." paper.  I'm guessing that the Brookings PR department has been helping to get press attention for Langner's blog post critiquing NIST CSF and his proposed alternative: RIPE.  They were reported in seven on-line publications last week alone: here, here, here, here, here, here, and here.   (Note to self: get a publicist.)

In this long post, I'm going to critique Mr. Langner's critique of risk management, noting the few places where I agree with him, and presenting counter-arguments to his claim that risk management is fundamentally flawed.

  • TL;DR version: There's plenty of innovation potential in the modern approach to risk management that Langner hasn't considered or doesn't know about. Therefore, "bound to fail" is false.  Instead, things are just now getting interesting.  Invest more, not less.

In the next post, I'll critique Mr. Langner's proposed alternative for an industrial control system security framework, which he dubs "Robust ICS Planning and Evaluation" (RIPE).

Summary of Langner's Argument

Langner and his co-author Perry Pederson (see endnote) present an elaborate argument that risk management is fundamentally flawed when applied to cyber security in critical infrastructure. Specifically, they are arguing against what I have dubbed "little 'r' risk management" as it is practiced.  His writings don't mention any other approach or method, nor any practice of risk management outside of cyber security (though there are two references to "complex math" in Financial Services risk management and how it led to the Great Recession of 2008).  He doesn't examine the theoretical basis for risk management, nor any of the technical debates within risk analysis or management -- e.g. the Frequentist vs. Bayesian debate.

 Langner's argument includes many warrants and claims, including but not limited to:
  1. Lack of empirical justification, and a track record of failure
  2. Lack of practical feasibility
  3. Lack of theoretical soundness
  4. Inadequate "maneuver speed" of risk mitigation in critical infrastructure environments
  5. Mis-aligned incentives between private interests of businesses and public interests of society and government
  6. It is held in low regard by process and control engineers (perceived as "black magic")
  7. Insufficient cyber system (ICS) understanding on the part of risk analysts
  8. "Risk acceptance" is inherently bad and wrong
  9. Lack of cyber governance -- i.e. "Risk management as a method to prioritize mitigation efforts and budget allocation will necessarily fail in cyber environments that are poorly documented and loosely governed."
  10. Untrustworthiness and low ethics of decision-makers (i.e. prone to shirking, blame-shifting, moral hazard, self-deception, gaming the system, etc.)
  11. Inferiority of business approaches to technical approaches
  12. North American Electric Reliability Corporation (NERC) and other regulatory bodies have tacitly admitted that risk-based cyber security has created more problems than it solved
Of this long list I'm only going to address the first four, because my interest is in the method of risk management, and especially where it is going on the trajectory of innovation.  Yes, all the other factors affect implementation success, but how they apply to Big 'R' Risk Management remains to be seen.  Plus, countering them all would lead to a very sprawling essay, which I'm trying to avoid in these posts (with mixed results!).

Implications

Langner draws two implications from these warrants.  The first is that risk management is "bound to fail" because these flaws are fundamental and intrinsic to risk management in all of its imaginable manifestations, and can never be corrected or mitigated.  The second is that risk cannot be "managed away", which is another way of saying that business or non-technical approaches are inferior to technical approaches.

I believe there are also two unspoken implications.  First, he's implying that no further effort or research should be devoted to risk management because it is a dead end.  Second, he's implying that anyone proposing or espousing risk management should be disregarded.  This is evident in Langner's discourse style.

Philosophical Context

My interpretation is that Langner (and many others who agree with him) are arguing from the viewpoint of Technocratic Empiricism (my label), where the only proper basis for decisions is "hard facts" established through empirical observation, preferably in controlled circumstances that are properly instrumented, analyzed by experts guided by an established body of technical practice.  Technocratic Empiricism values "objective reality" as defined and limited by our direct experience and observation.  It does not accept individual or group opinions or estimates as valid evidence or as a basis for decisions, and it excludes from analysis any phenomena or evidence where uncertainties or ignorance are substantial. Thus, Technocratic Empiricism avoids prediction in most circumstances. However, it does favor the accumulated knowledge and experience of experts -- almost always technical experts and usually engineers. Theoretical science and scientists are not valued unless theory is converted into both empirical evidence and engineering practice.  Complex models and complex math are distrusted, and are sometimes viewed as "elitist" or "Ivory Tower arrogance".  Finally, proponents often advocate command-and-control or mandate institutions where the judgements and decisions of experts are imposed on stakeholders because those solutions are "objectively superior".

I want to acknowledge this philosophical context as one of many, but I'm not going to argue for or against it in the paragraphs below.  For the record, I'll state that my philosophical viewpoint is different (e.g. see these posts: here and here).

All the quotes below are taken verbatim from Langner, with no modifications or added emphasis wherever possible.  For citations, "BTF" is an abbreviation for the "Bound to Fail..." paper.

Critique of 1. Lack of Empirical Justification

"Can risk be effectively managed? The sober reality is that in respect to the cyber security of critical infrastructure, there is no empirical evidence that a risk-based approach, despite its near decade of practice, has had any success." (BTF p 3)
I agree, generally.  I believe there is, in a few cases, empirical evidence of successful risk management in specific organizations, but these results are not publicized or are not well known.  To be specific, when it comes to "little 'r' risk" as instantiated by the NIST 800-series, for example, I agree that there is no evidence of success.  But neither Langner nor anyone else has evidence that innovations in risk management won't work, including Big 'R' Risk.

In general, there has also been no substantial empirical evidence that any approach is better, or that any approach is even effective to some base level.  Therefore, Langner's statement is not evidence for or against risk management.

To be clear, I agree strongly that we need empirical evidence of effectiveness.  My proposed solution: invest resources, time, and energy to empirically evaluate alternatives. It is a fact that only in the last few years has there been any meaningful investment in rigorous measurement, data collection, and evaluative analysis.

One root problem is the difficulty of measuring security, regardless of our approach to security.  By "approach" I'm talking about a comprehensive approach to cyber security at an enterprise level, not an approach to a narrow set of threats, vulnerabilities, processes, or events.  In some of those narrow areas there actually is good empirical justification for one approach over others (e.g. the Secure Development Life Cycle).  Here's a fine paper by Vilhelm Verendel, distilled from his PhD dissertation, describing the difficulties of measuring security generally and the lack of adequate progress as of 2009:

Verendel, V. (2009). Quantified security is a weak hypothesis: A critical survey of results and assumptions. In Proceedings of the 2009 New Security Paradigms Workshop (pp. 37–50). New York, NY, USA: ACM. doi:10.1145/1719030.1719036. On-line here.

There is a subtle implication in Langner's phrase "...despite its near decade of practice...".  The implication is that ten years is surely long enough to establish the validity of a method and, further, that it is surely long enough for it to be fully developed and explored, with no further innovation needed. Here, Langner considers the possible accuracy of probabilistic measurement related to one specific aspect of "little 'r' risk analysis":
"The validity of a measurement method is determined by how accurately it measures what it claims to measure. Applied to cyber risk, the question is, does the risk-based approach really measure the likelihood of experiencing a cyber attack? [...] For example, what exactly does a 'security control' control? The general problem is that there is no empirical co-variation between security controls and reported cyber security incidents in critical infrastructure; doing nothing (in terms of security controls) does not necessarily result in security incidents. In other words, quantifying security has not been proven as valid, despite concerted efforts in many fields such as computer science and economics.[13]" (BTF p 6)
There are several things wrong with this analysis.  Langner first implies the existence of empirical evidence regarding correlation ("co-variation" is his term) between security controls and reported breach incidents, and asserts that this evidence shows no correlation.  I have no idea what evidence he might be referring to here beyond his professional experience.  Then, in the sentence that follows, he draws the conclusion from Verendel's paper that "quantifying security has not been proven as valid", implying that all avenues have been exhausted. My conclusion from Verendel's paper is very different -- that we haven't started going down the right avenues until very recently, and therefore it's too early to expect conclusive evidence, let alone "proof".  The "decade" Langner refers to was mostly a lost decade, but that isn't uncommon when new knowledge is finding its way into practice.

A colleague of mine, Alex Hutton, has called risk management for information security a "proto-science".  (See this presentation, also this and this.)   If we look back in history at the development of other "sciences" through their early stages, we can see them go through a "proto-science" stage where the elements are rough, mis-structured, or missing, and practitioners are muddling their way through even in the absence of solid empirical justification that the "proto-science" will work or be an improvement over the alternatives.  This can be seen in the history of actuarial science, where it took decades to become institutionalized outside of life insurance. (BTW, this case may appear in my dissertation.)   Peter Drucker's rule of thumb was that it took 20 years on average for new knowledge to make its way into empirically-proven solutions, products, and services.

In conclusion, Langner's warrant that there is a lack of empirical justification for risk management is flawed.  Lack of empirical evidence of success -- which afflicts every approach equally -- is not a basis for discrediting risk management.

Critique of 2. Lack of Practical Feasibility

"Several important assumptions underlie the risk-based approach: First, that it is possible to correctly identify all exploitable vulnerabilities for any given target of evaluation; second, that it is possible to measure the threat level, which implies correct identification of all potential attackers, and correct assessment of their motivation and capabilities; and third, that it is possible to correctly quantify all cost consequences resulting from a successful exploitation, and all of the cost of mitigation. If for any given target of evaluation any of these parameters doesn’t represent the full story (for example, because vulnerabilities have been overlooked or cost of mitigation failed to include substantial cost items), the risk assessment for such target of evaluation might turn out to be grossly incorrect. On a larger scale, where risk assessments of multiple targets of evaluation are used to prioritize mitigation efforts, such prioritization can again be grossly misleading." (BTF p 4)
I'm interpreting Langner's use of "possible" here to relate to practical feasibility.  In the next section, I'll talk about theoretical feasibility or possibility.

He's talking about the specifics of "little 'r' risk analysis", and I have the very same reservations and doubts about practical feasibility that I expressed in my recent post: "Risk Management: Out with the Old, In with the New".

But Langner's argument assumes that "little 'r' risk" is the only way to do probabilistic risk management -- he doesn't hint at or allow for any alternative or innovation.  I assume this is because he's not aware of what's been going on in the risk management field, broadly, and doesn't follow the latest research.

Of course, practical feasibility is vital, and progress is being made in key areas.  For example, Langner implies that it is not "...possible to correctly quantify all cost consequences resulting from a successful exploitation, and all of the cost of mitigation".  However, our recent WEIS paper proposed just such a practical method, specifically designed to apply to critical infrastructure, including cascading failures.  As this method gets tested and refined, it should make a big improvement in the practical feasibility of risk analysis and also business continuity planning.
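
To make the idea concrete, here is a minimal sketch of what quantifying loss consequences probabilistically can look like.  To be clear, this is not the method from our WEIS paper -- it's just a generic frequency-times-severity Monte Carlo with a crude stand-in for cascading failures, and every parameter value is hypothetical:

```python
import math
import random

def draw_poisson(lam: float) -> int:
    """Knuth's method for a Poisson draw (adequate for small rates)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def simulate_annual_loss(n_trials=100_000,
                         attack_rate=0.8,    # hypothetical: expected attacks/year
                         median_loss=2e6,    # hypothetical: median direct loss ($)
                         sigma=1.2,          # hypothetical: lognormal severity spread
                         p_cascade=0.15,     # hypothetical: chance a loss cascades
                         cascade_mult=4.0):  # hypothetical: cascade amplification
    """Annualized loss distribution: Poisson frequency x lognormal severity."""
    losses = []
    for _ in range(n_trials):
        total = 0.0
        for _ in range(draw_poisson(attack_rate)):
            loss = random.lognormvariate(math.log(median_loss), sigma)
            if random.random() < p_cascade:  # crude stand-in for cascading failure
                loss *= cascade_mult
            total += loss
        losses.append(total)
    losses.sort()
    return {"mean": sum(losses) / n_trials, "p95": losses[int(0.95 * n_trials)]}

print(simulate_annual_loss())
```

Even a toy model like this yields a loss distribution rather than a single point estimate, which is the whole point: the output is something a decision-maker can interrogate and stress-test.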

Feasibility will also be achieved when we integrate multiple methods of estimation and analysis, including historical data, experiments, simulations, and calibrated expert opinions.
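
As a toy illustration of that integration (my sketch, not any published method): encode a calibrated expert's rate estimate as a Gamma prior, then fold in observed historical incident counts via the conjugate Gamma-Poisson update.  All the numbers here are hypothetical:

```python
# Hypothetical numbers throughout -- a conjugate Gamma-Poisson sketch of
# combining a calibrated expert estimate with historical incident data.

# Expert's calibrated belief about the annual incident rate, encoded as a
# Gamma(alpha, beta) prior: mean = alpha/beta = 2.0 events/year, weakly held
# (beta acts like "pseudo-years" of prior evidence).
alpha_prior, beta_prior = 2.0, 1.0

# Historical data: incidents observed over an exposure period.
observed_incidents, exposure_years = 7, 5.0

# Conjugate update: posterior is Gamma(alpha + count, beta + exposure).
alpha_post = alpha_prior + observed_incidents
beta_post = beta_prior + exposure_years

print(f"prior mean rate:     {alpha_prior / beta_prior:.2f} events/year")
print(f"posterior mean rate: {alpha_post / beta_post:.2f} events/year")
```

Notice the desirable behavior: as exposure accumulates, the data term dominates the expert's prior.  Opinion and evidence aren't rivals in this framing; they are inputs of differing weight.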

In conclusion, Langner's warrant that risk management isn't practically feasible only applies to "little 'r' risk management", which I've also criticized.  He has nothing to say about the practical feasibility of innovative approaches like "Big 'R' Risk Management".

Critique of 3. Lack of Theoretical Soundness

While Langner never questions the validity of probability theory or probabilistic reasoning directly (as other critics have done), he does make sharp claims about the impossibility of using probabilistic reasoning in any setting with intelligent adversaries:
"In most cases, the gambling coin in the discussion of risk assessment is threat. Since there is no scientifically sound way to measure threat (which would involve quantifying the motivation and capability of potential adversaries and predicting their actions), it is always possible to dismiss the threat part of the risk equation as minimal." (BTF p 5)
This assertion is demonstrably false.  In the physical domain, so-called "predictive policing" is a success (see this and this).  Regarding the general problem of estimating the motivation and capabilities of adversaries, it is theoretically and practically feasible, as I describe here.  Of course, if you are stuck in the "little 'r' risk" box, then it can be hard to estimate the probability of a particular threat actor using a particular attack against a particular vulnerability, etc.  But once free of that box, there is no theoretical reason why it would be impossible to estimate probabilities of attack patterns or frequency against a range of assets or organizations in a period of time.  We aren't facing Descartes' Demon -- a rhetorical and mostly-fictional Bogeyman.
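
Here's a tiny sketch of that "out of the box" framing: estimate annual rates for a handful of attack patterns (rather than predicting individual adversaries' moves) and derive the probability of at least one attack in a year.  The patterns and rates below are hypothetical placeholders:

```python
import math

# Hypothetical per-pattern annual rate estimates (events/year) for one
# facility -- the point is the framing, not these particular numbers.
pattern_rates = {
    "spearphish to engineering workstation": 0.30,
    "infected removable media": 0.10,
    "compromised vendor remote access": 0.05,
}

total_rate = sum(pattern_rates.values())
# Poisson arrivals: P(N >= 1 in a year) = 1 - exp(-lambda)
p_at_least_one = 1 - math.exp(-total_rate)
print(f"P(at least one attack this year) ≈ {p_at_least_one:.0%}")
```

No quantification of any individual attacker's motivation is required; the rates can come from industry incident data, calibrated estimates, or both, and can be revised as evidence arrives.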

So the relevant theoretical challenge isn't "can we estimate adversary actions probabilistically?" but instead "can we estimate adversary actions with sufficient resolution and contextual relevance?" and also "can we improve our estimates based on direct and indirect evidence?".  Langner never addresses these questions because he dismisses the whole topic as impossible even in theory.

Langner raises a number of objections about probabilistic estimation of future events, in general.  For example:
"The concept of risk is predictive, since it arrives at assumptions about future events and their cost. Risk management is an attempt to control the future based on predictions of what the future will look like. Therefore, any determination of risk implicitly assumes a predictive timeframe. Unfortunately, cyber security experts rarely specify if their timeframe under consideration is a month, a year, a decade, or the lifetime of the target of evaluation. Failing to be specific on timeframe alone makes any risk assessment non-verifiable—usually without explicitly saying so. Even the National Institute of Standards and Technology (NIST) cautions that the validity of risk assessments are bounded in time, but does not offer a solution to this intractable problem." (BTF p 5)
There is quite a bit of misunderstanding in this statement.  For example, risk management doesn't attempt or claim to "control the future".  Instead, it's about taking action today to prepare for or shape an uncertain future, including the nature and sources of our uncertainty.

I interpret Langner's discussion of "timeframe" as both poor execution on the part of risk analysts and a theoretical flaw -- that timeframe is not definable in theory ("intractable").  But this is silly.  Of all the difficulties and problems in risk analysis, defining the operative time period for probability estimates is not one of them.  (Some weird things can happen in the math when the time period shrinks relative to the frequency of events.  This is where Itō calculus would come in, but, thankfully, to my knowledge it isn't needed in any of the innovative approaches I've heard of.)
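
To show how untroublesome the timeframe question actually is, here's the same Poisson identity from the sketch above scaled across different windows, using a hypothetical rate.  Once you state a rate per unit time, any horizon is well-defined:

```python
import math

def p_at_least_one(annual_rate: float, years: float) -> float:
    """P(>= 1 event in the window), assuming Poisson arrivals at a stated rate."""
    return 1 - math.exp(-annual_rate * years)

rate = 0.5  # hypothetical: 0.5 serious incidents per year
for label, years in [("1 month", 1 / 12), ("1 year", 1.0), ("10 years", 10.0)]:
    print(f"{label:>8}: {p_at_least_one(rate, years):.1%}")
```

Analysts who fail to state their timeframe are executing badly, as Langner says -- but that's a fixable practice problem, not an intractable theoretical one.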

In this quote, Langner asserts that, in principle, we will never have adequate information to do risk analysis, and that therefore what we call "risk" is really Knightian uncertainty:
"An alternative approach to statistical probability that relies on historical data is the logical cause-and-consequence model. This model assumes a logical relation of cause and consequence, moderated by uncertainty. Risk is an intermediary concept between fate and uncertainty. Where cause and consequence are deterministic, the whole notion of risk becomes inappropriate.

"The logical link between root cause, moderating factors, and effect can be established either by experiment (demonstrated causality) or by statistics (inferred causality). If neither is possible, it is inappropriate to talk in terms of risk because no empirical prediction can be made. In other words, predictions without the possibility for empirical verification are as useless as astrology.

"While it may be difficult for computer scientists to admit, cyber security is not yet a science.[15] For cyber security, it turns out that what is often referred to as risk is just 'uncertainty', as any moderating factors appear to be unknown or immeasurable." (BTF p 6)
Continuing this line of reasoning, in this quote Langner expresses the claim that probabilistic estimation of "unknown" future events is "mistaken".  I interpret Langner here as expressing theoretical impossibility.
"The basic assumption embedded in this and all risk formulae is that unknown future events of an unknown frequency, unknown duration, unknown intensity, from an unknown assailant, with unknown motivations, and unknown consequences are quantifiable. Consequently, if one thinks s/he can measure the risk, the mistaken conclusion is that one can manage the risk." (BTF p 8)
However, this nihilist view is unwarranted because it assumes that, since we can't know everything about those future events with high precision, we can't know anything.  As I mentioned in the "Philosophical Context" subsection above, this is how Technocratic Empiricists view the world.   My counter-argument is in this post: The Rainforest of Ignorance and Uncertainty.

Langner views risk analysis as intrinsically unreliable:
"The reliability of a method is the degree to which independent people arrive at identical (or at least very similar) results when applying the method to the same subject. It appears that the risk-based approach to cyber security has very low reliability." (BTF p 4)
This is false if the risk estimations and analyses are conducted properly.  In his book How To Measure Anything, Doug Hubbard explains how to calibrate expert estimates.  Other methods apply to calibrating estimates from disparate sources, even with varying levels of relevance and credibility.  But a good case can be made that there is no requirement that risk assessments done by different people should arrive at the same or very similar results on a given iteration of the assessment.  If risk assessment is done frequently and learning is effective, then what counts is convergence of independent estimators, not the output of any given estimator on a given iteration.
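
Here's a small simulation of that convergence claim (my sketch, with made-up numbers): five analysts start from deliberately disparate priors about an incident rate, but because each updates on the same accumulating evidence, their estimates converge -- even though no single iteration requires them to agree:

```python
import math
import random

random.seed(7)
TRUE_RATE = 1.5  # hypothetical "true" incident rate (events/year)

def draw_poisson(lam: float) -> int:
    """Knuth's method for a Poisson draw."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

# Five analysts with deliberately disparate Gamma(alpha, beta) priors.
priors = [(0.5, 1.0), (2.0, 1.0), (6.0, 2.0), (1.0, 0.5), (10.0, 10.0)]

events, years = 0, 0.0
for checkpoint in (1, 3, 10, 30):
    while years < checkpoint:           # accumulate shared evidence
        events += draw_poisson(TRUE_RATE)
        years += 1.0
    # Each analyst's posterior mean rate after seeing the same data.
    estimates = [(a + events) / (b + years) for a, b in priors]
    spread = max(estimates) - min(estimates)
    print(f"year {checkpoint:>2}: spread across analysts = {spread:.2f}")
```

The spread among analysts shrinks as shared evidence accumulates.  Reliability, in other words, is a property of the learning process, not of any single assessment snapshot.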

Finally, Langner discredits risk analysis as abstract arithmetic posing as exact science, and as overly complex, which he implies means it will not connect to reality:
"Abstract arithmetic for calculating risk may appeal to some, suggesting that risk is an exact science, yet most if not all the existing models have not been validated.[10]  It is also reasonable to suspect that the more complex risk calculations get, the less they can actually be linked to empirical reality." (BTF p 8)
He goes on to point to the Great Recession as "the best example of risk quantification gone wrong", and then quotes Nicholas Taleb: "Anything that relies on correlation is charlatanism."

This is one of the weakest elements of Langner's argument because it appeals to prejudice and an irrelevant example (the Gaussian Copula).  Taleb himself uses quite a bit of complex math to arrive at his conclusions.  Don't believe me?  Read his in-progress textbook, starting at page 14.

Critique of 4. Inadequate "Maneuver Speed"

"Threats and vulnerabilities are moving targets. New threats and vulnerabilities can pop up overnight. A basic assumption of risk mitigation is that after having performed a risk assessment, security controls can be implemented in ample time before an incident materializes, putting defenders and attackers in some kind of race where the defenders try to outsmart attackers by anticipating their moves and implement working countermeasures in time. Even if that would work to some extent, industrial environments are not designed for rapid reconfiguration, making it practically impossible to implement mitigation for new threats in short order."
I heartily agree that what Langner calls "maneuver speed" is vital.  I emphasize this in my two posts on learning loops, here and here.  In fact, the whole reason-for-being of the Ten Dimensions is to promote agility and rapid innovation. I also agree that if risk assessment and estimates are done infrequently, as is common in "little 'r' risk", then it hampers "maneuver speed" and agility.  Not good.

Yet Langner's argument here is only valid for implementations of risk management that are tied to long cycles.  But this isn't the only way to do it.

One more thing. Langner's argument here has an odd structure.  Try replacing "risk mitigation" with any other approach to cyber security, and I think you'll find that the argument remains valid.  While he and Pederson call risk management a "reactive approach" that is inadequate to deal with the very long cycles of ICS equipment replacement (decades, sometimes), it's hard for me to imagine how any other approach is different.  Every approach to cyber security has to cope with a changing landscape of threat agents, threats, and even the socio-technical environment it operates in.

Counter-argument to "Bound to Fail"

In summary, Langner argues that risk management is "bound to fail" because it is fundamentally flawed and has too many intrinsic problems, both conceptual and practical.

My counter-argument is that the past failings of risk management, both in concept and practice, are grist for the innovation mill, so to speak.  And I'm not just hand-waving or blowing smoke when I say this.  I'm actively involved in innovation activities, some of which have borne fruit and others of which are still in the making.

Developing new knowledge and applying it to a complex field like cyber security is a chancy, dare I say risky, process.  Mis-steps and false starts are common and should be expected.  We shouldn't expect this all to be born fully formed.  Above all, it will take some time to test and validate it empirically to the point where it is obviously beneficial to non-specialists.

Consider a recent example in another domain -- Artificial Intelligence.  There was an "AI Winter" in the late '80s and early '90s when the big promise and hype of Expert Systems proved to be overblown.  Businesses collapsed, as did funding for new ventures and academic research.  Some people wrote off AI altogether, and I'm sure there were many I-told-you-so articles written by nay-sayers.  But out of this environment of "failure" came many of today's successes, including the machine learning that powers Google and Siri, not to mention IBM's Watson.  Another branch of success is Agent-based Modeling, which is being implemented at scale to create realistic ("history-friendly") models of financial crises and refugee flows, to name just two examples.

Finally, let me put my counter-argument another way.  For Langner's argument to be valid, he would need to provide strong evidence that risk management entailed some violation of the fundamental laws of computing -- i.e. that it was computationally infeasible (e.g. NP-Complete) or required algorithms that are uncomputable (i.e. solving the Halting Problem).  That would justify a conclusion of "bound to fail" and foreclose any possibility of innovating around the problems.  Barring such evidence, any judgment of the future potential of risk management would need to include "...but it could also succeed" to allow for possible innovations now in sight.

Counter-argument to "Risk Cannot Be 'Managed Away' "

"A fundamental reason for this failure is the reliance on the concept of risk management, which frames the whole problem in business logic. Business logic ultimately gives the private sector every reason to argue the always hypothetical risk away, rather than solving the factual problem of insanely vulnerable cyber systems that control the nation’s most critical installations." (BTF p 1)
Basically, what he's calling "managed away" is an exercise in gamesmanship -- risk management gamed in a way that masks or ignores the cost of future loss events, giving decision-makers a "get out of jail free" card.

My counter-argument is that the innovations in risk management that I and others are working on are aimed at exactly the opposite -- to make the cost of the future visible and actionable in the present so that it makes a meaningful difference in our security posture, and to hold decision-makers accountable.

I also argue that there is no escaping the need to make priority decisions regarding the use of scarce resources -- time, people, money, equipment, commitments, and relationships.  I don't think there is any organization that can spend whatever it wants on cyber security (OK... maybe one...).  If the implication of Langner's approach is "cyber security costs what it costs", I argue that is a non-starter.  Like every other aspect of business, government, and society, there needs to be some accountability for using scarce resources wisely, and this inevitably involves prioritizing and economizing.  Thus, the "business logic" that Langner decries can never be banished.  Instead, there needs to be a new synthesis of "business logic" and "technical logic" (not to mention "social logic" and "political logic"), and that is exactly the motivating goal of the modern approach to risk management (Big 'R' Risk) and the Ten Dimensions I have proposed.

------------

Endnote: Perry Pederson is co-author of the "Bound to Fail..." paper but, to my knowledge, hasn't written anything on the topic of risk management since then, and thus I've chosen to associate these specific arguments and positions with Langner, and especially with his recent writing.

2 comments:

  1. I've read "Bound to fail" after seeing it mentioned a few times by some security experts.
    It was shocking to read so much biased, one-sided diatribe against Risk Management, judging it against standards the author does not see fit to apply to his own thinking. Mr. Langner is obviously a very respected security consultant, which makes it more baffling to me that this research is beyond all expectations.

    One good thing that came out of this is that in search for some critique of that "research" I came across your blog. Your blog is very informative.
