Big data remains very popular in cybersecurity. There is a general perception that if we can get network data and ecosystem telemetry into a big data engine, it will improve our ability to identify malicious behaviors. There are two significant flaws in this theory:
1. Big data analytics tools are only as good as the content they are fed from data sources, and
2. Analysis without context fails to establish threat relevance and is not useful for defense, detection and remediation.
Typical data sources such as Syslog and NetFlow lack the key indicators of malicious behaviors; instead, they depict activity that appears to typical data analytics engines as uncharacterized environmental traffic.
SIEMs examine this data, comparing traffic patterns and volumes against the alerting thresholds set across all of the monitored categories and devices, to identify indicators of compromise. But they miss much of what actually goes on.
The bad guys figured out a while ago how to circumvent the thresholds: the defaults are published in the user guides, and most folks never change them.
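The threshold approach can be sketched in a few lines. This is a minimal illustration, not any real SIEM's logic; the event fields and the "50 failed logins per hour" default are hypothetical:

```python
# Minimal sketch of threshold-based SIEM alerting.
# The threshold and event schema are hypothetical, for illustration only.
from collections import Counter

FAILED_LOGIN_THRESHOLD = 50  # a "published default" per hour, hypothetical

def alert_on_failed_logins(events):
    """Return the source IPs that exceed the hourly failed-login threshold."""
    counts = Counter(e["src_ip"] for e in events if e["type"] == "failed_login")
    return [ip for ip, n in counts.items() if n > FAILED_LOGIN_THRESHOLD]

# An attacker who keeps to 40 attempts per hour, hour after hour,
# never trips the default threshold and produces no alert at all.
low_and_slow = [{"type": "failed_login", "src_ip": "203.0.113.9"}] * 40
print(alert_on_failed_logins(low_and_slow))  # → []
```

A "low and slow" campaign that stays just under each published default is invisible to this kind of logic, which is exactly the circumvention described above.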
What Has Changed?
As malware continues to evolve and insiders increasingly operate in stealth mode, fewer of the indicative data elements show up in these logs, flows and baselines.
In addition, today’s coordinated attacks are multi-stage and multi-vector. Because traditional big data analytics examines discrete or even aggregated events out of context, it misses the subtle patterns and sequences of related behaviors that cyber criminals now use consistently across the global threat landscape to assemble an effective Attack-in-Depth invasion model.
Attack-in-Depth is the current version of the once-popular cyber kill chain model: it works by delivering payloads, persisting on endpoints, taking hold across the network, and exfiltrating or destroying information assets.
To successfully combat these Attack-in-Depth threats, we must shift our approach to contextual data analytics. An effective analytic engine must be fed the otherwise hidden indicators of malicious behaviors, indicators that are only detected with the right type of analytics.
These analytic engines need algorithms constructed to detect both structured and unstructured malicious behaviors within the context of a specific threat envelope. That threat envelope must be informed by patterns of behavior occurring outside the network and by telemetry from across the threat landscape external to the operation.
And, these engines need to be able to operate on this data in real time to identify and isolate an infection after a network has been invaded and before the assets can be breached. Contextual analytics is an enabler.
At their core, analytics engines typically follow one of four primary reasoning methodologies:
Deductive reasoning is based on the theory of deductive inference, which draws specific conclusions from general rules. For example, if A = B and B = C, then A = C, regardless of what A, B or C contain. Deductive reasoning tracks from a general rule to a specific conclusion: if the original assertions are true, then the conclusion must be true. A fundamental weakness of deductive reasoning is that it is often tautological (e.g., “malware contains malicious code” is always true) and it is unaffected by contextual inputs: to earn a master’s degree, a student must have 32 credits; Tim has 40 credits, so Tim will earn a master’s degree, except when he decides not to.
In security analytics, A only equals B most of the time, and sometimes it equals D, so A cannot always equal C. Using deductive reasoning as the basis for detection analytics is therefore a flawed way to try to predict the future, and a theoretical guarantee of being breached at least once.
In general, common signature-based systems such as IDS/IPS and endpoint security are deductive in nature.
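In code, this deductive rule looks like a simple lookup: a payload is malicious if and only if it matches a known signature. The byte patterns below are invented for illustration, not real IDS rules:

```python
# Sketch of deductive, signature-based detection: general rule -> specific
# conclusion. The "signatures" are invented byte patterns, not real IDS rules.
KNOWN_SIGNATURES = [b"\x4d\x5a\x90\x00evil", b"drop table users"]

def is_malicious(payload: bytes) -> bool:
    """Deductive rule: payload matches a known signature => malicious."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)

print(is_malicious(b"select 1; drop table users;"))  # → True
# A trivially mutated payload escapes the general rule entirely:
print(is_malicious(b"drop  table users"))            # → False
```

The second call shows the flaw described above: when A stops equaling B (here, one extra space in the payload), the deduction silently fails.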
Inductive reasoning is the opposite of deductive reasoning. Inductive reasoning makes broad generalizations from specific observations. In inductive inference, we go from the specific to the general. We make many observations, discern a pattern, make a generalization and infer an explanation or a theory.
Where analytics engines are based on inductive reasoning, the resulting analytics resemble probability theory. Even if all of the premises in a statement are true, inductive reasoning allows the conclusion to be false. Here’s an example: “Harold is a grandfather. Harold is bald. Therefore, all grandfathers are bald.” The conclusion does not follow logically from the evidence.
Induction is a better approach than deduction for projecting the future, but it is also imperfect and can produce even more widely varying results. Advanced IDS/IPS systems use heuristics to identify malicious behaviors. A heuristic is a rule that provides a shortcut to solving difficult problems; heuristics are used when you have limited time and/or information to make a decision, and they lead you to a good decision much of the time.
Heuristics are frequently used to generalize the probability of malicious behaviors based on limited input (e.g., known signatures).
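A heuristic engine of this kind can be sketched as a weighted checklist. The features, weights and cutoff below are invented for illustration; real products tune these from large corpora:

```python
# Sketch of heuristic (inductive) scoring: weighted rules of thumb combine
# into a probability-like suspicion score. Features and weights are invented.
HEURISTICS = {
    "packed_executable":   0.4,
    "writes_to_startup":   0.3,
    "contacts_new_domain": 0.2,
    "high_entropy_strings": 0.1,
}

def suspicion_score(observed_features):
    """Sum the weights of every heuristic that fired."""
    return sum(w for name, w in HEURISTICS.items() if name in observed_features)

score = suspicion_score({"packed_executable", "contacts_new_domain"})
print(score >= 0.5)  # → True: combined evidence crosses the alert cutoff
```

Like all induction, this generalizes from limited input: a sample that fires the "right" two heuristics alerts even if benign, and novel malware that fires none sails through.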
Without enough certainty to prove that a positive is indeed a true positive, security and SOC analysts chase down rabbit holes all day long.
Bayesian or Recursive Bayesian Estimation (RBE) reasoning is anomaly-oriented and is used in security systems to provide a less tactical view of what has happened over an extended time frame (e.g., 30 days).
In statistics, “standard deviation” is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A standard deviation close to 0 indicates that the data points tend to be very close to the mean value of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
In most Bayesian-based security analytics, when a result is 3 standard deviations from normal, the system declares it an “anomaly.” The goal of Bayesian reasoning is to identify a “normal” pattern of behavior by observing subtle fluctuations in activity within the enterprise infrastructure. The result is a baseline, which is then used as a “benchmark” against which all network activity and/or behaviors are measured.
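The 3-standard-deviation test can be sketched with Python's standard library. The baseline figures here are made up:

```python
# Sketch of the 3-sigma anomaly test used by many baseline-driven engines.
# The traffic figures are invented for illustration.
from statistics import mean, stdev

baseline = [100, 110, 95, 105, 98, 102, 99, 101]  # e.g., daily GB transferred

def is_anomaly(observation, history, k=3):
    """Flag an observation more than k standard deviations from the mean."""
    mu, sigma = mean(history), stdev(history)
    return abs(observation - mu) > k * sigma

print(is_anomaly(250, baseline))  # → True: far outside 3 sigma
print(is_anomaly(108, baseline))  # → False: within normal variation
# If an intruder was already moving data while `baseline` was collected,
# that activity is baked into mu and sigma and will never flag.
```

The final comment is the crux: the arithmetic is sound, but the conclusion is only as good as the baseline it is measured against.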
This baselining is often flawed and can lead to extraordinary outcomes, none of which result in properly identified threats. There are three significant problems with this approach:
1. If the network and/or the systems being baselined are already infected before the baseline is created then the baseline establishes a false premise,
2. If an insider is already active on a network, that insider’s actions will appear as nominal and become part of the “normal” baseline, and
3. Today’s network infrastructure and user behavior are increasingly dynamic, variable and diverse, involving many different devices, protocols, access methods and entry points, essentially making a baseline assessment impossible without a network lock-down.
Analytics engines that use baselining as the premise for Bayesian reasoning are prone to extreme volumes of false positives, are cumbersome and difficult to tune and administer, and frequently miss malicious invasions.
Abductive reasoning is a form of logical inference that goes from an observation to a hypothesis that accounts for the observation, ideally seeking to find the simplest and most likely explanation.
A simple definition: abduction begins with evidence and builds to a hypothesis. Induction and deduction start with a hypothesis and seek supporting evidence. Bayesian reasoning begins with an extrapolation of evidence that may have been tampered with, seeking to draw a conclusion.
In abductive reasoning, unlike in deductive or inductive reasoning, the premises do not guarantee the conclusion. Abductive reasoning typically begins with an incomplete set of observations and proceeds to the likeliest possible explanation for the set. It mirrors the kind of daily decision-making that does its best with the information at hand, which is often incomplete, yet it likely provides the most optimized model upon which an automated result can be built.
Medicine and Courts of Law
A medical diagnosis is a classic use case for abductive reasoning: given a set of symptoms, what is the diagnosis that would best explain most of them?
Likewise, when jurors hear evidence in a criminal case, they must consider whether the prosecution or the defense has the best explanation to cover all the points of evidence. While there may be no certainty about their verdict, since there may exist additional evidence that was not admitted in the case, they make their best guess based on what they know.
While cogent inductive reasoning requires that the evidence that might shed light on the subject be fairly complete, whether positive or negative, abductive reasoning is characterized by an incomplete set of observations, either in the evidence or in the explanation, or both, yet leading to the likeliest possible conclusion.
A patient may be unconscious or fail to report every symptom, for example, resulting in incomplete evidence, or a doctor may arrive at a diagnosis that fails to explain several of the symptoms. Still, the doctor must reach the best diagnosis possible given the evidence.
Probabilistic abductive reasoning is a form of abductive validation, and is used extensively and very successfully in areas where conclusions about possible hypotheses need to be derived, such as for making diagnoses from medical tests, working through the judicial process or predicting the presence of malware.
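A minimal sketch of this kind of probabilistic abductive ranking follows. The hypotheses, priors and likelihoods are invented for illustration; a real engine would learn them from threat intelligence:

```python
# Sketch of probabilistic abductive reasoning: given incomplete observations,
# rank competing hypotheses by how well each explains the evidence.
# All priors and likelihoods below are invented for illustration.
OBSERVATION_LIKELIHOODS = {
    # P(observation | hypothesis)
    "ransomware": {"mass_file_writes": 0.9, "beaconing": 0.3,  "odd_logins": 0.2},
    "insider":    {"mass_file_writes": 0.4, "beaconing": 0.1,  "odd_logins": 0.8},
    "benign":     {"mass_file_writes": 0.1, "beaconing": 0.05, "odd_logins": 0.1},
}
PRIORS = {"ransomware": 0.05, "insider": 0.05, "benign": 0.9}

def best_explanation(observations):
    """Return the hypothesis with the highest prior-weighted likelihood."""
    def score(h):
        p = PRIORS[h]
        for o in observations:
            p *= OBSERVATION_LIKELIHOODS[h][o]
        return p
    return max(PRIORS, key=score)

print(best_explanation({"mass_file_writes", "beaconing"}))  # → ransomware
```

Note that the evidence set is incomplete and the conclusion is not guaranteed, only the likeliest available explanation, which is exactly the abductive posture described above.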
To successfully counter and defeat malware and malicious insiders requires better data analytics, not big data analytics. At a deeper level, we need the right analytics methods using the right detection engines and delivering evidence in real time and within context about the systems we are protecting.
While we have made some progress with regard to the automation of detection systems, they still do not rise to the level where human intervention can be eliminated.
If our threat defense systems continue to approach the cybersecurity problem armed with uninformed, decontextualized and/or otherwise flawed data, we will make little progress on our path to defeat our adversaries who, as we have seen with the SolarWinds, Accellion, Colonial, JBS and Microsoft breaches, are far ahead of us.
We need a consolidated national effort with the public and private sectors on the same page, and modernized laws that allow us to counter our adversaries, but we also need better technology.
The question isn’t what happens when the bad guys get good at AI and ML. The question is when the good guys will be able to leverage these advances within our automated detection and response systems.
When we examine the current state of technology development, it is hard to find ‘solutions’ that are actually solutions.
It’s time to find a new gear.