
Using Anomaly Detection to Reduce 20 Million Alerts Per Day to 200

Alert Fatigue. Alert Triage. Alert Prioritization. Security teams at many organizations generate more alerts than they can effectively handle. Their firewalls are too chatty. Their antivirus solution raises the same alerts over and over. Their threat intel feeds generate too many false positives. Manually reviewing and whitelisting these alerts is too much work; what teams want is a system that automatically identifies the important ones. In this post, we describe how we leverage anomaly detection to help reduce alert volumes and focus analyst attention on the most important alerts.

The Case Study

We have a client whose network and endpoint monitoring solutions together generate more than 5 billion events and 600 million alerts per month – more than 200 alerts per second. The network alerts account for the vast majority of this volume, but the endpoint alerts are not insignificant. They account for 1.3 million alerts per month, more than 1,700 per hour.
The goal of this study was to automatically prioritize the alerts, to ensure that the organization would not miss the important alerts in a sea of uninteresting ones. This would also reduce the burden on the analysts, enabling the organization to spend less time on alert triage and more time investigating and protecting against real threats, doing proactive threat hunting, and taking other proactive security measures.
Below is a summary of the results for a 30-day period:
Before          After    Reduction in Alerts
600,000,000+    6,200    roughly 100,000 to 1

The result was a very manageable average of 200 alerts per day.
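
For reference, the headline figures follow directly from the 30-day totals. A quick back-of-the-envelope check (a small Python snippet using only the numbers quoted above):

    # Back-of-the-envelope check of the headline numbers for the 30-day period.
    alerts_before = 600_000_000   # alerts generated in the month
    alerts_after = 6_200          # anomalies surfaced over the same month

    print(alerts_before / 30)            # ~20,000,000 alerts per day before
    print(alerts_after / 30)             # ~207 alerts per day after
    print(alerts_before / alerts_after)  # ~96,774, i.e. roughly 100,000 to 1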

The Anomaly Detection Approach

Anyone who has spent a lot of time staring at security alerts has noticed patterns. A user runs an application that generates the same alerts every day. Regularly scheduled updates cause the same alerts to be raised across the company. A particular user has a penchant for downloading web toolbars, which raise a barrage of alerts.
These patterns are incredibly helpful in triaging alerts. But an analyst has limited time and cognitive bandwidth to identify them, encode them, and communicate them to their team. That process is expensive and error-prone, and it will never identify all the patterns. Instead, we have taken an anomaly detection approach to alert triage, in which the anomaly detection algorithms do all of this work for the analyst. The value of this is three-fold.
  • It helps to cut through the noise, those pesky alerts you see every day.
  • It surfaces the anomalies in groups, so analysts are simultaneously investigating and resolving multiple alerts.
  • It presents the analyst with all the information the algorithms used to identify the alerts, so they have the same background and context the algorithms had when making their decision.
Some of the anomalies we surface are entity-centric. Here, an entity could be a user, host, or IP address. Such anomalies include:
  • Entities generating alerts of types that are rare for that entity.
  • Entities generating spikes in alerts in total or by type.
  • Entities generating abnormal distributions of alerts by type.
Similarly, the above alerts could be grouped by severity, source, or any other information about the alert. Other anomalies focus more on organization-wide statistics. For example:
  • Alert types that are being observed on more entities than usual.
  • Alerts on specific indicators that are being observed on more entities than usual.
This distinction is an important one: the entity-centric anomalies are optimized for identifying specific compromised entities, while the organization-wide anomalies are optimized for identifying attacks that target larger portions of the organization.
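To make the entity-centric idea concrete, below is a minimal, unsupervised sketch of one such check: flagging days on which a host raises far more distinct alert types than is normal for that host. The column names, the robust z-score baseline, and the threshold are illustrative assumptions, not our production implementation.

    # A minimal sketch of an entity-centric check: flag (host, day) pairs where the
    # number of distinct alert types spikes relative to that host's own history.
    # Column names and the threshold are illustrative assumptions.
    import pandas as pd

    def distinct_type_spikes(alerts: pd.DataFrame, threshold: float = 3.5) -> pd.DataFrame:
        """alerts: one row per alert, with columns ['timestamp', 'host', 'alert_type']."""
        daily = (
            alerts
            .assign(day=alerts["timestamp"].dt.floor("D"))
            .groupby(["host", "day"])["alert_type"]
            .nunique()
            .rename("n_types")
            .reset_index()
        )
        # Robust per-host baseline: median and median absolute deviation (MAD).
        med = daily.groupby("host")["n_types"].transform("median")
        mad = (daily["n_types"] - med).abs().groupby(daily["host"]).transform("median").clip(lower=1)
        daily["score"] = 0.6745 * (daily["n_types"] - med) / mad  # approximate robust z-score
        return daily[daily["score"] > threshold].sort_values("day")

Because the baseline is computed per host, a host that always raises many alert types is not flagged; only a departure from its own normal behavior is.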
The anomaly detection techniques applied here are completely unsupervised, and don’t require training data. The goal is for them to be immediately useful, without any feedback or manual tuning. As analysts add feedback, it can be used to refine the approach even further.
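The organization-wide checks can be sketched in the same unsupervised style, for example by flagging alert types that are suddenly observed on many more hosts than their own history would suggest. Again, the schema and ratio threshold below are assumptions for illustration, not the production logic.

    # A companion sketch for the organization-wide case: flag alert types that are
    # observed on many more hosts than usual on a given day. Unsupervised; no labels.
    import pandas as pd

    def widespread_alert_types(alerts: pd.DataFrame, min_ratio: float = 3.0) -> pd.DataFrame:
        """alerts: one row per alert, with columns ['timestamp', 'host', 'alert_type']."""
        daily = (
            alerts
            .assign(day=alerts["timestamp"].dt.floor("D"))
            .groupby(["alert_type", "day"])["host"]
            .nunique()
            .rename("n_hosts")
            .reset_index()
        )
        # Baseline: the typical number of hosts raising each alert type on a day.
        baseline = daily.groupby("alert_type")["n_hosts"].transform("median").clip(lower=1)
        daily["ratio"] = daily["n_hosts"] / baseline
        return daily[daily["ratio"] >= min_ratio].sort_values("ratio", ascending=False)

An anomaly like "33 hosts all infected with the same malware" in the results below is the kind of finding this check surfaces.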

The Results

This section gives examples of the anomaly lists and explains why the highlighted anomalies were significant. Hostnames, dates, and IP addresses have been changed.
We begin with entity-specific anomalies. The following list shows hosts that had spikes in the number of different types of alerts they raised on a particular day. There were 6 spikes observed on 5 hosts.
April 5th 2017 ray-mbp had 7 different types of alerts, a spike
April 11th 2017 eddie-win had 17 different types of alerts, a spike
April 19th 2017 colin-mbp had 5 different types of alerts, a spike
April 19th 2017 eddie-win had 27 different types of alerts, a spike
April 29th 2017 eddie-mbp had 8 different types of alerts, a spike
April 29th 2017 vivek-mbp had 9 different types of alerts, a spike

Investigations of these spikes revealed a few different types of security incidents. Three of the spikes were caused by users installing applications packaged with malware. One involved installation of suspicious software. Two were quickly determined to be benign, caused by normal installation activity. In one case, this activity was followed closely by visits to .ru and .cn websites and execution of multiple files that raised alerts.
The following graph shows the host and some of the processes that generated alerts and their relationships to each other. It also shows that there was a simultaneous alert on some network communication between the host and an external IP address.
[Figure: relationship graph showing the host, the processes that generated alerts, and the simultaneous network alert to an external IP address]
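To give a sense of how a view like this can be assembled, here is a rough sketch that builds such a relationship graph from alert records using networkx. The field names are assumptions for illustration; this is not the actual pipeline behind the product.

    # A rough sketch of assembling the relationship graph described above:
    # hosts, processes, and external IPs become nodes; alerts and parent-child
    # process relationships become labeled edges. Field names are illustrative.
    import networkx as nx

    def build_alert_graph(alerts: list[dict]) -> nx.MultiDiGraph:
        g = nx.MultiDiGraph()
        for a in alerts:
            host = a["host"]
            g.add_node(host, kind="host")
            if a.get("process"):
                proc = f"{host}:{a['process']}"
                g.add_node(proc, kind="process")
                g.add_edge(host, proc, alert_type=a["alert_type"])
                if a.get("parent_process"):
                    parent = f"{host}:{a['parent_process']}"
                    g.add_node(parent, kind="process")
                    g.add_edge(parent, proc, relation="spawned")
            if a.get("dest_ip"):
                g.add_node(a["dest_ip"], kind="external_ip")
                g.add_edge(host, a["dest_ip"], alert_type=a["alert_type"])
        return g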

Other entity-centric anomaly lists yielded additional interesting results, including:
  • A SIP exploit attempt originating in Germany against a large block of IP addresses, identified by the spike in activity from the attacker IP.
    [Figure: spike in alert activity from the attacking IP address]
  • A host connecting to known Zeus CnC servers using curl and generating multiple simultaneous alerts from its web browser, identified by the unusual alert types being raised on that host.
  • A host that had been hit with an exploit, identified by unusual parent-child process relationships in the alert.
  • An internal user trying to brute-force a password to an internal host, identified by the spike in alert activity.
Among the organization-wide anomalies were the following:
  • 33 hosts all infected with the same malware, identified by a spike in the number of machines generating the alert.
  • 3 users using remote access software, identified by a spike in the number of users generating the alert.

Conclusion

Anomaly detection is an effective tool for prioritizing the most important alerts and significantly cutting down the total number of alerts an analyst has to deal with. The results shared in this post show how anomaly detection cut a list of 600 million alerts down to a handful of short, easy-to-understand lists of alert anomalies, enabling rapid identification of real threats.
This approach is targeted both at mature security organizations that already have a handle on their alerts and are looking to improve prioritization or streamline processes, and especially at those that feel like they are drowning in alerts, are constantly putting out fires, or can’t hire enough good people to deal with the volumes of data they are seeing.
Anomaly detection is just one of many ways that we help to reduce alert volume, prioritize investigations, and identify real security incidents. Keep following us for more information about how we help streamline security investigations using graph analytics and sophisticated alert scoring.
