
Using Anomaly Detection to Reduce 20 Million Alerts Per Day to 200

Alert Fatigue. Alert Triage. Alert Prioritization. Security teams at many organizations generate more alerts than they can effectively handle. Their firewalls are too chatty. Their antivirus solutions raise the same alerts over and over. Their threat intel feeds generate too many false positives. Manually reviewing and whitelisting all of this is too much work; what teams need is a system that automatically identifies the important alerts. In this post, we describe how we leverage anomaly detection to reduce alert volumes and focus analyst attention on the alerts that matter most.

The Case Study

We have a client whose network and endpoint monitoring solutions together generate more than 5 billion events and 600 million alerts per month – more than 200 alerts per second. The network alerts account for the vast majority of this volume, but the endpoint alerts are not insignificant. They account for 1.3 million alerts per month, more than 1,700 per hour.
The goal of this study was to automatically prioritize the alerts, to ensure that the organization would not miss the important alerts in a sea of uninteresting ones. This would also reduce the burden on the analysts, enabling the organization to spend less time on alert triage and more time investigating and protecting against real threats, doing proactive threat hunting, and taking other proactive security measures.
Below is a summary of the results for a 30-day period:
  Before:              600,000,000+ alerts
  After:               6,200 alerts
  Reduction in alerts: 100,000 to 1

The result was a very manageable average of 200 alerts per day.

The Anomaly Detection Approach

Anyone who has spent a lot of time staring at security alerts has noticed patterns. A user runs an application that generates the same alerts every day. Regularly scheduled updates cause the same alerts to be raised across the company. A particular user has a penchant for downloading web toolbars, which raise a barrage of alerts.
These patterns are incredibly helpful in triaging alerts. But an analyst has limited time and cognitive bandwidth to identify them, encode them, and communicate them to their team. This process is expensive and error-prone, and is never going to identify all the patterns. Instead, we have taken an anomaly detection approach to alert triage, where the anomaly detection algorithms do all of this work for the analyst. The value of this is three-fold.
  • It helps to cut through the noise, those pesky alerts you see every day.
  • It surfaces the anomalies in groups, so analysts are simultaneously investigating and resolving multiple alerts.
  • It presents all the information the algorithms used to identify the alerts to the analyst, so that they have all the background and context the algorithms had to make their decision.
Some of the anomalies we surface are entity-centric. Here, an entity could be a user, host, or IP address. Such anomalies include:
  • Entities generating alerts of types that are rare for that entity.
  • Entities generating spikes in alerts in total or by type.
  • Entities generating abnormal distributions of alerts by type.
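As a minimal sketch of the first bullet, an alert type can be treated as rare for an entity when the entity has a substantial alert history but that type accounts for almost none of it. This is an illustrative helper only, not our production algorithm; the function name and thresholds are hypothetical.

```python
def rare_alert_types(history, today, min_history=50, max_share=0.01):
    """Return alert types that are rare for a single entity.

    history: dict mapping alert_type -> count of past occurrences for
             this entity.
    today:   set of alert types the entity raised today.
    A type counts as rare if the entity has enough overall history and
    this type makes up at most max_share of it (including never seen).
    Thresholds are illustrative, not tuned values.
    """
    total = sum(history.values())
    if total < min_history:
        return set()  # too little history to call anything rare
    return {t for t in today if history.get(t, 0) / total <= max_share}
```

An entity that raises the same antivirus alert hundreds of times would not trip this check, while a first-ever PowerShell-execution alert on that same entity would.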
Similarly, the above alerts could be grouped by severity, source, or any other information about the alert. Other anomalies focus more on organization-wide statistics. For example:
  • Alert types that are being observed on more entities than usual.
  • Alerts on specific indicators that are being observed on more entities than usual.
This distinction is an important one. The former technique is optimized for identifying specific compromised entities. The latter technique is optimized to identify attacks targeting larger portions of the organization.
The anomaly detection techniques applied here are completely unsupervised, and don’t require training data. The goal is for them to be immediately useful, without any feedback or manual tuning. As analysts add feedback, it can be used to refine the approach even further.
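To make the unsupervised idea concrete, here is a minimal sketch of one way a per-entity spike in distinct alert types could be flagged with no training data: score each day against the entity's own history using the median and median absolute deviation (a modified z-score). This is an assumption-laden illustration, not the actual algorithms described above; the function name, data shape, and threshold are hypothetical.

```python
from statistics import median

def flag_spikes(daily_counts, threshold=3.5):
    """Flag (day, entity) pairs whose alert-type count is a robust outlier.

    daily_counts: dict mapping entity -> list of (day, count) pairs, where
    count is the number of distinct alert types the entity raised that day.
    Uses the modified z-score (median and MAD), so only the entity's own
    history is needed -- no labels, no training data.
    """
    spikes = []
    for entity, series in daily_counts.items():
        counts = [c for _, c in series]
        med = median(counts)
        mad = median(abs(c - med) for c in counts)
        for day, count in series:
            # 0.6745 scales the MAD so the score is comparable to a z-score.
            score = 0.6745 * (count - med) / mad if mad else 0.0
            if score > threshold:
                spikes.append((day, entity, count))
    return sorted(spikes)
```

A host that usually raises two or three alert types per day but suddenly raises seventeen scores far above the threshold, while its ordinary days score near zero.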

The Results

This section walks through examples of the anomaly lists and explains why the highlighted anomalies were significant. Hostnames, dates, and IP addresses have been changed.
We begin with entity-specific anomalies. The following list shows hosts that had spikes in the number of different types of alerts they raised on a particular day. There were 6 spikes observed on 5 hosts.
April 5th 2017 ray-mbp had 7 different types of alerts, a spike
April 11th 2017 eddie-win had 17 different types of alerts, a spike
April 19th 2017 colin-mbp had 5 different types of alerts, a spike
April 19th 2017 eddie-win had 27 different types of alerts, a spike
April 29th 2017 eddie-mbp had 8 different types of alerts, a spike
April 29th 2017 vivek-mbp had 9 different types of alerts, a spike

Investigations of these spikes revealed a few different types of security incidents. Three of the spikes were caused by users installing applications packaged with malware. One involved installation of suspicious software. Two were quickly determined to be benign, caused by normal installation activity. In one case, this activity was followed closely by visits to .ru and .cn websites and execution of multiple files that raised alerts.
The following graph shows the host and some of the processes that generated alerts and their relationships to each other. It also shows that there was a simultaneous alert on some network communication between the host and an external IP address.

Other lists of entity-centric anomalies yielded other interesting results, including:
  • A SIP exploit attempt originating in Germany against a large block of IP addresses, identified by the spike in activity from the attacker IP.
  • A host connecting with known Zeus CnC servers using curl and generating multiple simultaneous alerts from its web browser, identified by the strange alert types being raised on that host.
  • A host that had been hit with an exploit, identified by unusual parent-child process relationships in the alert.
  • An internal user trying to brute-force a password to an internal host, identified by the spike in alert activity.
Among the organization-wide anomalies were the following:
  • 33 hosts all infected with the same malware, identified by a spike in the number of machines generating the alert.
  • 3 users using remote access software, identified by a spike in the number of users generating the alert.
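The organization-wide findings above can be sketched in a similar unsupervised style: count the distinct entities raising each alert type per day, then flag days that sit well above that alert type's own historical baseline. Again, this is a simplified illustration under assumed names and thresholds, not the production logic.

```python
from statistics import mean, pstdev

def flag_widespread(alert_log, min_sigma=3.0, min_entities=3):
    """Flag (day, alert_type) pairs seen on unusually many distinct entities.

    alert_log: iterable of (day, alert_type, entity) tuples. For each alert
    type we track how many distinct entities raised it each day, then flag
    days far above that type's own historical mean. Thresholds illustrative.
    """
    per_type_day = {}
    for day, alert_type, entity in alert_log:
        per_type_day.setdefault(alert_type, {}).setdefault(day, set()).add(entity)

    flagged = []
    for alert_type, days in per_type_day.items():
        counts = [len(entities) for entities in days.values()]
        mu, sigma = mean(counts), pstdev(counts)
        for day, entities in days.items():
            n = len(entities)
            if n >= min_entities and n > mu + min_sigma * sigma:
                flagged.append((day, alert_type, n))
    return sorted(flagged)
```

An alert that normally fires on one host per day but suddenly fires on dozens, as in the malware outbreak above, stands out immediately under this kind of check.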

Conclusion

Anomaly detection is an effective tool for prioritizing the most important alerts and significantly cutting down on the total number of alerts an analyst has to deal with. The results shared in this post show how anomaly detection effectively cut a list of 600 million alerts down to a handful of short, easy-to-understand lists of alert anomalies, enabling rapid identification of real threats.
This approach is valuable both to mature security organizations that already have a handle on their alerts and are looking to improve prioritization or streamline processes, and especially to those who feel they are drowning in alerts, are constantly putting out fires, or can't hire enough good people to deal with the volumes of data they are seeing.
Anomaly detection is just one of many ways that we help to reduce alert volume, prioritize investigations, and identify real security incidents. Keep following us for more information about how we help streamline security investigations using graph analytics and sophisticated alert scoring.
