
Using Anomaly Detection to Reduce 20 Million Alerts Per Day to 200

Alert Fatigue. Alert Triage. Alert Prioritization. Security teams at many organizations generate more alerts than they can effectively handle. Their firewalls are too chatty. Their antivirus solution raises the same alerts over and over. Their threat intel feeds generate too many false positives. Going through the alerts and manually whitelisting things is too much work; what these teams need is a system that automatically identifies the important alerts. In this post, we describe how we leverage anomaly detection to help reduce alert volumes and focus analyst attention on the most important alerts.

The Case Study

We have a client whose network and endpoint monitoring solutions together generate more than 5 billion events and 600 million alerts per month – more than 200 alerts per second. The network alerts account for the vast majority of this volume, but the endpoint alerts are not insignificant. They account for 1.3 million alerts per month, more than 1,700 per hour.
The goal of this study was to automatically prioritize the alerts, to ensure that the organization would not miss the important alerts in a sea of uninteresting ones. This would also reduce the burden on the analysts, enabling the organization to spend less time on alert triage and more time investigating and protecting against real threats, hunting for threats proactively, and taking other preventive security measures.
Below is a summary of the results for a 30-day period:
Before          After    Reduction in Alerts
600,000,000+    6,200    100,000 to 1

The result was a very manageable average of 200 alerts per day.
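
As a quick sanity check (a back-of-the-envelope sketch using only the rounded figures quoted in this post), the per-day and per-second rates follow directly from the 30-day totals:

```python
# Back-of-the-envelope check of the alert volumes quoted in this post.
# All inputs are the rounded figures above, not exact counts.
alerts_before = 600_000_000   # alerts over the 30-day period
alerts_after = 6_200          # anomaly list entries over the same period
days = 30

print(f"Before: {alerts_before / days:,.0f} alerts/day "
      f"(~{alerts_before / (days * 86_400):,.0f} per second)")
print(f"After:  {alerts_after / days:,.0f} alerts/day")
print(f"Reduction: roughly {alerts_before / alerts_after:,.0f} to 1")
```

Running this gives about 20 million alerts per day before, roughly 207 per day after, and a reduction ratio close to 100,000 to 1, which is where the numbers in the title and the table above come from.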

The Anomaly Detection Approach

Anyone who has spent a lot of time staring at security alerts has noticed patterns. A user runs an application that generates the same alerts every day. Regularly scheduled updates cause the same alerts to be raised across the company. A particular user has a penchant for downloading web toolbars, which raise a barrage of alerts.
These patterns are incredibly helpful in triaging alerts. But an analyst has limited time and cognitive bandwidth to identify them, encode them, and communicate them to their team. That process is expensive and error-prone, and it will never identify all the patterns. Instead, we have taken an anomaly detection approach to alert triage, in which the anomaly detection algorithms do this work for the analyst. The value of this is three-fold.
  • It cuts through the noise: the pesky alerts you see every day.
  • It surfaces anomalies in groups, so analysts investigate and resolve multiple related alerts at once.
  • It presents to the analyst all the information the algorithms used to identify the alerts, so they have the same background and context the algorithms had when making their decision.
Some of the anomalies we surface are entity-centric. Here, an entity could be a user, host, or IP address. Such anomalies include:
  • Entities generating alerts of types that are rare for that entity.
  • Entities generating spikes in alerts in total or by type.
  • Entities generating abnormal distributions of alerts by type.
Similarly, the above alerts could be grouped by severity, source, or any other information about the alert. Other anomalies focus more on organization-wide statistics. For example:
  • Alert types that are being observed on more entities than usual.
  • Alerts on specific indicators that are being observed on more entities than usual.
This distinction is an important one. The former technique is optimized for identifying specific compromised entities. The latter technique is optimized to identify attacks targeting larger portions of the organization.
The anomaly detection techniques applied here are completely unsupervised, and don’t require training data. The goal is for them to be immediately useful, without any feedback or manual tuning. As analysts add feedback, it can be used to refine the approach even further.
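
To make this concrete, below is a minimal sketch of two of the checks described above: an entity-centric check that flags days on which a host raises a spike in the number of distinct alert types (the same kind of statistic shown in the results below), and an organization-wide check that flags alert types suddenly observed on many more entities than usual. This is an illustration under our own simplifying assumptions, not the exact algorithms used in the case study: the `alerts` DataFrame, its column names, and the robust z-score threshold are all hypothetical.

```python
import pandas as pd

def entity_type_spikes(alerts: pd.DataFrame, threshold: float = 3.5) -> pd.DataFrame:
    """Entity-centric check: flag (entity, day) pairs where a host raises an
    unusually high number of distinct alert types relative to its own history.

    `alerts` is a hypothetical DataFrame with columns 'entity', 'alert_type',
    and a datetime column 'timestamp'. The check is unsupervised: each entity
    is compared only against its own baseline via a robust z-score
    (median / median absolute deviation), so no labels or training data are needed.
    """
    # Count distinct alert types per entity per day.
    daily = (
        alerts.assign(day=alerts["timestamp"].dt.floor("D"))
              .groupby(["entity", "day"])["alert_type"]
              .nunique()
              .rename("n_types")
              .reset_index()
    )
    return _robust_spikes(daily, key="entity", value="n_types", threshold=threshold)

def alert_type_spread_spikes(alerts: pd.DataFrame, threshold: float = 3.5) -> pd.DataFrame:
    """Organization-wide check: flag (alert_type, day) pairs where an alert type
    is observed on far more distinct entities than is usual for that type."""
    # Count distinct entities per alert type per day.
    daily = (
        alerts.assign(day=alerts["timestamp"].dt.floor("D"))
              .groupby(["alert_type", "day"])["entity"]
              .nunique()
              .rename("n_entities")
              .reset_index()
    )
    return _robust_spikes(daily, key="alert_type", value="n_entities", threshold=threshold)

def _robust_spikes(daily: pd.DataFrame, key: str, value: str, threshold: float) -> pd.DataFrame:
    """Score each daily count against its group's median/MAD baseline and keep spikes."""
    median = daily.groupby(key)[value].transform("median")
    mad = (daily[value] - median).abs().groupby(daily[key]).transform("median")
    # 0.6745 rescales the MAD so the score is roughly comparable to a standard z-score.
    daily = daily.assign(score=0.6745 * (daily[value] - median) / mad.replace(0, 1.0))
    return daily[daily["score"] > threshold].sort_values(["day", key])
```

The distinction described above maps directly onto the two functions: `entity_type_spikes` is geared toward finding individual compromised entities, while `alert_type_spread_spikes` is geared toward attacks that touch a larger portion of the organization. Analyst feedback can later be layered on top, for example by suppressing or down-weighting combinations that analysts mark as expected.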

The Results

This section walks through examples of the anomaly lists and explains why the highlighted anomalies were significant. Hostnames, dates, and IP addresses have been changed.
We begin with entity-centric anomalies. The following list shows hosts that had spikes in the number of different types of alerts they raised on a particular day. There were 6 spikes observed across 5 hosts.
April 5th 2017 ray-mbp had 7 different types of alerts, a spike
April 11th 2017 eddie-win had 17 different types of alerts, a spike
April 19th 2017 colin-mbp had 5 different types of alerts, a spike
April 19th 2017 eddie-win had 27 different types of alerts, a spike
April 29th 2017 eddie-mbp had 8 different types of alerts, a spike
April 29th 2017 vivek-mbp had 9 different types of alerts, a spike

Investigating these spikes revealed a mix of findings. Three of the spikes were caused by users installing applications packaged with malware, and one involved installation of suspicious software. The remaining two were quickly determined to be benign, caused by normal installation activity. In one case, the installation activity was followed closely by visits to .ru and .cn websites and execution of multiple files that raised alerts.
The following graph shows the host and some of the processes that generated alerts and their relationships to each other. It also shows that there was a simultaneous alert on some network communication between the host and an external IP address.

Other lists of entity-centric anomalies yielded additional interesting results, including:
  • A SIP exploit attempt originating in Germany against a large block of IP addresses, identified by the spike in activity from the attacker IP.
  • A host connecting to known Zeus CnC servers using curl and generating multiple simultaneous alerts from its web browser, identified by the strange alert types being raised on that host.
  • A host that had been hit with an exploit, identified by unusual parent-child process relationships in the alert.
  • An internal user trying to brute-force a password to an internal host, identified by the spike in alert activity.
Among the organization-wide anomalies were the following:
  • 33 hosts all infected with the same malware, identified by a spike in the number of machines generating the alert.
  • 3 users using remote access software, identified by a spike in the number of users generating the alert.

Conclusion

Anomaly detection is an effective tool for prioritizing the most important alerts and significantly cutting down the total number of alerts an analyst has to deal with. The results shared in this post show how anomaly detection cut a list of 600 million alerts down to a handful of short, easy-to-understand lists of alert anomalies, enabling rapid identification of real threats.
This approach is targeted both at mature security organizations that already have a handle on their alerts and are looking to improve prioritization or streamline processes, and especially at those that feel like they are drowning in alerts, are constantly putting out fires, or can't hire enough good people to deal with the volumes of data they are seeing.
Anomaly detection is just one of many ways that we help to reduce alert volume, prioritize investigations, and identify real security incidents. Keep following us for more information about how we help streamline security investigations using graph analytics and sophisticated alert scoring.
