In a recent article on the OPM hack, the author describes a pretty typical security situation for a large enterprise:
The Office of Personnel Management repels 10 million attempted digital intrusions per month—mostly the kinds of port scans and phishing attacks that plague every large-scale Internet presence—so it wasn’t too abnormal to discover that something had gotten lucky and slipped through the agency’s defenses.
The sheer volume of attacks makes automated systems essential for security. While humans can inspect packages coming into the building, only a computer can work quickly enough to inspect packets. Firewalls are the prototypical example: they allow certain traffic through according to a set of rules based on the source and destination IPs and the ports and protocols being used.
In recent years, there's been a lot of buzz about machine learning in cybersecurity--wouldn't it be great if your automated system could learn and adapt, stop threats you don’t even know about and find suspicious shifts in behavior? Given the success of deep learning at recognizing images and speech, of latent factors for music recommendation systems, of reinforcement learning for self-driving cars, it's reasonable to expect that cybersecurity might be one of the next fields to get a big boost from machine learning.
In this post we will discuss the three basic approaches to designing an automated system to detect things (threats or cats). We will talk about the situations in which each is effective and lay out the consequences for cybersecurity. Spoiler alert: more expert knowledge is needed for systems hoping to catch an APT than for those trying to distinguish a dachshund from a Siamese.
Let's start, as must all things online, with cats (and dogs). Say you're trying to build an automated system to distinguish cats from dogs in photos.
The most straightforward way would be with a bunch of rules, like a flowchart from the back page of Wired. For example:
* If the animal has vertical pupils, it's a cat. If you can't see the eyes, then check the mouth.
* If the mouth juts out from the face, it's a dog. If not, it could be a cat or a flat-faced dog like a pug.
* If the shape of the ears,...
You explicitly specify which features of the photo to pay attention to and give rules to follow based on the values of those features.
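The first two rules above can be sketched in a few lines of Python. This is only an illustration: the feature names (`pupil_shape`, `snout_protrudes`) and the dict format are assumptions, and extracting those features from a real photo (the hard part) is not shown.

```python
def classify_by_rules(photo):
    """Classify a photo using hand-written rules over hand-picked features.

    `photo` is a hypothetical dict of features already extracted from the
    image. A missing key means that feature is not visible in the photo.
    """
    # Rule 1: vertical pupils mean cat.
    if photo.get("pupil_shape") == "vertical":
        return "cat"
    # Eyes hidden (or not decisive, an assumption of this sketch):
    # fall back to the mouth.
    if photo.get("snout_protrudes") is True:
        return "dog"
    # Flat face: could be a cat or a flat-faced dog such as a pug.
    # Further rules (ears, and so on) would go here.
    return "unknown"


print(classify_by_rules({"pupil_shape": "vertical"}))
print(classify_by_rules({"snout_protrudes": True}))
```

Every branch is something a person decided in advance, which is what makes this approach easy to read and tedious to extend.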
If you have some labelled data - like a bunch of photos labelled cat or dog - then you can let the machine learn the rules. You tell it the important features of each photo: eye shape, jaw length, ear length, fur color, etc. Then it compares the values of those features across the cat photos and the dog photos and tries to learn rules that will differentiate the two groups. Typically that will take the form of a cat-score and a dog-score, each of which is built from a bunch of those features. Popular algorithms for the rule-learning part include logistic regression and random forests. One advantage of these systems is that the score can be explained: it can be traced back to the features that most influenced it. This makes debugging easier; e.g., if one feature is driving garbage answers (or maybe you've been recording it wrong) you can see the problem. If you picked features that carry enough information to distinguish the animals, then the machine can figure out the best way to combine that information to accomplish your goal.
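Here is a minimal sketch of that idea: a tiny logistic regression written in plain Python, trained on made-up data. The feature names, the six example photos, and the scaling of the measurements to roughly 0-1 are all assumptions for illustration; a real system would use a library and far more data.

```python
import math

# Hypothetical hand-engineered features, scaled to roughly 0-1.
FEATURES = ["jaw_length", "ear_length", "vertical_pupils"]

# Made-up labelled photos: (feature values, label), 1 = cat, 0 = dog.
PHOTOS = [
    ([0.20, 0.40, 1.0], 1),
    ([0.25, 0.50, 1.0], 1),
    ([0.30, 0.45, 0.0], 1),
    ([0.90, 1.00, 0.0], 0),
    ([0.70, 0.80, 0.0], 0),
    ([0.60, 0.70, 0.0], 0),
]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cat_score(x, weights, bias):
    """Probability-like score that the photo shows a cat."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(rows, epochs=2000, lr=0.5):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in rows:
            err = cat_score(x, weights, bias) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def explain(x, weights):
    """Per-feature contribution to the log-odds, largest magnitude first."""
    parts = [(name, w * xi) for name, w, xi in zip(FEATURES, weights, x)]
    return sorted(parts, key=lambda part: abs(part[1]), reverse=True)


WEIGHTS, BIAS = train(PHOTOS)
new_photo = [0.85, 0.95, 0.0]  # long jaw, long ears, round pupils
print(cat_score(new_photo, WEIGHTS, BIAS))
print(explain(new_photo, WEIGHTS))
```

The `explain` function is the point of the exercise: because every feature has a name, each score can be broken down into named contributions.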
If you have enough labelled data, then you can let the machine learn the features and the rules. You just give it the raw pixels and the label for each photo and say "go!". Deep neural nets are a popular algorithm in this category. They have been quite successful for image recognition when millions of labelled images are available. The machine needs a lot of data because you are asking it first to learn to recognize things like fur and eyes, and then to learn what to do with that information to make a decision. These methods can be hard to debug or explain, because none of the learned features come with names.
To summarize, you might call these three approaches:
- Hand-written rules (Human-engineered rules with human-engineered features)
- Machine learning with human-engineered features
- Machine learning with machine-learned features
Let's see how this plays out for online music recommendations:
1. Beats (now part of Apple Music) was acquired for its excellent playlists. Expert curators assemble the best tracks and set rules for cycling them.
2. Pandora hired hundreds of music analysts to listen to every song and score it for 400 features, ranging from "strong horn lines" to "mood". The application learns from your likes and dislikes which features are important for you.
3. Spotify takes the play matrix (who played which songs how many times) and tries to learn features that could explain the patterns in popularity. This method, where you use a massive collection of interactions or ratings to try to uncover hidden features, is called latent factor analysis. (For their popular Discover Weekly feature, they use methods from 2 as well. Very few production systems rely entirely on machine-learned features.)
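To make latent factor analysis concrete, here is a toy matrix factorization in plain Python. It is a sketch of the general technique, not a description of any company's production system. The play matrix, the number of factors `k`, and the learning settings are all made up, and for simplicity it treats a zero as an observed "played zero times", which real systems handle more carefully.

```python
import random

# Hypothetical play matrix: rows are listeners, columns are songs,
# entries are play counts. Songs 0-1 and songs 2-3 form two "tastes".
PLAYS = [
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
]


def predict(user_vec, item_vec):
    # Predicted play count is the dot product of the two factor vectors.
    return sum(a * b for a, b in zip(user_vec, item_vec))


def factorize(matrix, k=2, epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Learn k hidden factors per listener and per song by SGD."""
    rng = random.Random(seed)
    n_users, n_items = len(matrix), len(matrix[0])
    users = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_users)]
    items = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u in range(n_users):
            for i in range(n_items):
                err = matrix[u][i] - predict(users[u], items[i])
                for f in range(k):
                    uf, vf = users[u][f], items[i][f]
                    users[u][f] += lr * (err * vf - reg * uf)
                    items[i][f] += lr * (err * uf - reg * vf)
    return users, items


def squared_error(matrix, users, items):
    return sum(
        (matrix[u][i] - predict(users[u], items[i])) ** 2
        for u in range(len(matrix))
        for i in range(len(matrix[0]))
    )


users, items = factorize(PLAYS)
print(squared_error(PLAYS, users, items))
print(items)  # one unnamed factor vector per song
```

Note that the learned factors are just columns of numbers. Nothing labels one of them "strong horn lines", which is exactly the explainability trade-off described above.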
Cybersecurity differs from image recognition and music recommendation. A typical enterprise environment is immensely diverse (thousands of distinct processes running on thousands of computers in offices and datacenters around the world), and it produces many kinds of data. A system that takes in all the raw logs from an enterprise without any guidance will have a lot of work to do to build a reasonable internal picture of the world. The setting is also adversarial: people are deliberately trying to sneak past your defenses.
Labelled data are also very hard to come by. Since big attacks are rare, you don't have very many examples of what bad activity looks like. This makes it important to use all the knowledge you have available, a lot of which resides in the heads of security personnel--both the makers and the users of the product. The makers might know things like "a large number of IPs associated with a given URL can be a sign of a fast-flux botnet", and so include the count of distinct IPs as a feature. The users might know that "these servers handle our web traffic so they *should* be acting differently than the rest of the machines", and when they see false alarms triggering, they can add "server type" as a feature.
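The "count of distinct IPs" feature is simple to compute once someone knows to ask for it. A minimal sketch, assuming the input is a list of `(url, ip)` pairs pulled from some resolution log; the record format and the example values are hypothetical.

```python
from collections import defaultdict


def distinct_ip_counts(records):
    """Return {url: number of distinct IPs seen for that url}.

    `records` is an iterable of (url, ip) pairs, a hypothetical format.
    """
    ips_by_url = defaultdict(set)
    for url, ip in records:
        ips_by_url[url].add(ip)
    return {url: len(ips) for url, ips in ips_by_url.items()}


# Made-up records using documentation-only domains and address ranges.
records = [
    ("shop.example.com", "192.0.2.10"),
    ("shop.example.com", "192.0.2.10"),
    ("odd.example.net", "198.51.100.1"),
    ("odd.example.net", "198.51.100.2"),
    ("odd.example.net", "203.0.113.7"),
]
print(distinct_ip_counts(records))
```

The count alone decides nothing; it becomes one named input that a rule or a learned model can weigh alongside others, such as the server type.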
If we know that whiskers might be important for recognizing cats, it's best to make that a feature explicitly. The reason that machine-learned features are so important for image processing is that there is no "whisker pixel". It's pretty hard to say how many whiskers a cat has by looking at the numbers in raw pixel data. In security, the information is much more understandable. One record might show a piece of malware being executed and another record might show a bunch of data being sent to a foreign IP. This makes human-engineered features easier to write and easier to understand.
Imagine a security camera looking at a yard. You are worried about intruders. If you can tell it in advance about the kinds of things it might see--cats, dogs, squirrels, people--it'll do much better than if you just ask it to tell you about strange things. A leaf fell, call the police!
Think about the fire detection problem: is there a fire in my house? Here's how the three approaches break down.
1. Smoke detectors. If you get more than a certain amount of particulate matter, sound the alarm! We know how often those go off without good reason. That's the problem with hand-written rules in a diverse environment.
2. Thermometers. Big sudden changes in temperature are a bad sign. Here a person picks the feature, the rate of temperature change, and the system works out how big a change is unusual.
3. Video cameras. Install a dozen video cameras all over a building, start a few dozen fires yourself to get labelled data, and then build a model. Then you get a foggy morning and it dials the fire department. That's the problem with pure machine-learned features in a changing environment, where your labels come from historical data.
The Sift approach is to focus on human-engineered features, supported by rules and sometimes the rich data that can fuel machine-learned features. Using the fire detection analogy, if the system sees a sudden change in temperature that coincides with a beep from the smoke detector, it passes along a warning to a human operator along with the video feed to provide context. The operator can then quickly decide whether to call the fire department.
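The fire analogy can be written down as a small sketch. To be clear, this illustrates the analogy only and is not Sift's implementation; the function name, the thresholds (`jump_degrees`, `window_seconds`), and the data formats are all assumptions.

```python
def fire_warning(readings, smoke_beeps, jump_degrees=10.0, window_seconds=60):
    """Warn when a sudden temperature rise coincides with a smoke beep.

    readings:    list of (timestamp_seconds, temperature), in time order
    smoke_beeps: list of timestamps at which the smoke detector beeped
    Returns a warning dict for a human operator, or None.
    """
    for (_, temp0), (t1, temp1) in zip(readings, readings[1:]):
        rise = temp1 - temp0
        if rise < jump_degrees:
            continue  # no sudden change between these two readings
        if any(abs(beep - t1) <= window_seconds for beep in smoke_beeps):
            # Both signals agree: hand the decision to a person, with context.
            return {"time": t1, "rise": rise, "context": "attach video feed"}
    return None


readings = [(0, 21.0), (30, 21.5), (60, 35.0)]
print(fire_warning(readings, smoke_beeps=[70]))  # both signals: warning
print(fire_warning(readings, smoke_beeps=[]))    # temperature alone: None
```

Neither signal raises a warning on its own, and the output is a recommendation with context attached rather than an automatic call to the fire department.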
At Sift we take a human-centered approach to machine learning. It's a person who will need to make decisions on the basis of the software's recommendations, and it's a person who knows their environment best. By using good judgement in designing features and providing context to evaluate the resulting alerts, we maximize the overall effectiveness of the system: finding and mitigating threats.