In InfoSec, we generally have access to large volumes of unlabeled data, while labeled data, when available at all, are scarce. Labeling is expensive because it requires extensive investigations by highly qualified personnel (i.e., security analysts). As a consequence, current InfoSec solutions resort to unsupervised algorithms to detect anomalies. But anomalous does not mean malicious, and the end result is an elevated number of false positives.
In contrast, supervised algorithms can separate data into semantically different categories. In particular, they can differentiate between benign events and different types of malicious attacks. The drawback is that in order to train supervised models, one needs labels! That is where active learning comes into play: active learning is the most cost-efficient approach to transition from an unsupervised setup (unlabeled data) to a supervised setup (labeled data).
What is active learning and when is it useful?
Active learning is a machine learning protocol in which a learning algorithm interactively queries an external source to label selected examples, improving its modeling capabilities with fewer labeled data points.
It’s most commonly applied when:
- Only unlabeled data is available.
- The goal is to train supervised models (classifiers or regressors).
- The external source is a human expert who provides labels, and the labeling process is expensive and/or slow.
Active learning strategies are also useful when, as in the case of InfoSec, the data changes fast and/or in the presence of concept drift (changes over time in the relationship between inputs and labels). In these cases, new labeled examples are continuously needed to retrain the models and adjust to the changes in the data.
How does active learning work?
Active learning strategies start with a pool of unlabeled data and repeat the two following steps until satisfactory performance is reached or until the labeling budget is met:
- Select a small subset of examples and request their labels from the human expert.
- Fit a supervised model with the available labeled examples.
At each iteration of this process, the number of labeled examples available to train the supervised model increases and, as a result, model accuracy improves. To reduce the total labeling cost, it is critical to carefully select the subset of examples to be labeled. In fact, when good selection strategies are adopted, the labeling requirements can decrease drastically.
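The loop described above can be sketched in a few lines of Python. This is a minimal illustration, assuming scikit-learn is available, and it uses uncertainty sampling (querying the examples the model is least sure about) as one possible selection strategy; the synthetic dataset, logistic regression model, seed-set size, and labeling budget are all placeholder choices for the sketch.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# Assumes scikit-learn; the dataset and parameters are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Unlabeled pool; here y stands in for the human expert's answers.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

labeled = list(rng.choice(len(X), size=10, replace=False))  # small seed set
budget, batch_size = 50, 10  # total labeling budget, queries per iteration
model = LogisticRegression(max_iter=1000)

while len(labeled) < budget:
    # Step 2: fit a supervised model on the currently labeled examples.
    model.fit(X[labeled], y[labeled])

    # Step 1 (next round): score the unlabeled pool by uncertainty,
    # i.e., how close the predicted probability is to 0.5.
    unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
    probs = model.predict_proba(X[unlabeled])[:, 1]
    uncertainty = -np.abs(probs - 0.5)

    # "Ask the expert" for labels of the most uncertain examples.
    query = unlabeled[np.argsort(uncertainty)[-batch_size:]]
    labeled.extend(query.tolist())

# Final model trained on the full labeled budget.
model.fit(X[labeled], y[labeled])
print(len(labeled), model.score(X, y))
```

In a real InfoSec deployment the `y[query]` lookup would be replaced by an analyst labeling the queried events, and the loop would stop when performance is satisfactory or the labeling budget is exhausted.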
We will be writing about the benefits and challenges involved in applying active learning strategies to solve InfoSec problems in the coming weeks, so now is the best time to subscribe to the PatternEx blog, and follow us on Twitter and LinkedIn.
If you want to learn more...
Active learning is a mature research field and there is great documentation available. We recommend the ICML 2009 tutorial on active learning by Sanjoy Dasgupta and John Langford. For a deeper, more theoretical introduction to active learning, we recommend reading: