This is the first entry in a series introducing the PatternEx ThreatEx Labs Domain Detection Initiative (aka finding malicious domains before anyone else!).
What started as a simple prototype to understand whether we could use machine learning to:
- detect malicious domains missed by existing blacklists
- detect malicious domains faster than existing blacklists
has grown into a broad and ambitious project. We think this will appeal to anyone interested in the intersection of cybersecurity and AI, from hard-core security professionals to AI experts. If that resonates with you, we bet you will enjoy this series!
As part of this initiative, we will provide a daily feed of malicious domains, release labeled datasets, explain open-source ML projects for domain detection, and propose a series of challenges to the community. But before we jump into details, we wanted to provide a quick idea of what motivated us to start the project in the first place.
Why domain detection—doesn’t everyone do this?
From a research perspective, using ML to detect malicious domains is not uncharted territory (check out this survey to learn more). At the same time, most organizations have solutions in place to block malicious domains based on threat intelligence feeds, which in turn leverage ML to identify IoCs. All of that is good, but the fact is that malicious domains are still slipping through the cracks.
For instance, we repeatedly observe so-called social engineering domains in the production traffic of different organizations. These domains trick users into visiting malicious sites with false promises such as winning a prize, downloading free software, or even freeing the system of a detected virus (note the irony in this last one!). Below we show a hand-picked list of domains observed in live traffic, together with their detection history on VirusTotal, that will help us make a few points:
| Last scan | VT detections | URL |
| --- | --- | --- |
| 2018-05-25 | 2/67 | http://thesoftware-center2updating[.]trade/ |
| 2018-06-03 | 2/67 | http://goodcenter2updating[.]review/ |
| 2018-06-15 | 1/67 | http://systemcenter2updating[.]download/ |
| 2018-06-19 | 4/67 | http://systemcenter2updating[.]review/ |
| 2018-08-04 | 4/67 | http://bestcenter2updatingsoft[.]date/ |
| 2018-08-17 | 4/70 | http://yourbiggestcenter2updating[.]review/ |
| 2018-08-17 | 4/70 | http://yourbiggestcenter2updating[.]stream/ |
| 2018-08-19 | 6/70 | http://goodcenter2updatingsoft[.]win/ |
| 2018-08-20 | 5/67 | http://bestcenter2updatingsoft[.]bid/ |
| 2018-08-20 | 4/67 | http://goodcenter2updatingsoft[.]trade/ |
| 2018-09-24 | 6/68 | http://goodcenter2updating[.]download/ |
| 2018-09-29 | 7/68 | http://bestcenterplaceforstreamsafe[.]review/ |
| 2018-10-02 | 3/67 | http://thesoftware-center2updating[.]win/ |
| 2018-10-03 | 8/68 | http://bestcenterplace4streams[.]stream/ |
- Every one of these domains is reported as malicious on VirusTotal, but only by a small minority of engines (between 1 and 8 out of 67 to 70).
- The scan dates show that some of these domains were scanned about five months ago (2018-05-25), while others were scanned very recently (2018-10-03); this suggests that the domains are likely still active.
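To make the two observations concrete, here is a minimal Python sketch that parses a few rows of the table above and computes the span of scan dates and the highest share of engines flagging a domain. The helper name `parse_row` and the choice of three sample rows are ours, purely for illustration.

```python
from datetime import date

# Three rows copied from the table above: last scan, VT result, defanged URL.
ROWS = """\
2018-05-25 2/67 http://thesoftware-center2updating[.]trade/
2018-09-29 7/68 http://bestcenterplaceforstreamsafe[.]review/
2018-10-03 8/68 http://bestcenterplace4streams[.]stream/
"""


def parse_row(line):
    """Split one row into (scan date, flagging engines, total engines, URL)."""
    scan, ratio, url = line.split()
    flagged, total = (int(part) for part in ratio.split("/"))
    return date.fromisoformat(scan), flagged, total, url


rows = [parse_row(line) for line in ROWS.splitlines()]

# Number of days between the earliest and the latest scan in the sample.
span_days = (max(r[0] for r in rows) - min(r[0] for r in rows)).days

# Largest fraction of engines that flag any single domain in the sample.
max_share = max(flagged / total for _, flagged, total, _ in rows)

print(f"scan dates span {span_days} days; at most {max_share:.0%} of engines flag a domain")
```

Even the most widely detected domain in this sample is flagged by well under a fifth of the engines, which is the gap the rest of this post is about.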
Given the large number of such domains we saw, we reasoned that they must be generated and registered automatically on an ongoing basis and that, as a consequence, they exhibit distinctive patterns that can be leveraged for automated identification. This means that, to complement blacklists, we can train machine learning models on examples of malicious and benign domains and use them to detect new malicious domains. Moreover, we believe that the only way to counter large-scale malicious infrastructure such as the one behind these domains is to automate the detection process. We therefore set out to develop a machine learning model to detect these domains and to verify whether it could catch malicious domains missed by existing blacklists.
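As a rough illustration of the idea (and not the model we actually deployed, which later posts in the series describe), the sketch below trains a tiny naive Bayes classifier on character trigrams of domain labels. The malicious examples come from the table above; the benign labels (`wikipedia`, `github`, `python`, `mozilla`) and the class name `TrigramNaiveBayes` are assumptions made only for this example.

```python
import math
from collections import Counter


def trigrams(name):
    """Return the overlapping character trigrams of a domain label."""
    return [name[i:i + 3] for i in range(len(name) - 2)]


class TrigramNaiveBayes:
    """Multinomial naive Bayes over character trigrams with add-one smoothing."""

    def fit(self, labelled):
        self.counts = {}        # label -> Counter of trigram frequencies
        self.docs = Counter()   # label -> number of training examples
        for name, label in labelled:
            self.counts.setdefault(label, Counter()).update(trigrams(name))
            self.docs[label] += 1
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, name):
        n_docs = sum(self.docs.values())
        scores = {}
        for label, counts in self.counts.items():
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(self.docs[label] / n_docs)  # class prior
            for gram in trigrams(name):
                score += math.log((counts[gram] + 1) / total)
            scores[label] = score
        return max(scores, key=scores.get)


TRAINING = [
    # Malicious SLDs taken from the table above.
    ("goodcenter2updating", "malicious"),
    ("systemcenter2updating", "malicious"),
    ("bestcenter2updatingsoft", "malicious"),
    # Benign labels: illustrative assumptions, not part of our datasets.
    ("wikipedia", "benign"),
    ("github", "benign"),
    ("python", "benign"),
    ("mozilla", "benign"),
]

model = TrigramNaiveBayes().fit(TRAINING)
print(model.predict("yourbiggestcenter2updating"))  # an SLD not seen in training
```

With so few examples this is only a toy, but it shows why shared substrings such as `center2updating` make automatically generated domains recognizable to a model even when the exact name has never been seen before.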
Step 1: Us vs Them (VirusTotal)
We trained and deployed a first iteration of detection models while keeping track of the detections, so that we could check whether (and when) the detected domains were later reported on VirusTotal. That analysis showed that we were identifying many domains weeks before they were reported on any blacklist. These results will be presented at the IEEE BigData conference in December 2018 (http://cci.drexel.edu/bigdata/bigdata2018/). Here is a preview: a few domains identified with a machine learning model that remain unreported on VirusTotal as of this writing (Oct 10, 2018):
| Last scan | VT detections | URL |
| --- | --- | --- |
| 2018-10-10 | 0/67 | http://bestcenterplace4streams[.]review/ |
| 2018-10-10 | 0/67 | http://bestcenterplaceforstream[.]date/ |
| 2018-10-10 | 0/67 | http://bestcenterplaceforstreamfree[.]stream/ |
| 2018-10-10 | 0/67 | http://bestcenterplacestreaming[.]date/ |
| 2018-10-10 | 0/67 | http://bestcenterplacestreamsafe[.]trade/ |
These domains, although similar to the ones shown above, were not flagged by any engine on VirusTotal as of this writing (2018-10-10). In one case, the exact same second-level domain (SLD), bestcenterplace4streams, is reported by 8/68 engines under .stream but shows as clean under .review. In the other cases, small variations of the SLD are enough to bypass detection.
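The SLD comparison can be sketched in a few lines of Python. The snippet splits defanged URLs into SLD and TLD and lists the SLDs that are flagged under one TLD but clean under another. The detection counts are copied from the two tables in this post; the helper `split_domain` is a naive assumption of ours that only handles hosts of the form `sld.tld`, with no subdomains or multi-label suffixes.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Defanged URL -> number of flagging engines, copied from the tables in this post.
VT_RESULTS = {
    "http://bestcenterplace4streams[.]stream/": 8,
    "http://bestcenterplace4streams[.]review/": 0,
    "http://goodcenter2updating[.]review/": 2,
    "http://goodcenter2updating[.]download/": 6,
}


def split_domain(defanged_url):
    """Return (SLD, TLD) for a defanged URL such as http://name[.]tld/."""
    host = urlparse(defanged_url.replace("[.]", ".")).hostname
    sld, tld = host.rsplit(".", 1)
    return sld, tld


# Group detection counts by SLD, keeping one entry per TLD.
by_sld = defaultdict(dict)
for url, flagged in VT_RESULTS.items():
    sld, tld = split_domain(url)
    by_sld[sld][tld] = flagged

# SLDs flagged under at least one TLD and clean under at least one other.
inconsistent = sorted(
    sld
    for sld, tlds in by_sld.items()
    if any(n > 0 for n in tlds.values()) and any(n == 0 for n in tlds.values())
)
print(inconsistent)
```

On this sample the only SLD with inconsistent coverage is `bestcenterplace4streams`, the case called out above.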
Step 2: Going closer to the source
Having shown the value added by the detection models, we decided to expand the analysis beyond the network traffic observed at several organizations. It turns out that there are invaluable open sources of current production data that let us identify many of these domains very quickly. In particular, we mine Certificate Transparency logs and the DNS zone files retrieved via the Centralized Zone Data Service. The next figure shows a schematic representation of the strategy implemented to quickly identify malicious domains:
Schematic representation of the domain detection initiative. We continuously retrieve and analyze live data sources to detect unreported malicious domains.
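One simple way to surface candidate domains from zone files is to diff consecutive snapshots and keep the names that appear only in the newer one. The sketch below is an assumption about how such a step could look, not a description of our pipeline: it expects one record per line in the `owner TTL class type rdata` layout, and the snapshot contents and the `.example` names are made up for illustration. Real zone files can omit fields or rely on `$ORIGIN`, which this sketch ignores.

```python
def registered_names(zone_text):
    """Collect the owner names of NS records from zone-file text."""
    names = set()
    for line in zone_text.splitlines():
        if line.startswith(";"):
            continue  # comment line
        fields = line.split()
        # Assumed layout: owner, TTL, class, type, rdata.
        if len(fields) >= 5 and fields[3].upper() == "NS":
            names.add(fields[0].rstrip(".").lower())
    return names


def newly_registered(previous_zone, current_zone):
    """Return names present in the current snapshot but not in the previous one."""
    return sorted(registered_names(current_zone) - registered_names(previous_zone))


# Hypothetical snapshots of the same zone on two consecutive days.
YESTERDAY = """\
alpha.example. 172800 in ns ns1.hosting.example.
"""
TODAY = """\
alpha.example. 172800 in ns ns1.hosting.example.
beta.example. 172800 in ns ns1.hosting.example.
beta.example. 172800 in ns ns2.hosting.example.
"""

new_names = newly_registered(YESTERDAY, TODAY)
print(new_names)
```

The names returned by such a diff would then be scored by the detection models, so that newly registered domains are assessed before they ever show up in an organization's traffic.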
Getting hands on
The ins and outs of this project are too much for a single blog post, so we will describe the resulting domain feed, datasets, code, and insights in a series of posts to come.
The series will consist of the following entries:
- Blog 2 - The ThreatEx domain feed
- Blog 3 - Using deep learning to detect social engineering domains (including code samples and datasets)
- Blog 4 - Using StreamingPhish to rapidly detect phishing domains (including end-to-end code)
- Blog 5 - ThreatEx’s AI framework for domain detection
- Blog 6 - The ThreatEx threat anticipation challenge (including code and dataset releases)
If you have any ideas, comments, or feedback to share, please don’t hesitate to get in touch with us at firstname.lastname@example.org.