ThreatEx Labs

PatternEx ThreatEx Labs features timely and actionable insights from world-class security researchers with AI expertise.
Readers will enhance their skill sets with tools and knowledge to efficiently leverage AI for information security.

Demoing RELK: The Research Elastic Stack

If you are like me, a typical day involves analyzing thousands to millions of events from malware PCAP (packet capture) data or from logs generated in the laboratory. The bread and butter of that analysis is usually the Elastic Stack (Logstash, Elasticsearch, and Kibana) and Jupyter Notebooks (with Apache Spark, Python, and R). These tools let me organize, visualize, and run advanced computations on attack logs. Both are incredible in their own right, but each serves a distinct purpose.

[Figure: Data Analysis in a Nutshell]

 

RELK

To make this whole process easier, I created an infrastructure called RELK.  The goal for RELK is to create an open-source tool that makes it just as easy to analyze data as it is to collect it.

[Figure: RELK overview]

RELK features:

  • Kafka: A distributed event streaming platform capable of handling trillions of events a day
  • Beats: A lightweight single-purpose data shipper from Elastic
  • Elasticsearch: A highly scalable search and analytics engine
  • Logstash: A dynamic data collection pipeline with an extensible plugin ecosystem.
  • Kibana: An analytics and visualization platform designed to work with Elasticsearch.
  • ES-Hadoop: A library that allows Hadoop jobs (& therefore Spark) to interact with Elasticsearch.
  • Spark: A fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Python and R.
  • GraphFrames: A package for Apache Spark which provides DataFrame-based Graphs.
  • Jupyter Notebook: A web application that allows you to create interactive notebooks.

Each of these services runs in its own Docker container, and the whole stack can be started with one command:

docker-compose up

Demo - PatternEx Malicious Domain Feed

As an example, I will go through the process of ingesting results from our PatternEx Domain Feed.

First, let’s pull and take a look at the dataset:

$ git clone https://github.com/patternex/patternex-feed

The dataset (detections.csv) consists of:
  • the detection date (yyyy-mm-dd)
  • the reported domain (e.g., windowdowngradegreataflash[.]icu)
  • the category (e.g., social engineering)
  • the detection score (the probability of the domain being malicious)
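
Before ingesting the feed, it can help to get a feel for it with a few lines of Python. The following is only a minimal sketch: it assumes the columns appear in the order listed above, and the `Detection` type and `load_detections` helper are hypothetical names of my own, not part of the feed repository. The sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter
from typing import Iterable, List, NamedTuple


class Detection(NamedTuple):
    detection_date: str  # yyyy-mm-dd
    domain: str
    category: str
    score: float  # probability of the domain being malicious


def load_detections(lines: Iterable[str]) -> List[Detection]:
    """Parse CSV lines into Detection records (column order is an assumption)."""
    rows = []
    for row in csv.reader(lines):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        date, domain, category, score = (cell.strip() for cell in row)
        try:
            rows.append(Detection(date, domain, category, float(score)))
        except ValueError:
            continue  # skip a header row or a non-numeric score
    return rows


# Made-up sample rows for illustration only; real data lives in detections.csv.
sample = io.StringIO(
    "2019-03-15,example-one.icu,social engineering,0.97\n"
    "2019-03-15,example-two.com,phishing,0.88\n"
    "2019-03-16,example-three.icu,social engineering,0.91\n"
)
detections = load_detections(sample)
by_category = Counter(d.category for d in detections)
print(by_category.most_common())
```

To run it against the real file, pass an open file handle instead of the sample, for example `load_detections(open("patternex-feed/detections.csv", newline=""))`.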

 

Preparing RELK

Let’s pull the RELK repository and set up our environment:

$ git clone https://github.com/kpolley/RELK

Each service has its own directory and is set up for general use. If a service needs modification, its configuration can be found in its directory.

In our case, all we will need to do is:
 
  1. Set up data ingestion
  2. Create the Logstash template
  3. Create the Elasticsearch index template

 

Data Ingestion

There are a few ways we can ingest data:

  1. Logstash can read directly from a file
    • This is ideal if you're working with a static dataset that does not change

  2. Configure Filebeat to periodically read and send data to Logstash
    • This is ideal if you're working with a dataset that is saved locally and constantly updated

  3. Send logs to a Kafka topic to be read by Logstash
    • This is ideal if the dataset can be streamed directly to Kafka

For our use case, the dataset is a CSV file that is updated daily rather than static, so we will use Filebeat.

By default, Filebeat reads and ships any data inside RELK/filebeat/input_files. This can be configured, but for now let's move our dataset into the input_files directory:

$ mv patternex-feed/detections.csv RELK/filebeat/input_files

 

Data Filtering

Most of the data processing and filtering happens in Logstash. Let's create a Logstash template to parse the events and add fields.

RELK/logstash/pipelines/ptrx_domain_detection.conf
 
input {
    beats {
        port => 5044
        add_field => { "log_type" => "ptrx-domain-detection" }
    }
}
filter {
    if [log_type] == "ptrx-domain-detection" {
        # Assumption: Filebeat ships each CSV row as a raw line in "message",
        # with columns in the order listed in the dataset description above.
        csv {
            columns => ["detection_date", "domain", "category", "score"]
            convert => { "score" => "float" }
        }
        # Here we can do some extra modification to the logs,
        # such as parsing the domain name and extracting the TLD
        grok {
            match => { "domain" => "%{DATA:domain_name}\.%{GREEDYDATA:tld}" }
        }
    }
}
output {
    if [log_type] == "ptrx-domain-detection" {
        stdout { codec => rubydebug } # So that we can see the output in our console

        elasticsearch {
            hosts => ["relk-elasticsearch:9200"]
            index => "ptrx-domain-detection"
        }
    }
}
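
To see what the grok pattern does to a domain before running the whole stack, it can be emulated in Python. This is a rough sketch, assuming that `%{DATA}` behaves like a lazy `.*?` and `%{GREEDYDATA}` like a greedy `.*`; the `split_domain` helper is a hypothetical name used only for illustration.

```python
import re
from typing import Dict, Optional

# Rough regex equivalent of "%{DATA:domain_name}\.%{GREEDYDATA:tld}":
# DATA is a lazy ".*?" and GREEDYDATA is a greedy ".*".
DOMAIN_PATTERN = re.compile(r"(?P<domain_name>.*?)\.(?P<tld>.*)")


def split_domain(domain: str) -> Optional[Dict[str, str]]:
    """Return the fields the grok filter would add, or None if there is no dot."""
    match = DOMAIN_PATTERN.search(domain)
    return match.groupdict() if match else None


print(split_domain("example.icu"))         # {'domain_name': 'example', 'tld': 'icu'}
print(split_domain("mail.example.co.uk"))  # {'domain_name': 'mail', 'tld': 'example.co.uk'}
print(split_domain("localhost"))           # None
```

Note that the lazy match splits at the first dot, so for a multi-label domain the tld field holds everything after the first label. That is fine for a quick look at the data, but a stricter pattern would be needed to isolate the real top-level domain.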

 

Elasticsearch DataType Mapping

By default, Elasticsearch tries to map the data automatically: whole numbers are stored as long, floating-point numbers as float, strings that look like dates as date, and all other strings as text (with a keyword sub-field). That works, but Elasticsearch can also do more specialized things, such as geo queries on lat/lon coordinates or subnet queries on IP addresses, and it can only do so if it knows the datatype of those fields. An index template makes the mapping explicit. Here is one for our dataset:

RELK/elasticsearch/output_template/ptrx_domain_template.json
 
{
    "index_patterns": [ "ptrx-domain-detection" ],
    "version": 20190315,
    "settings": {
        "index.refresh_interval": "5s"
    },
    "mappings": {
        "properties": {
            "detection_date": { "type": "date" },
            "domain": { "type": "keyword" },
            "domain_name": { "type": "keyword" },
            "tld": { "type": "keyword" },
            "score": { "type": "double" }
        }
    }
}
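
A stray character in a template is easy to miss and only surfaces when the stack starts, so I find it worth parsing the JSON first. The sketch below is a hypothetical helper (`field_types` is my own name, not part of RELK) that loads a template and lists the declared field types; the template text is inlined here so the example is self-contained.

```python
import json
from typing import Dict

# Inlined copy of the index template, so the example runs on its own.
TEMPLATE_TEXT = """
{
    "index_patterns": [ "ptrx-domain-detection" ],
    "version": 20190315,
    "settings": { "index.refresh_interval": "5s" },
    "mappings": {
        "properties": {
            "detection_date": { "type": "date" },
            "domain": { "type": "keyword" },
            "domain_name": { "type": "keyword" },
            "tld": { "type": "keyword" },
            "score": { "type": "double" }
        }
    }
}
"""


def field_types(template_text: str) -> Dict[str, str]:
    """Map each field name to its declared type.

    json.loads raises json.JSONDecodeError if the template is not valid JSON.
    """
    template = json.loads(template_text)
    properties = template["mappings"]["properties"]
    return {name: spec["type"] for name, spec in properties.items()}


for name, type_name in field_types(TEMPLATE_TEXT).items():
    print(f"{name}: {type_name}")
```

To check the file on disk instead, read it with `open(...).read()` and pass the text to `field_types`.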
 

Starting RELK & Analyzing Data

Start RELK by simply going to the RELK directory and running $ docker-compose up

After a few minutes, all of the services will be up and running, and you will see the data flow through Logstash and show up in Kibana.

 

 

 

 

Kibana is accessible via localhost:5601 and Jupyter Lab via localhost:8888 (the password is 'research').
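
From a notebook, the indexed detections can also be queried directly. The following is a sketch under assumptions: it supposes Elasticsearch is reachable from the notebook at the same host name the Logstash output uses (relk-elasticsearch:9200), and `top_tlds_query` and `search` are hypothetical helper names. It uses only the standard library and builds a terms aggregation over the tld field for high-scoring detections.

```python
import json
import urllib.request
from typing import Any, Dict

# Assumption: the notebook can reach Elasticsearch at the host name used in
# the Logstash output; adjust if your setup differs.
ES_URL = "http://relk-elasticsearch:9200"
INDEX = "ptrx-domain-detection"


def top_tlds_query(min_score: float, size: int = 10) -> Dict[str, Any]:
    """Build a search body that counts detections per TLD above a score threshold."""
    return {
        "size": 0,  # return aggregation buckets only, no individual hits
        "query": {"range": {"score": {"gte": min_score}}},
        "aggs": {"top_tlds": {"terms": {"field": "tld", "size": size}}},
    }


def search(body: Dict[str, Any]) -> Dict[str, Any]:
    """POST a search body to the index and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Printing the body does not require a running cluster.
print(json.dumps(top_tlds_query(min_score=0.9), indent=2))
```

With the stack running, `search(top_tlds_query(0.9))["aggregations"]["top_tlds"]["buckets"]` should return the most frequent TLDs among detections scoring 0.9 or higher.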

[Screenshot: Kibana]

[Screenshot: Jupyter notebook]

 

Perhaps in the next blog post I will analyze the malicious domains, but for now I wish you success in your research endeavors. May your questions be answered (with data!).

If you have any issues or suggestions, or would like to help with RELK, please feel free to email me at kyle@patternex.com.

