Machine Learning in the context of SIEM
In this article we look at how machine learning was used to help automate a task in cybersecurity.
To tell whether activity on a system comes from an outside attacker, we need to track what is happening on that system. The system records every action it performs in logs, so analysing the logs tells us what happened. The problem is that a computer performs so many actions, and therefore generates so many logs, that tracking them all by hand becomes impossible. Logs are also hard on the human eye, which makes analysing them an exhausting task.
To make tracking easier, the SIEM (Security Information and Event Management, a service that manages the logs) condenses the information in a log into a shorter piece of text called a SourceType, which summarises the log's data structure. However, a human still has to interpret each SourceType to decide whether it comes from suspicious activity.
The MITRE ATT&CK framework maps the techniques used by hackers (threat groups) to DataCategories. So if we knew the DataCategory of a log, we could say whether an ATT&CK technique might be involved in that log.
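To make the idea concrete, here is a minimal sketch of such a lookup. The table below is hypothetical: its entries were picked for illustration and are not an export of any particular ATT&CK release, and the names `TECHNIQUES_BY_CATEGORY` and `possible_techniques` are our own.

```python
# Hypothetical lookup table: DataCategory -> ATT&CK techniques that could be
# observed through it. Illustrative entries only, not an official ATT&CK export.
TECHNIQUES_BY_CATEGORY = {
    "Authentication logs": ["Brute Force", "Valid Accounts"],
    "Process monitoring": ["Command and Scripting Interpreter"],
}


def possible_techniques(data_category):
    """Return the techniques that a log of this DataCategory could reveal."""
    return TECHNIQUES_BY_CATEGORY.get(data_category, [])


print(possible_techniques("Authentication logs"))
```

An unknown DataCategory simply returns an empty list, which is the "no known technique" case.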
The problem we set out to solve here at CT Defense was mapping SourceTypes to MITRE DataCategories. One approach was to apply NLP techniques and train a neural network. First, the data (SourceTypes and DataCategories) had to be cleaned of special characters, leaving only the words and the spaces between them. Then the labels, in this case the DataCategories, had to be converted to numbers. Spark’s label indexer did the job.
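The two preparation steps can be sketched in plain Python. This is not the Spark code we ran, only an illustration under assumptions: `clean` keeps letters, digits and spaces (the exact character set is our guess), `index_labels` gives the most frequent label the index 0 (similar in spirit to Spark's StringIndexer), and the sample SourceTypes and DataCategories are made up.

```python
import re
from collections import Counter


def clean(text):
    """Replace anything that is not a letter, digit or space with a space,
    then collapse repeated spaces."""
    return " ".join(re.sub(r"[^A-Za-z0-9 ]+", " ", text).split())


def index_labels(labels):
    """Map each distinct label to an integer, most frequent label first.
    Ties are broken alphabetically so the result is deterministic."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# Made-up sample rows: (SourceType, DataCategory)
rows = [
    ("WinEventLog:Security", "Authentication logs"),
    ("linux_secure", "Authentication logs"),
    ("cisco:asa", "Netflow"),
]
label_index = index_labels([category for _, category in rows])
prepared = [(clean(source_type), label_index[category]) for source_type, category in rows]
print(prepared)
```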
For the next step, keep in mind that a computer sees a string as a sequence of characters, not as words separated by spaces. We have to transform the string so that the computer treats the text between spaces as words. In NLP (natural language processing) this transformation is called tokenization. SparkML’s tokenizer handled it in our case.
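In plain Python, a whitespace tokenizer takes only a couple of lines. The sketch below is a stand-in for the Spark tokenizer, not the code we ran; lowercasing the text first is an assumption made here so that "Security" and "security" count as the same word.

```python
def tokenize(text):
    """Lowercase the text and split it into words on whitespace."""
    return text.lower().split()


print(tokenize("WinEventLog Security"))
```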
Next, to predict which DataCategory a SourceType belongs to, it helps to know which words in a SourceType are relevant to which DataCategories. TF-IDF (term frequency-inverse document frequency) gives us a measure of that relevance: a word scores highly when it appears often in one SourceType but rarely across the others.
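Here is a small sketch of how such scores can be computed. It uses raw term counts and the smoothed inverse document frequency log((N + 1) / (df + 1)), which is one common variant; this article does not state which exact formula was used, so treat that choice and the toy documents as assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score every term in every tokenized document.

    tf  = number of times the term occurs in the document
    idf = log((N + 1) / (df + 1)), where N is the number of documents and
          df is the number of documents that contain the term
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in documents
    ]


# Toy tokenized SourceTypes (made up for illustration)
docs = [["cisco", "asa"], ["cisco", "ios"], ["linux", "secure"]]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "asa" scores higher than "cisco" because "cisco" also appears in another document and is therefore less distinctive.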
Now, with the words and their TF-IDF scores as features, we can build a neural network. In our case it was a multilayer perceptron with one hidden layer. Its predictions on unseen SourceTypes were 62% accurate. We expect the accuracy to improve as the network is trained on more SourceTypes.
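To show what a multilayer perceptron with one hidden layer looks like, here is a tiny one written from scratch. It is a teaching sketch, not the model behind the 62% figure: the sigmoid hidden layer, softmax output, hidden size, learning rate, number of epochs and the toy feature vectors are all assumptions, and `TinyMLP` is a name invented for this example.

```python
import math
import random


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class TinyMLP:
    """One sigmoid hidden layer followed by a softmax output layer."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        """Return the hidden activations and the class probabilities."""
        hidden = [
            1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
            for row, b in zip(self.w1, self.b1)
        ]
        logits = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(self.w2, self.b2)]
        return hidden, softmax(logits)

    def train_step(self, x, label, lr=0.5):
        """One stochastic gradient descent step on the cross-entropy loss."""
        hidden, probs = self.forward(x)
        # Gradient w.r.t. the output scores: probabilities minus the one-hot label
        d_out = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
        # Backpropagate to the hidden layer before the output weights change
        d_hidden = [
            h * (1.0 - h) * sum(d_out[k] * self.w2[k][j] for k in range(len(d_out)))
            for j, h in enumerate(hidden)
        ]
        for k, d in enumerate(d_out):
            for j, h in enumerate(hidden):
                self.w2[k][j] -= lr * d * h
            self.b2[k] -= lr * d
        for j, d in enumerate(d_hidden):
            for i, v in enumerate(x):
                self.w1[j][i] -= lr * d * v
            self.b1[j] -= lr * d

    def predict(self, x):
        """Return the index of the most probable class."""
        probs = self.forward(x)[1]
        return probs.index(max(probs))


# Toy TF-IDF-like feature vectors and their label indices (made up)
X = [[1.0, 0.0, 0.5], [1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 1.0, 0.0]]
y = [0, 0, 1, 1]

model = TinyMLP(n_in=3, n_hidden=4, n_out=2)
for _ in range(2000):
    for features, label in zip(X, y):
        model.train_step(features, label)

print([model.predict(features) for features in X])
```

A real setup would also hold back some SourceTypes for testing, since accuracy measured on the training data says little about unseen ones.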
In conclusion, to find out whether somebody has hacked us we need to track our logs. Logs are easier to understand and use once they are abstracted into SourceTypes. If we map SourceTypes to MITRE DataCategories, we can tell whether there may be suspicious activity on a system. We used a neural network to do the mapping and obtained 62% accuracy on unseen SourceTypes.
Cybersecurity Engineer & Data Scientist