On 2016 I was working hard to find a way to classify Malware families through artificial intelligence (machine learning). One of the first difficulties I met was on finding classified testing set in order to run new algorithms and to test specified features. So, I came up with this blog post and this GitHub repository where I proposed a new testing-set based on a modified version of Malware Instruction Set for Behavior-Based Analysis, also referred as MIST. Since that day I received hundreds of emails from students, researchers and practitioners all around the world asking me questions about how to follow up that research and how to contribute to expanding the training set.
I am so glad that many international researches used my classified Malware dataset as building block for making great analyses and for improving the state of the art on Malware research. Some of them are listed here, but many others papers, articles and researches have been released (just ask to Google).
Today I finally had chance to follow-it-up by adding a scripting section which would be useful to: (i) generate the modified version of MIST files (the one in training sets) and to (ii) convert the obtained results to ARFF (Attribute Relation File Format) by University of Waikato. The first script named
mist_json.py is a reporting module that could be integrated into a running CuckooSandBox environment. It is able to take the cuckoo report and convert it into a modified version of MIST file. To do that, drop
mist_json.py into your running instance of CuckooSandbox V1 (
/) and add the specific configuration section into
reporting.conf. You might decide to force its execution without configuration by editing directly the source code. The result would be a MIST file for each Cuckoo analysed sample. The MIST file wraps out the generated features as described into the original post here. By using the second script named
fromMongoToARFF.py you can convert your JSON object into ARFF which would be very useful to be imported into WEKA for testing your favorite algorithms.
Now, if you wish you are able to generate training sets by yourself and to test new algorithms directly into WEKA. The creation process follows those steps:
If you want to share with the community your new MIST classified files please feel free to make pull requests directly on GitHub. Everybody is using this set will appreciate it.
The original post along many other interesting analysis are available on the Marco Ramilli blog:
About the author: Marco Ramilli, Founder of Yoroi
I am a computer security scientist with an intensive hacking background. I do have a MD in computer engineering and a PhD on computer security from University of Bologna. During my PhD program I worked for US Government (@ National Institute of Standards and Technology, Security Division) where I did intensive researches in Malware evasion techniques and penetration testing of electronic voting systems.
I do have experience on security testing since I have been performing penetration testing on several US electronic voting systems. I’ve also been encharged of testing uVote voting system from the Italian Minister of homeland security. I met Palantir Technologies where I was introduced to the Intelligence Ecosystem. I decided to amplify my cybersecurity experiences by diving into SCADA security issues with some of the biggest industrial aglomerates in Italy. I finally decided to found Yoroi: an innovative Managed Cyber Security Service Provider developing some of the most amazing cybersecurity defence center I’ve ever experienced! Now I technically lead Yoroi defending our customers strongly believing in: Defence Belongs To Humans