After 1 Million Malware Samples Analyzed

Pierluigi Paganini November 25, 2019

Malware Hunter – One year after its launch, Marco Ramilli shared the results of his project, which has analyzed more than 1 million malware samples.

Malware Hunter – One year ago I decided to invest in static malware analysis automation by setting up a full-stack environment able to grab samples from common open sources and to process them using Yara rules. My main goal was to offer the cybersecurity community a public dataset of pre-processed analyses and to make them “searchable” through PAPI and a simple visualization UI available HERE. Today, after one year, it is time to check how the project has grown and to review its running balance.

How it works

Malware Hunter is a Python-powered project driven by three main components: collectors, processors and a public API. The collectors take samples from publicly available sources and place them in a local queue, where they wait to be processed. The processors are multiple single Python processes running in a distributed environment; they pull samples from the common queue, process them and save the whole processed result set back to a MongoDB instance. Unfortunately I cannot store all the analyzed samples, because the storage cost would rise quite quickly, so I mainly collect only “reports”, that is, what the processor has analyzed. The public API (if you want access to it, please write to me) is used to query MongoDB and to submit new samples. Finally, a simple user interface provides some statistics over time.

A periodic task looks for specific matched Yara rules that stand for APTs and populates a specific view available HERE. In other words, if it finds a match on a well-known Yara rule built to catch Advanced Persistent Threats, it collects the calculated hash and, through a dedicated API, visualizes stats and matching signatures of all the potential APTs found.

Everything runs without human interaction, which is both good and bad. It is good because the system can scale up very quickly (depending on the power acquired on the VPS); on the other hand, with no human interaction and depending on how the Yara rules are implemented, the system can produce many false positives. But what is interesting to me is having a base set of samples to start from. Without such a tool, how would you start finding specific threats? In that case I can start from the samples that matched the signature related to the specific threat I want to follow, rather than starting randomly.
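The collector, queue and processor flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: `toy_matcher` is a hypothetical stand-in for a compiled Yara rule set, the rule name `IsPE` is made up, and a plain dictionary plays the role of the MongoDB collection.

```python
import hashlib
import queue
from typing import Callable, Dict, List

# Hypothetical stand-in for a compiled Yara rule set: bytes in, rule names out.
Matcher = Callable[[bytes], List[str]]


def toy_matcher(data: bytes) -> List[str]:
    """Illustrative matcher only; a real processor would run Yara rules here."""
    rules = []
    if data.startswith(b"MZ"):
        rules.append("IsPE")  # hypothetical rule name
    if b"IsDebuggerPresent" in data:
        rules.append("isDebuggerPresent")
    return rules


def collect(samples: List[bytes], work_queue: "queue.Queue[bytes]") -> None:
    """Collector: place fetched samples on the local queue."""
    for sample in samples:
        work_queue.put(sample)


def process(work_queue: "queue.Queue[bytes]", matcher: Matcher,
            reports: Dict[str, dict]) -> None:
    """Processor: drain the queue and keep one report per sample."""
    while True:
        try:
            sample = work_queue.get_nowait()
        except queue.Empty:
            break
        sha256 = hashlib.sha256(sample).hexdigest()
        # Only the report is stored; the sample bytes themselves are discarded.
        reports[sha256] = {
            "sha256": sha256,
            "size": len(sample),
            "rules": matcher(sample),
        }


work_queue: "queue.Queue[bytes]" = queue.Queue()
reports: Dict[str, dict] = {}  # stands in for the MongoDB collection
collect([b"MZ\x90\x00 IsDebuggerPresent", b"<html></html>"], work_queue)
process(work_queue, toy_matcher, reports)
```

Keying the reports by hash keeps the store small and makes later lookups by sample hash trivial, which fits the decision to keep reports rather than the samples themselves.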


After almost one year of fully automated static analysis through Yara rules, Malware Hunter has analyzed more than one million samples, distributed as follows.

Malware Analyses Distribution

It looks like in April 2019 the engine extracted and analyzed a small set of samples compared to the general trend, while in late August and early September it analyzed more than 250k samples. It is interesting to see a significant increase in analyses at the “end of the year” compared to the analyses performed at the beginning of the same year. Since the malware collectors pulled from the same sources throughout the past year, and the engine analyzes only specific file types (such as PE and Office files), and assuming the sample sources had no outages, it can mean that:

  • More out-of-scope samples (for example HTML, JavaScript, VBA, etc.) were spread in the time frame between April and June, or
  • Malware flow is subject to cyclical trends that depend on multiple factors, including political influences and fiscal years.

Looking at the most frequently matched Yara rules, it is nice to see that most of the analyzed samples were executable files. Many of them (almost 400k) hid a compressed and/or encrypted PE file inside themselves.

TOP Matched Rules

Many Yara matches highlight a high presence of anti-debugging techniques, for example DebuggerTiming_Ticks, DebuggerPatterns_SEH_Inits, Debugger_Checks, isDebuggerPresent and so on. Considered together with Create_Process, Embedded_PE and Win_File_Operations, they lead the analyst to think that modern malware is heavily obfuscated and weaponized against debuggers. From signatures such as keyloggers and screenshots, it is clear that most of today's malware records our keyboard activity and wants to spy on us by taking periodic screenshots. The presence of HTTP and TCP rules underlines the way new malware keeps getting online, either to download shellcode (signature shellcode) or to ask to be controlled by a C2 system (such as a server). Many samples appear to open a local communication port, which often hides a local proxy used to encrypt communication between the malware and its command and control. Crafted mutexes are very common among malware developers; they are used to delay or to manage multi-infection processes.
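A ranking like the one discussed above can be derived from the stored reports with a simple tally. The sketch below assumes each report carries a `rules` list of matched Yara rule names; that field name and the sample data are assumptions made for illustration, not the project's actual schema.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_matched_rules(reports: Iterable[dict], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n rule names that matched the largest number of samples."""
    counts: Counter = Counter()
    for report in reports:
        # set() counts a rule once per sample, even if it fired several times.
        counts.update(set(report.get("rules", [])))
    return counts.most_common(n)


# Made-up reports, only to show the shape of the input and output.
sample_reports = [
    {"rules": ["Embedded_PE", "isDebuggerPresent"]},
    {"rules": ["Embedded_PE", "Create_Process"]},
    {"rules": ["Embedded_PE", "isDebuggerPresent", "isDebuggerPresent"]},
]
top = top_matched_rules(sample_reports, n=2)
print(top)  # [('Embedded_PE', 3), ('isDebuggerPresent', 2)]
```

Counting per sample rather than per hit avoids inflating rules that fire many times inside a single file.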

Equation Group Signature Matches

Another interesting observation comes from the way the Equation Group toolset matches.

The Equation Group, classified as an advanced persistent threat, is a highly sophisticated threat actor suspected of being tied to the Tailored Access Operations (TAO) unit of the United States National Security Agency (NSA). Kaspersky Labs describes them as one of the most sophisticated cyber attack groups in the world and “the most advanced … we have seen”, operating alongside but always from a position of superiority with the creators of Stuxnet and Flame.

From Wikipedia

Many EquationGroup_toolset signatures matched during the most distinctive detection time frames (at the beginning and at the end of the year), which tells us that those well-known tools (August 2016) are still up and running and heavily reused across samples. The ShadowBrokers released the code in August 2016; since then many pieces of malware have adopted it, and it still appears to be current in many executable samples.

The slow but interesting “potential APT detection” page (available HERE) shows “live” stats (updated every 24h) on APT matches over the 1 million analyzed samples. Dragonfly (also known as Energetic Bear) is what Malware Hunter matched most, followed by Regin. According to Malpedia, Dragonfly is a Russian group that collects intelligence on the energy industry. According to Kaspersky Lab’s findings, the Regin APT campaign targets telecom operators, government institutions, multi-national political bodies, financial and research institutions, and individuals involved in advanced mathematics and cryptography. The attackers seem to be primarily interested in gathering intelligence and facilitating other types of attacks.

Most APT Signature Matches

Many Ursnif/Gozi samples were detected during the past year. Ursnif/Gozi is a rather (in)famous banking trojan that mostly targets the UK and Italy. TrendMicro attributed it to the cybercrime group TA-505 in late 2018 after spotting common evidence between Ursnif/Gozi and TA-505 banking trojans such as Dridex and the loader Emotet. It is interesting to note that quite old rules related to Putter Panda hit on some samples (for example: 1b1c4bc8d5f32b429eac590ec94b1a0780eaf863db99674decb6b6bd9abdf979 and ef046640438ab22d0168017aa75f7137f7a94e30e9f2f16cd65596d0a95a75d2...). Putter Panda is a Chinese threat group that has been attributed to Unit 61486 of the 12th Bureau of the PLA’s 3rd General Staff Department (GSD). The analysis of these results could go further; if you are interested in more details, please do not hesitate to contact me or to use the search field on that page.
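The periodic task that feeds the “potential APT detection” view can be pictured as a grouping step over the stored reports. The following is a minimal sketch under assumptions: only `EquationGroup_toolset` is a rule name taken from this article, while `Dragonfly_APT`, `Regin_APT`, the report fields and the short placeholder hashes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Rule name -> APT label. Apart from EquationGroup_toolset, the rule names
# here are hypothetical and exist only for illustration.
APT_RULES: Dict[str, str] = {
    "EquationGroup_toolset": "Equation Group",
    "Dragonfly_APT": "Dragonfly",
    "Regin_APT": "Regin",
}


def build_apt_view(reports: Iterable[dict],
                   apt_rules: Dict[str, str] = APT_RULES) -> Dict[str, List[str]]:
    """Group sample hashes by the APT label of the rules they matched."""
    view: Dict[str, List[str]] = defaultdict(list)
    for report in reports:
        labels = {apt_rules[r] for r in report.get("rules", []) if r in apt_rules}
        for label in sorted(labels):
            view[label].append(report["sha256"])
    return dict(view)


# Placeholder hashes, not real samples.
stored_reports = [
    {"sha256": "aaa", "rules": ["EquationGroup_toolset", "Embedded_PE"]},
    {"sha256": "bbb", "rules": ["Dragonfly_APT"]},
    {"sha256": "ccc", "rules": ["Embedded_PE"]},
    {"sha256": "ddd", "rules": ["Dragonfly_APT", "Regin_APT"]},
]
apt_view = build_apt_view(stored_reports)
stats = {label: len(hashes) for label, hashes in apt_view.items()}
```

Because the grouping relies purely on rule names, it inherits any false positives in the underlying rules, so each hit is best treated as a starting point for manual analysis.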

Used Infrastructure

While the scrapers and the workers run on remote and domestic PCs, the PAPI server hosts both the Public Application Program Interface and the searching scripts (the ones used to match and to alert on specific API matches). The following graphs show the VPS usage at a glance.

Server Usage

The 4 CPUs run at 100% most of the time. They are used to process Yara rules, to build database views, to filter out unwanted samples (for example HTML, JavaScript and so on), to search and alert on interesting samples, and to periodically enrich pre-calculated reports with additional information over time. Disk is mostly used to store temporary files in separate queues before they are processed. The MongoDB instance in use is not hosted on the same machine. The network graph tracks the balance between bytes sent and bytes received. Almost 2.0 Mbps is the lower bound for incoming traffic, while 300 Kbps is the outbound average. This means the collectors are grabbing a good number of new samples per day from publicly available sources and pushing them onto the central queue. PAPI usage, on the other hand, accounts for a lower outbound rate. This makes sense, since the PAPI JSON result for a single request is far lighter than the sample the request refers to.

I hope you enjoy the tool. Feel free to search samples and to use them to classify your TI, and if you need PAPI, let me know. I plan to keep it running unless the cost increases too much for me.

About the author: Marco Ramilli, Founder of Yoroi

I have experience in security testing, having performed penetration testing on several US electronic voting systems. I was also put in charge of testing the uVote voting system from the Italian Minister of homeland security. I met Palantir Technologies, where I was introduced to the Intelligence Ecosystem. I decided to broaden my cyber security experience by diving into SCADA security issues with some of the biggest industrial conglomerates in Italy. I finally decided to found Yoroi: an innovative Managed Cyber Security Service Provider developing some of the most amazing cyber security defence centers I have ever experienced! Now I technically lead Yoroi, defending our customers and strongly believing in: Defence Belongs To Humans

Pierluigi Paganini

(SecurityAffairs – malware hunter, hacking)
