
SoReL-20M Sophos & ReversingLabs release 10 million disarmed samples for malware study


Sophos and ReversingLabs announced the release of SoReL-20M, a database containing 20 million Windows Portable Executable files, including 10 million malware samples.

The SoReL-20M database includes a set of curated and labeled samples, along with security-relevant metadata, that can serve as a training dataset for the machine learning engines used in anti-malware solutions.

The lack of large, well-formed training sets is a major obstacle to building machine learning models.

“The Sophos AI team is excited to announce the release of SOREL-20M (Sophos-ReversingLabs – 20 million) – a production-scale dataset containing metadata, labels, and features for 20 million Windows Portable Executable files, including 10 million disarmed malware samples available for download for the purpose of research on feature extraction to drive industry-wide improvements in security,” reads the post published by Sophos.

SOREL-20M is the first production-scale malware research dataset made publicly available, released with the intent of accelerating research on malware detection via machine learning.

The experts pointed out that large sets of curated, labeled samples are expensive and difficult to obtain. Most work on malware detection relies on private, internal datasets that cannot be shared, so the results cannot be directly compared with one another.

“Unlike image recognition or natural language processing, the area of security has seen much less activity and a relatively slower rate of improvement. A major reason for this is simply the lack of a standard, large-scale, realistic data set that can be easily obtained and tested by a wide range of users, from independent researchers to academic labs to large corporate groups,” continues Sophos.

For each sample, the dataset contains features extracted following the EMBER 2.0 dataset, along with labels and detection metadata; complete binaries are provided for the disarmed malware samples.

The experts also released a set of PyTorch (https://pytorch.org/) and LightGBM (https://github.com/Microsoft/LightGBM) models pre-trained on this dataset. Sophos also released scripts to load and iterate over the data, as well as to load, train, and test the models.
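To give a feel for what "loading and iterating over the data" can involve, here is a minimal sketch. It is not one of the scripts released by Sophos. It assumes, purely for illustration, that sample metadata sits in a SQLite table called `meta` with the hypothetical columns `sha256` and `is_malware`; the real dataset's layout may differ, and the rows below are made up.

```python
import sqlite3
from typing import Dict, Iterator, List, Tuple

Row = Tuple[str, int]  # (sha256, is_malware) -- hypothetical columns


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory table filled with made-up rows (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE meta (sha256 TEXT PRIMARY KEY, is_malware INTEGER)"
    )
    # Fake 64-character hex "hashes" with alternating benign/malware labels.
    rows = [(f"{i:064x}", i % 2) for i in range(10)]
    conn.executemany("INSERT INTO meta VALUES (?, ?)", rows)
    conn.commit()
    return conn


def iter_batches(conn: sqlite3.Connection, batch_size: int) -> Iterator[List[Row]]:
    """Yield rows in fixed-size batches so the full table never sits in memory."""
    cur = conn.execute("SELECT sha256, is_malware FROM meta ORDER BY sha256")
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch


def count_labels(conn: sqlite3.Connection, batch_size: int = 4) -> Dict[int, int]:
    """Count benign (0) and malware (1) rows by streaming over the batches."""
    counts = {0: 0, 1: 0}
    for batch in iter_batches(conn, batch_size):
        for _sha256, label in batch:
            counts[label] += 1
    return counts


if __name__ == "__main__":
    conn = build_toy_db()
    print(count_labels(conn))
```

Streaming in batches matters at this scale: with 20 million records, a training loop would typically pull one batch at a time rather than load everything up front.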

However, the public availability of training sets like SoReL-20M could also benefit sophisticated attackers, who could use them to create new threats. Sophos pointed out that well-resourced attackers could already have access to easy-to-use and cost-effective malware datasets.

For this reason, it is essential to give security researchers this dataset and help them build a new generation of malware detection tools, aided by the metadata released alongside the samples.

“That said, while the introduction of machine learning technologies represents a significant leap forward for threat detection at scale, these systems are only as good as the datasets they have access to,” states the announcement published by ReversingLabs.

“All this data gives our customers a well defined dataset of threat intelligence to leverage in their defenses, and as part of their threat hunting programs, to both block active attacks and search for threats that may otherwise be invisible to the traditional security stack.”


Pierluigi Paganini

(SecurityAffairs – hacking, SoReL-20M)


