Malware

SoReL-20M Sophos & ReversingLabs release 10 million disarmed samples for malware study

Sophos and ReversingLabs released SoReL-20M, a database containing 20 million Windows Portable Executable files, including 10M malware samples.

Sophos and ReversingLabs announced the release of SoReL-20M, a database containing 20 million Windows Portable Executable files, including 10 million malware samples.

The SoReL-20M database includes a set of curated and labeled samples and security-relevant metadata that could be used as a training dataset for a machine learning engine used in anti-malware solutions.

The availability of large and well-formed training sets is a major problem for the implementation of machine learning models.

“The Sophos AI team is excited to announce the release of SOREL-20M (Sophos-ReversingLabs – 20 million) – a production-scale dataset containing metadata, labels, and features for 20 million Windows Portable Executable files, including 10 million disarmed malware samples available for download for the purpose of research on feature extraction to drive industry-wide improvements in security.” reads the post published by Sophos.

SOREL-20M is the first production-scale malware research dataset publicly available released with the intent to accelerate research for malware detection via machine learning.

The experts pointed out that a large number of curated and labeled samples is very expensive and difficult to obtain. The majority of works on malware detection is based on private, internal datasets that could not be shared and that for this reason produce results that cannot be directly compared to each other.

“Unlike image recognition or natural language processing, the area of security has seen much less activity and a relatively slower rate of improvement.  A major reason for this is simply the lack of a standard, large-scale, realistic data set that can be easily obtained and tested by a wide range of users, from independent researchers to academic labs to large corporate groups.” continues Sophos.

The dataset contains features for each malware that have been extracted based on the EMBER 2.0 dataset, labels, detection metadata, and complete binaries,

Experts also released a set of pre-trained PyTorch (https://pytorch.org/) models and LightGBM (https://github.com/Microsoft/LightGBM) models trained on this dataset. Sophos also released scripts that allow to load and iterate over the data, as well as to load, train, and test the models.

Anyway the public availability of training sets like SoReL-20M could also advantage sophisticated attackers that could use them to create new threats but Sophos pointed out that well-resourced attackers could already have access to easy to use and coste effective malware datasets.

For this reason, is essential to give security researchers this dataset and help them to build a new generation of tools that could be effective for malware detection thanks to metadata released alongside the samples.

“That said, while the introduction of machine learning technologies represents a significant leap forward for threat detection at scale, these systems are only as good as the datasets they have access to.” states the announcement published by Reversinglabs.

“All this data gives our customers a well defined dataset of threat intelligence to leverage in their defenses, and as part of their threat hunting programs, to both block active attacks and search for threats that may otherwise be invisible to the traditional security stack.”

[adrotate banner=”9″][adrotate banner=”12″]

Pierluigi Paganini

(SecurityAffairs – hacking, SoReL-20M)

[adrotate banner=”5″]

[adrotate banner=”13″]

Pierluigi Paganini

Pierluigi Paganini is member of the ENISA (European Union Agency for Network and Information Security) Threat Landscape Stakeholder Group and Cyber G7 Group, he is also a Security Evangelist, Security Analyst and Freelance Writer. Editor-in-Chief at "Cyber Defense Magazine", Pierluigi is a cyber security expert with over 20 years experience in the field, he is Certified Ethical Hacker at EC Council in London. The passion for writing and a strong belief that security is founded on sharing and awareness led Pierluigi to find the security blog "Security Affairs" recently named a Top National Security Resource for US. Pierluigi is a member of the "The Hacker News" team and he is a writer for some major publications in the field such as Cyber War Zone, ICTTF, Infosec Island, Infosec Institute, The Hacker News Magazine and for many other Security magazines. Author of the Books "The Deep Dark Web" and “Digital Virtual Currency and Bitcoin”.

Recent Posts

Fintech firm Figure disclosed data breach after employee phishing attack

Fintech firm Figure confirmed a data breach after hackers used social engineering to trick an…

19 hours ago

U.S. CISA adds a flaw in BeyondTrust RS and PRA to its Known Exploited Vulnerabilities catalog

The U.S. Cybersecurity and Infrastructure Security Agency (CISA) adds a flaw in BeyondTrust RS and…

21 hours ago

Suspected Russian hackers deploy CANFAIL malware against Ukraine

A new alleged Russia-linked APT group targeted Ukrainian defense, government, and energy groups, with CANFAIL…

1 day ago

New threat actor UAT-9921 deploys VoidLink against enterprise sectors

A new threat actor, UAT-9921, uses the modular VoidLink framework to target technology and financial…

2 days ago

Attackers exploit BeyondTrust CVE-2026-1731 within hours of PoC release

Attackers quickly targeted BeyondTrust flaw CVE-2026-1731 after a PoC was released, enabling unauthenticated remote code…

2 days ago

Google: state-backed hackers exploit Gemini AI for cyber recon and attacks

Google says nation-state actors used Gemini AI for reconnaissance and attack support in cyber operations.…

2 days ago

This website uses cookies.