SoReL-20M: Sophos & ReversingLabs release 10 million disarmed samples for malware study

Sophos and ReversingLabs released SoReL-20M, a database containing 20 million Windows Portable Executable files, including 10M malware samples.

The SoReL-20M database includes a set of curated and labeled samples and security-relevant metadata that could be used as a training dataset for a machine learning engine used in anti-malware solutions.

The scarcity of large, well-formed training sets is a major obstacle to implementing machine learning models.

“The Sophos AI team is excited to announce the release of SOREL-20M (Sophos-ReversingLabs – 20 million) – a production-scale dataset containing metadata, labels, and features for 20 million Windows Portable Executable files, including 10 million disarmed malware samples available for download for the purpose of research on feature extraction to drive industry-wide improvements in security,” reads the post published by Sophos.

SOREL-20M is the first publicly available production-scale malware research dataset, released with the intent to accelerate research on malware detection via machine learning.

The experts pointed out that a large set of curated and labeled samples is very expensive and difficult to obtain. Most work on malware detection is based on private, internal datasets that cannot be shared, so the results cannot be directly compared with each other.

“Unlike image recognition or natural language processing, the area of security has seen much less activity and a relatively slower rate of improvement. A major reason for this is simply the lack of a standard, large-scale, realistic data set that can be easily obtained and tested by a wide range of users, from independent researchers to academic labs to large corporate groups,” continues Sophos.

The dataset contains features extracted for each sample based on the EMBER 2.0 dataset, along with labels, detection metadata, and complete binaries for the disarmed malware samples.

The experts also released a set of pre-trained PyTorch (https://pytorch.org/) and LightGBM (https://github.com/Microsoft/LightGBM) models trained on this dataset. Sophos also released scripts to load and iterate over the data, as well as to load, train, and test the models.
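As a rough illustration of what loading and iterating over labeled samples can look like, the sketch below reads (hash, label) rows in batches from an in-memory SQLite table. It is not the Sophos tooling: the table name `meta`, the columns `sha256` and `is_malware`, and all helper names are hypothetical stand-ins chosen for this example, and the rows are synthetic.

```python
import sqlite3
from typing import Dict, Iterator, List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table of synthetic rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE meta (sha256 TEXT PRIMARY KEY, is_malware INTEGER)"
    )
    # Ten fake 64-character hex "hashes" with alternating benign/malware labels.
    rows = [(f"{i:064x}", i % 2) for i in range(10)]
    conn.executemany("INSERT INTO meta VALUES (?, ?)", rows)
    conn.commit()
    return conn


def iter_batches(
    conn: sqlite3.Connection, batch_size: int
) -> Iterator[List[Tuple[str, int]]]:
    """Yield lists of (sha256, is_malware) rows, batch_size rows at a time."""
    cur = conn.execute("SELECT sha256, is_malware FROM meta ORDER BY sha256")
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch


def count_labels(conn: sqlite3.Connection, batch_size: int = 4) -> Dict[int, int]:
    """Count benign (0) and malware (1) rows by streaming over batches."""
    counts = {0: 0, 1: 0}
    for batch in iter_batches(conn, batch_size):
        for _sha256, label in batch:
            counts[label] += 1
    return counts
```

Streaming in fixed-size batches keeps memory use flat regardless of how many rows the table holds, which matters at the scale of millions of samples. A training loop would consume `iter_batches` in the same way, looking up the feature vector for each hash before passing the batch to a model.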

The public availability of training sets like SoReL-20M could also benefit sophisticated attackers, who could use them to create new threats. However, Sophos pointed out that well-resourced attackers could already have access to easy-to-use and cost-effective malware datasets.

For this reason, it is essential to give security researchers this dataset and help them build a new generation of tools that could be effective at malware detection thanks to the metadata released alongside the samples.

“That said, while the introduction of machine learning technologies represents a significant leap forward for threat detection at scale, these systems are only as good as the datasets they have access to,” states the announcement published by ReversingLabs.

“All this data gives our customers a well defined dataset of threat intelligence to leverage in their defenses, and as part of their threat hunting programs, to both block active attacks and search for threats that may otherwise be invisible to the traditional security stack.”

Pierluigi Paganini

(SecurityAffairs – hacking, SoReL-20M)
