Stylometric analysis to track anonymous users in the underground

Pierluigi Paganini January 10, 2013

Law enforcement and intelligence agencies, conscious of the high risks related to cyber threats, have started massive monitoring campaigns; everything must be controlled to avoid unpleasant surprises. The trend is shared by governments all over the planet, and intelligence agencies are making great investments in terms of money and resources to define new methods and develop new tools for monitoring social media.

One of the most interesting sources of information is underground forums, places in cyberspace where it is possible to discuss any kind of subject and to acquire or rent any kind of illegitimate software or service to conduct cyber attacks.

It’s clear that these forums, despite the anonymous participation of their users, represent a mine of information useful for any kind of investigation, but how can the anonymity of participants be bypassed?

According to an interesting study presented by researcher Sadia Afroz at the last edition of the Chaos Communication Congress in Germany, the 29C3, up to 80 percent of certain anonymous underground forum users can be identified using linguistics, a figure that is stunning in my opinion. Sadia is a member of the Drexel and George Mason universities research team that includes Aylin Caliskan Islam, Ariel Stolerman, Rachel Greenstadt, and Damon McCoy.


The method adopted by the researchers is based on comparing posts across forums and other social media, such as social networks or blogs.

The researchers stated that every user tends to adopt his own writing style throughout his internet activity, a peculiarity that makes him identifiable. The identification is possible thanks to the analysis of “function words”, words such as articles, prepositions, pronouns and conjunctions that serve to express grammatical relationships with other words within a sentence and are strongly related to the attitude or mood of the speaker.

“If our dataset contains 100 users we can at least identify 80 of them.” “Function words are very specific to the writer. Even if you are writing a thesis, you’ll probably use the same function words in chat messages.” “Even if your text is not clean, your writing style can give you away.”
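The idea of profiling a writer through function words can be illustrated with a minimal sketch. It is not the researchers' implementation: the short word list, the `profile`, `cosine` and `closest_author` helpers and the use of cosine similarity are all assumptions chosen for illustration, and real stylometric systems rely on far larger feature sets and trained classifiers.

```python
import math
import re
from collections import Counter

# Illustrative function-word list (an assumption; real studies use much longer lists).
FUNCTION_WORDS = ["the", "a", "an", "of", "in", "on", "to", "and", "but", "or",
                  "if", "that", "this", "it", "is", "was", "i", "you", "we", "not"]


def profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_author(known_texts, anonymous_text):
    """Return the known author whose profile is most similar to the anonymous text.

    `known_texts` maps a (hypothetical) author name to a sample of that author's writing.
    """
    target = profile(anonymous_text)
    return max(known_texts,
               key=lambda name: cosine(profile(known_texts[name]), target))
```

Given a dictionary of writing samples from known accounts, `closest_author(samples, post)` returns the account whose function-word usage is nearest to the anonymous post. With very short texts the frequencies are unreliable, which is consistent with the researchers needing a large amount of text per user.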

Digging into underground forums, it is possible to reveal the identity of a cyber criminal who tries to sell malicious code or of a terrorist who is trying to communicate within a hidden cell. The study demonstrates the enormous potential of the technique, which could also be used to characterize a single forum and its audience and to discover the relationships of its users with other underground communities, information that is very useful for investigators tracking the online places frequented by particular categories of individuals.

The project is very attractive; future development and improvements could make it possible to trace anonymous profiles. Researcher Aylin Caliskan Islam anticipated that future versions will include temporal information: according to Information Theory (Metzger, 2007), “timeliness or currency is one of the key 5 aspects that determine a document’s credibility besides relevance, accuracy, objectivity and coverage”.

The technique will make it possible to link the temporal information of a user’s posts with the IP addresses used for the connections; this information helps to locate the physical place used for internet access.

The researchers adopted authorship attribution techniques such as stylometric analysis, which is also used in forensic linguistics, and verified that the method can track users even against automated frameworks like Jstylo, which are used to protect users’ privacy and anonymity.

Another interesting tool is Anonymouth, an authorship recognition circumvention tool. It is based on the Jstylo framework and provides many interesting features, such as an interactive editor for evading authorship recognition that assists users in changing their text and writing style by using a dictionary and suggesting synonyms.

Jstylo was presented last year during the previous edition of the Chaos Communication Congress in Germany, the 28C3. It is able to obfuscate documents to protect the author’s identity from authorship analysis; in fact, one of the main problems for the researchers is detecting writing style deception.

The main methods of circumventing writing style analysis are:

  • Obfuscation – An author attempts to write a document in such a way that their personal writing style will not be recognized.

  • Imitation – An author attempts to write a document such that the writing style will be recognized as that of another specific author.

  • Translation – Machine translation is used to translate a document to one or more languages and then back to the original language.

The technique proposed during the 29C3 was tested across millions of posts from tens of thousands of users of a series of multilingual underground websites.

The following results were obtained with the method:

  • Up to 300 distinct discussion topics were discovered in the forums, related to various malicious activities such as password cracking and black hat SEO.
  • The technique could be applied only to authors with at least 5000 words of text, which limits the number of results.
  • To improve research on specific topics such as exploits and drugs, the method needs to separate product information from conversational data, so that machine learning can be used to automate the process.
  • The technique is more effective when posts are translated into English: the success rate rises from 66% to around 80%, but free translation tools like Google and Bing are not adequate for the purpose.
  • Leetspeak, an alternative alphabet popular in some forum circles, cannot be translated.
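The minimum text length mentioned in the results can be illustrated with a short sketch that pools each user's posts and keeps only the users with enough material. The `eligible_authors` helper, the `(user, text)` input format and the whitespace-based word count are assumptions made for illustration; the article does not say how the researchers counted words.

```python
from collections import defaultdict

MIN_WORDS = 5000  # minimum text length reported in the results above


def eligible_authors(posts, min_words=MIN_WORDS):
    """Pool posts per user and keep users with at least `min_words` words.

    `posts` is an iterable of (user, text) pairs (a hypothetical input format).
    Words are counted by splitting on whitespace, which is a simplification.
    """
    pooled = defaultdict(list)
    for user, text in posts:
        pooled[user].append(text)

    result = {}
    for user, texts in pooled.items():
        joined = " ".join(texts)
        if len(joined.split()) >= min_words:
            result[user] = joined
    return result
```

Only the users returned by `eligible_authors` would then be passed to the stylometric comparison, which explains why the threshold reduces the number of accounts that can be analyzed.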





The process is still in its first phase; once the research method has been shown to be useful and efficient, the next step will be to automate it. Researcher Sadia Afroz said:

“We want to automate the whole process.”

“We aren’t trying to identify users, we are trying to show them that this is possible,” she said.

Certainly the next few years will bring some interesting developments: the research, as declared, will include more user-specific features and temporal information, and the methods will be able to identify holders of multiple accounts by combining topic information with authorship data … but be sure that someone is already working in the opposite direction to protect their anonymity 😉

Pierluigi Paganini
