How hackers use robots.txt to harvest information

Pierluigi Paganini May 19, 2015

The penetration tester Thiébaud Weksteen has published an interesting analysis that explains the importance of robots.txt files for hacking activities.

Thiébaud Weksteen, a penetration tester from Melbourne, is warning system administrators that robots.txt can give hackers precious details when it comes to attacks, because robots.txt has the capability to tell search engines which directories can and cannot be crawled on a web server.

Weksteen, a security expert at Securus Global, explained that by including such paths in the robots.txt file, system administrators give hackers clues about where sensitive assets and information are stored. The analysis of robots.txt can help an intruder target the attack instead of striking blindly.

“In the simplest cases, it (robots.txt) will reveal restricted paths and the technology used by your servers,” Weksteen explains. “From a defender perspective, two common fallacies remain: that robots.txt somehow acts as an access control mechanism [and that] content will only be read by search engines and not by humans.”
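As a purely illustrative example (the paths below are invented, not taken from Weksteen's research), a robots.txt file like the following immediately points an attacker at an administrative interface, a backup directory, and a PHP back end:

User-agent: *
Disallow: /admin/
Disallow: /old-backup/
Disallow: /cgi-bin/login.php

Every Disallow line is, in effect, a signpost to content the site owner considers worth hiding.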

In the same way that portal administrators often do not invest enough time in patching vulnerabilities, robots.txt files regularly betray a lack of care, routinely listing in their Disallow directives exactly the content their owners want to keep out of sight.


Reconnaissance is an essential activity for a penetration tester; by harvesting robots.txt files it is possible to gather precious information about sensitive assets.

“During the reconnaissance stage of a web application testing, the tester usually uses a list of known subdirectories to brute force the server and find hidden resources.

Depending on the uptake of certain web technologies, it needs to be refreshed on a regular basis. As you may see, the directive disallow gives an attacker precious knowledge on what may be worth looking at. Additionally, if that is true for one site, it is worth checking for another. ”
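A rough sketch of how this harvesting step could be automated is shown below; this is an assumed illustration in Python, not Weksteen's actual tooling, and example.com is only a placeholder host.

import urllib.request

def fetch_disallowed_paths(base_url):
    """Download a site's robots.txt and return the paths listed in Disallow directives."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/robots.txt", timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    paths = []
    for line in body.splitlines():
        # Drop comments and surrounding whitespace, keep only Disallow directives
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

# Example usage against a placeholder host
print(fetch_disallowed_paths("http://example.com"))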

This is done by using a crawling technique; as Weksteen reports, of a total of 59,558 crawled websites, 59,436 returned a response.

“This is a good indicator of the freshness of Common Crawl results. Of these, 37,431 responded with a HTTP status of 200. Of these, 35,376 returned something that looked like a proper robots.txt (i.e., matching at least one standard directive).”
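The same idea scales up. A simplified, hypothetical sketch of the aggregation step (again, not Weksteen's actual code) reuses the function from the previous snippet across a list of hosts and counts recurring Disallow paths, which is essentially how a refreshed brute-forcing wordlist can be derived:

from collections import Counter

def build_wordlist(hosts):
    """Count how often each Disallow path appears across many sites."""
    counter = Counter()
    for host in hosts:
        try:
            counter.update(fetch_disallowed_paths(host))
        except Exception:
            # Unreachable hosts or malformed robots.txt files are simply skipped
            continue
    return counter

# The most frequent entries become candidates for a brute-forcing wordlist
# print(build_wordlist(["http://example.com", "http://example.org"]).most_common(20))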

Administrators need to spend more time hardening their systems and, where possible, should exclude assets using generic terms rather than specific ones.
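For instance (with hypothetical paths), an entry such as

Disallow: /internal/payroll-reports/

tells an attacker exactly what to look for, while the generic

Disallow: /internal/

excludes the whole tree without naming its contents.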

I suggest you read the full analysis published by Weksteen at the following address:
http://thiébaud.fr/robots.txt.html

About the Author Elsio Pinto

Elsio Pinto (@high54security) is currently the Lead McAfee Security Engineer at Swiss Re, and he also has knowledge in the areas of malware research, forensics, and ethical hacking. He has previously worked at major institutions, the European Parliament being one of them. He is a security enthusiast and does his best to pass on his knowledge. He also runs his own blog at http://high54security.blogspot.com/

Edited by Pierluigi Paganini

(Security Affairs – Robots.txt, hacking)


