Thiébaud Weksteen, a Melbourne-based penetration tester, is warning system administrators that robots.txt files can hand valuable details to attackers, because robots.txt tells search engines which directories on a web server may and may not be crawled.
Weksteen, a security expert at Securus Global, explained that system administrators give attackers clues about where sensitive assets and information are stored whenever they list those paths in the robots.txt file. Analysing robots.txt can help an intruder target the attack instead of probing blindly.
“In the simplest cases, it (robots.txt) will reveal restricted paths and the technology used by your servers,” Weksteen said. “From a defender’s perspective, two common fallacies remain: that robots.txt somehow acts as an access control mechanism [and that] content will only be read by search engines and not by humans.”
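As a purely hypothetical illustration (these paths are invented, not taken from Weksteen’s research), a file like the following already tells an attacker where an administration area lives and suggests that the site runs WordPress:

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /backup/
    Disallow: /internal/reports/

Every Disallow line names a path the administrator considers too sensitive to be indexed, which is precisely what makes it interesting to a human attacker.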
Just as administrators often fail to invest enough time in patching vulnerabilities, robots.txt files frequently betray the same lack of care, routinely exposing long lists of disallowed paths.
Reconnaissance is an essential activity for a penetration tester; harvesting robots.txt files yields valuable information about a target’s sensitive assets.
“During the reconnaissance stage of a web application testing, the tester usually uses a list of known subdirectories to brute force the server and find hidden resources.
Depending on the uptake of certain web technologies, it needs to be refreshed on a regular basis. As you may see, the directive disallow gives an attacker precious knowledge on what may be worth looking at. Additionally, if that is true for one site, it is worth checking for another.”
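A minimal sketch of this kind of harvesting, written in Python with only the standard library (the target URL is a placeholder and none of this is Weksteen’s own tooling), could look like this:

    import urllib.parse
    import urllib.request

    def harvest_disallowed_paths(base_url, timeout=10):
        """Fetch a site's robots.txt and return the paths it asks crawlers to avoid."""
        robots_url = urllib.parse.urljoin(base_url, "/robots.txt")
        with urllib.request.urlopen(robots_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
        paths = []
        for line in body.splitlines():
            # Drop comments and whitespace, keep only Disallow directives.
            line = line.split("#", 1)[0].strip()
            if line.lower().startswith("disallow:"):
                path = line.split(":", 1)[1].strip()
                if path:  # an empty Disallow value means "allow everything"
                    paths.append(path)
        return paths

    if __name__ == "__main__":
        # Hypothetical target; the harvested paths would typically feed the
        # wordlist used during the brute-force stage described above.
        for path in harvest_disallowed_paths("http://example.com"):
            print(path)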
This is done with a crawling technique; as Weksteen reported, of a total of 59,558 crawled websites, 59,436 returned a response.
“This is a good indicator of the freshness of Common Crawl results. Of these, 37,431 responded with a HTTP status of 200. Of these, 35,376 returned something that looked like a proper robots.txt (i.e., matching at least one standard directive).”
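The filtering step described above can be approximated with a short check; this is a sketch of the idea rather than Weksteen’s actual code, and the list of directives is an assumption based on the common robots.txt grammar:

    import re

    # Matching at least one well-known directive serves as a rough test that a
    # 200 response really is a robots.txt and not, say, an HTML error page
    # served with the wrong status code.
    STANDARD_DIRECTIVES = re.compile(
        r"^\s*(user-agent|disallow|allow|sitemap|crawl-delay)\s*:",
        re.IGNORECASE | re.MULTILINE,
    )

    def looks_like_robots_txt(status_code, body):
        """Return True if the response plausibly contains a proper robots.txt."""
        return status_code == 200 and bool(STANDARD_DIRECTIVES.search(body))

    print(looks_like_robots_txt(200, "User-agent: *\nDisallow: /admin/"))  # True
    print(looks_like_robots_txt(200, "<html><body>Oops</body></html>"))    # False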
Administrators need to spend more time hardening their systems and, where possible, should exclude assets using generic terms rather than specific ones.
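For example (again with invented paths), a rule written in generic terms leaks far less than one that names the asset directly:

    # Specific: confirms that a backup archive exists and reveals exactly where it is.
    Disallow: /backups/customers-2014.tar.gz

    # Generic: hides the whole area without naming any individual asset.
    Disallow: /backups/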
I suggest reading the full analysis published by Weksteen at the following address:
http://thiébaud.fr/robots.txt.html
About the Author: Elsio Pinto
Edited by Pierluigi Paganini
(Security Affairs – Robots.txt, hacking)