Offline analysis of web logs to identify offensive web crawlers.

dc.contributor.author Algiriyage, N.
dc.date.accessioned 2017-11-20T10:47:29Z
dc.date.available 2017-11-20T10:47:29Z
dc.date.issued 2017
dc.identifier.citation Algiriyage, N. (2017). Offline analysis of web logs to identify offensive web crawlers. International Research Symposium on Pure and Applied Sciences, 2017, Faculty of Science, University of Kelaniya, Sri Lanka. p. 116. en_US
dc.identifier.uri http://repository.kln.ac.lk/handle/123456789/18129
dc.description.abstract With the continuous growth and rapid advancement of web-based services, the traffic generated by web servers has drastically increased. Analyzing such data, commonly known as clickstream data, can reveal a great deal of information about web visitors. These data are often stored in web server “access log files” and in other related resources. Web clients can be broadly categorized into two groups: web crawlers and human visitors. In the recent past, the traffic generated by web crawlers has drastically increased. Web crawlers are programs or automated scripts that scan web pages methodically to create indexes. They traverse the hyperlink structure of the World Wide Web to locate and retrieve information. Web crawler programs are alternatively known as web robots, spiders, bots and scrapers. Web crawlers can be used by anyone seeking to collect information available on the Internet. Search engines such as Google, Yahoo, MSN and Bing use web crawlers to index web pages for their page ranking process. Web administrators employ crawlers to automate maintenance tasks such as checking for broken hyperlinks and validating HTML code. Business organizations, market researchers or anyone else can gather specific types of information such as e-mail addresses, corporate news and product prices. A recent threat is that some crawlers crawl web sites while hiding their own identity and pretending to be someone else. Since Google is the most widely used search engine globally and web site owners do not want to block Googlebot, imposters crawl sites impersonating Googlebot. In effect, they gain privileged access to web sites using the identity of “Googlebot”. Googlebot impersonation can lead to spamming, information theft including business intelligence, or even application-level DDoS (Distributed Denial of Service) attacks. Although fake Googlebots have featured in recent news reports, current understanding of this problem is minimal. While it is possible to identify these Googlebot imposters after the fact (e.g. from web server access log files) by performing a reverse DNS (Domain Name System) lookup followed by a forward DNS lookup on a case-by-case basis, doing this in real time would be more useful but challenging. We observed multiple instances of PHP remote code execution vulnerability scans by these fake Googlebots in our test data sets. Offline or post-mortem analysis of web server access log files can give a deep understanding of traffic patterns and, in particular, help identify offensive web clients. Although the detection is after-the-fact, proactive strategies can be formulated based on the gathered knowledge. This research proposes a methodology to detect malicious web crawlers based on seven behavioral features: hit rate, blank referrer, hidden links, IP verification, IP blacklist checks, access to the “robots.txt” file and access depth. The results show that 36.23% of the crawler sessions exhibit malicious crawling patterns. en_US
dc.language.iso en en_US
dc.publisher International Research Symposium on Pure and Applied Sciences, 2017 Faculty of Science, University of Kelaniya, Sri Lanka. en_US
dc.subject Web-crawler en_US
dc.subject Web server access logs en_US
dc.subject Web usage mining en_US
dc.title Offline analysis of web logs to identify offensive web crawlers. en_US
dc.type Article en_US
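
The abstract above describes verifying a client that claims to be Googlebot by pairing a reverse DNS lookup with a forward DNS lookup over the IP address recorded in the access log. Below is a minimal sketch of that check, assuming Python and its standard socket module; the helper name and the example IP address are illustrative and are not taken from the published work.

import socket

def is_genuine_googlebot(ip_address):
    """Return True only if the IP reverse-resolves to a Google crawler host
    and that host forward-resolves back to the same IP."""
    try:
        # Reverse DNS: IP -> host name (genuine Google crawlers resolve
        # under googlebot.com or google.com)
        host, _, _ = socket.gethostbyaddr(ip_address)
    except socket.herror:
        return False  # no PTR record, so the Googlebot claim cannot be verified
    if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
        return False
    try:
        # Forward DNS: host name -> IP addresses; the original IP must appear
        _, _, addresses = socket.gethostbyname_ex(host)
    except socket.gaierror:
        return False
    return ip_address in addresses

# Example: check an IP taken from an access log entry whose User-Agent
# string claims to be Googlebot (the address below is hypothetical).
print(is_genuine_googlebot("66.249.66.1"))

A check of this kind, run offline over sessions whose User-Agent claims to be Googlebot, appears to correspond to the “IP verification” feature listed among the seven behavioral features in the abstract.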

