Browsing by Author "Algiriyage, N."
Item: Offline analysis of web logs to identify offensive web crawlers. (International Research Symposium on Pure and Applied Sciences, Faculty of Science, University of Kelaniya, Sri Lanka, 2017) Algiriyage, N.

With the continuous growth and rapid advancement of web-based services, the traffic handled by web servers has drastically increased. Analyzing such data, commonly known as clickstream data, can reveal a great deal about web visitors. These data are typically stored in web server "access log files" and in other related resources. Web clients can be broadly categorized into two groups: web crawlers and human visitors. In the recent past, the traffic generated by web crawlers has increased sharply. Web crawlers are programs or automated scripts that scan web pages methodically to create indexes. They traverse the hyperlink structure of the World Wide Web to locate and retrieve information. Web crawler programs are alternatively known as web robots, spiders, bots and scrapers. Web crawlers can be used by anyone seeking to collect information available on the Internet. Search engines such as Google, Yahoo, MSN and Bing use web crawlers to index web pages for their page-ranking processes. Web administrators employ crawlers to automate maintenance tasks such as checking for broken hyperlinks and validating HTML code. Business organizations, market researchers or indeed anyone can gather specific types of information such as e-mail addresses, corporate news and product prices. A recent threat is that some crawlers scan web sites while hiding their own identity and pretending to be someone else. Since Google is the most widely used search engine globally and web site owners do not want to block the Googlebot, imposters try to crawl sites impersonating Googlebot, thereby gaining privileged access to web sites under the identity of "Googlebot". Googlebot impersonation can lead to spamming, information theft including business intelligence, or even application-level DDoS (Distributed Denial of Service) attacks. Although fake Googlebots have appeared in recent news items, current understanding of this problem is minimal. While it is possible to identify these Googlebot imposters after the fact (e.g. from web server access log files) by performing a reverse DNS (Domain Name System) lookup followed by a forward DNS lookup on a case-by-case basis, doing so in real time would be far more useful, but challenging (a sketch of this verification appears after this abstract). We observed multiple instances of PHP remote code execution vulnerability scans by these fake Googlebots in our test data sets. Offline or post-mortem analysis of web server access log files can give a deep understanding of traffic patterns and, in particular, help identify offensive web clients. Although the detection is after the fact, proactive strategies can be formulated from the gathered knowledge. This research proposes a methodology to detect malicious web crawlers based on seven behavioral features: hit rate, blank referrer, hidden links, IP verification, IP blacklist checks, access to the "robots.txt" file and access depth. The results show that 36.23% of the crawler sessions exhibit malicious crawling patterns.
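The reverse-plus-forward DNS verification mentioned above can be sketched as follows. This is a minimal illustration using only Python's standard socket module, assuming the conventional check that genuine Googlebot hosts resolve under googlebot.com or google.com; the helper name and the sample IP address are hypothetical and not taken from the study.

import socket

def is_genuine_googlebot(ip_address: str) -> bool:
    """Return True only if the IP passes the reverse + forward DNS check."""
    try:
        # Reverse DNS: resolve the IP address to a host name (PTR record).
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip_address)
    except socket.herror:
        return False  # No PTR record, so the claim cannot be verified.

    # Genuine Googlebot hosts belong to the googlebot.com or google.com domains.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False

    try:
        # Forward DNS: the host name must resolve back to the same IP address.
        forward_ip = socket.gethostbyname(hostname)
    except socket.gaierror:
        return False

    return forward_ip == ip_address

# Example: flag a client whose user-agent string claims to be Googlebot.
claimed_ip = "203.0.113.7"  # hypothetical address taken from an access log line
if not is_genuine_googlebot(claimed_ip):
    print(claimed_ip, "looks like a Googlebot imposter")

Performing these lookups for every incoming request is what makes real-time detection challenging; an offline analysis can afford to run the check once per distinct client IP found in the access log.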
Item: Prediction of type 2 diabetes risk factor using machine learning in Sri Lanka. (Research Symposium on Pure and Applied Sciences, Faculty of Science, University of Kelaniya, Sri Lanka, 2018) Menike, R. M. S. D.; Jayalal, S. G. V. S.; Algiriyage, N.

Diabetes mellitus ranks third among the 20 major diseases contributing to deaths in Sri Lanka. Diagnosing diabetes is an important yet tedious task, and no reliable, simple and accurate method has been established for identifying diabetes mellitus at an early stage. Currently, diabetes is detected using blood tests such as the glycated hemoglobin (A1C) test, the random blood sugar test, the fasting plasma glucose test, the oral glucose tolerance test and the blood sugar series. People without a specific complaint are generally unwilling to undergo a blood test, a process that costs them both time and money. Diabetes mellitus cannot be fully cured, but if it is identified at the prediabetes stage, progression to type 2 diabetes can be prevented through actions such as eating healthy food, losing weight and being physically active. As there are no regular medical checkups to diagnose prediabetes among the general public, identifying prediabetes is problematic in Sri Lanka. Machine learning techniques have been successfully applied to predict the risk factor for diabetes mellitus in other countries, but due to the high variance in economic and cultural factors it is very difficult to build a single model common to all countries. The detection of diabetes from a set of important risk factors is a multi-layered process. This research is primarily aimed at identifying the factors that contribute to the prevalence of diabetes in Sri Lanka and at finding a mechanism to predict diabetes risk by applying machine learning algorithms to the identified factors. The gathered dataset consists of anthropometric and behavioral data, such as age, BMI, gender, heredity and hypertension, for a set of people with and without diabetes. Wrapper methods are used to identify the most influential of these factors affecting diabetes mellitus. Since earlier studies have shown good performance with them, Support Vector Machine, J48 and Random Forest algorithms are used to classify the selected dataset. As a result, three models are generated, and the performance of each model is measured using metrics such as accuracy, specificity and sensitivity (a sketch of this workflow appears after this abstract). The outcome of the study is the model that shows the best performance. This final model can be used by members of the public, without specific domain knowledge, to obtain a more accurate indication of their own diabetes risk by supplying data for the identified factors as inputs.
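As a rough illustration of the classification-and-evaluation workflow described above, the sketch below uses scikit-learn. The file name diabetes_risk.csv, the column names and the binary label are assumptions based on the abstract, and DecisionTreeClassifier is used only as a stand-in for Weka's J48 (C4.5); none of these details are taken from the study itself.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset: feature columns drawn from the abstract, assumed to be
# already numerically encoded, plus a binary "diabetes" label (1 = diabetic).
data = pd.read_csv("diabetes_risk.csv")
X = data[["age", "bmi", "gender", "heredity", "hypertension"]]
y = data["diabetes"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "Support Vector Machine": SVC(kernel="rbf"),
    "Decision tree (J48 stand-in)": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    print(f"{name}: accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, "
          f"specificity={specificity:.3f}")

In the study itself a wrapper feature-selection step precedes model fitting; scikit-learn's RFE or SequentialFeatureSelector could fill that role, but the abstract does not specify the exact procedure, so it is omitted here.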