Symposia & Conferences
Permanent URI for this community: http://repository.kln.ac.lk/handle/123456789/10213
Search Results
Item: Offline analysis of web logs to identify offensive web crawlers. (International Research Symposium on Pure and Applied Sciences, Faculty of Science, University of Kelaniya, Sri Lanka, 2017) Algiriyage, N.

With the continuous growth and rapid advancement of web-based services, the traffic generated by web servers has increased drastically. Analyzing such data, commonly known as clickstream data, can reveal a great deal about web visitors. These data are typically stored in web server “access log files” and in other related resources. Web clients can be broadly categorized into two groups: web crawlers and human visitors. In the recent past, the traffic generated by web crawlers has increased drastically. Web crawlers are programs or automated scripts that scan web pages methodically to create indexes; they traverse the hyperlink structure of the World Wide Web to locate and retrieve information. Web crawler programs are alternatively known as web robots, spiders, bots and scrapers. Web crawlers can be used by anyone seeking to collect information available on the Internet. Search engines such as Google, Yahoo, MSN and Bing use web crawlers to index web pages for their page-ranking processes. Web administrators employ crawlers to automate maintenance tasks such as checking for broken hyperlinks and validating HTML code. Business organizations, market researchers or anyone else can gather specific types of information such as e-mail addresses, corporate news and product prices.

A recent threat is that some crawlers scan web sites while hiding their own identity and pretending to be someone else. Since Google is the most widely used search engine globally and web site owners do not want to block the Googlebot, imposters try to crawl sites impersonating Googlebot, thereby gaining privileged access to web sites under the identity of “Googlebot”. Googlebot impersonation can lead to spamming, information theft including business intelligence, or even application-level DDoS (Distributed Denial of Service) attacks. Although fake Googlebots have featured in recent news items, current understanding of this problem is minimal. While it is possible to identify these Googlebot imposters after the fact (e.g. from web server access log files) by performing a reverse DNS (Domain Name System) lookup followed by a forward DNS lookup on a case-by-case basis, doing this in real time would be far more useful, although challenging. We observed multiple instances of PHP remote code execution vulnerability scans by these fake Googlebots in our test data sets.

Offline or post-mortem analysis of web server access log files can give a deep understanding of traffic patterns and, in particular, help identify offensive web clients. Although the detection happens after the fact, proactive strategies can be formulated from the gathered knowledge. This research proposes a methodology to detect malicious web crawlers based on seven behavioral features: hit rate, blank referrer, hidden links, IP verification, IP blacklist checks, access to the “robots.txt” file and access depth.
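As a rough illustration of the IP-verification feature listed above (the reverse DNS lookup followed by a forward DNS lookup that the abstract describes for unmasking Googlebot imposters), the following Python sketch checks whether an IP address claiming to be Googlebot actually belongs to Google. The function name, the accepted host-name suffixes and the example addresses are illustrative assumptions, not part of the original study.

import socket

def is_genuine_googlebot(ip):
    """Verify a client claiming to be Googlebot via reverse plus forward DNS.

    Returns True only if the PTR record for `ip` points to a googlebot.com
    or google.com host AND that host resolves back to the same IP address.
    """
    try:
        # Reverse DNS: IP address -> host name
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward DNS: host name -> IP addresses; must include the original IP
        _, _, addresses = socket.gethostbyname_ex(host)
        return ip in addresses
    except (socket.herror, socket.gaierror):
        # No PTR record or failed lookup: treat the client as unverified
        return False

if __name__ == "__main__":
    # Example IPs (illustrative only); in an offline analysis these would be
    # the distinct client IPs whose User-Agent string claims to be Googlebot.
    for ip in ["66.249.66.1", "203.0.113.45"]:
        print(ip, "verified" if is_genuine_googlebot(ip) else "possible imposter")

In an offline analysis such a check would typically be run once per distinct client IP extracted from the access log rather than per request, since DNS lookups are comparatively slow.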
The results of the study show that 36.23% of the crawler sessions exhibit malicious crawling patterns.

Item: An approach to personalize learning using big data analytics for higher education. (Faculty of Science, University of Kelaniya, Sri Lanka, 2016) Jabir, A.; Rajapakse, C.

The concept of BYOD (Bring Your Own Device) has gained popularity in student-centered learning, and higher education institutions make significant investments in improving their wireless networks to support it. Virtual Learning Environments and Learning Management Systems have already been introduced, and personalization of learning is the next milestone. The huge streams of data produced by these Wi-Fi networks lay the ground for Big Data analytics to identify opportunities for adopting personalized learning in educational environments. The term ‘personalization’ refers to tailoring content and recommending items by inferring what interests a user, based on previous or current interactions with that user and possibly with other users. This research proposes an approach to personalize learning on an online learning platform by providing personalized recommendations of educational web resources, giving comparative feedback, and allocating personalized bandwidth based on the concept of deprioritization (lowering the priority ranks of heavy users). Concepts from Big Data analytics and data mining techniques will be used to meet these objectives.

The approach consists of an offline (modelling) phase and an online (recommendation/deprioritization) phase. In the offline phase, models will be developed for recommendation and deprioritization separately. For recommendation, a hybrid filtering method will be used: k-Nearest Neighbour (k-NN), a user-based collaborative filtering technique with a correlation-based similarity measure, will be combined with demographic filtering based on demographic classifiers (faculty, year, General/Special/Honors, GPA) to eliminate the cold-start problem. To increase efficiency and accuracy, k-means clustering will be used as an intermediate step to determine usage clusters, which group users exhibiting similar browsing patterns, and page clusters, which group pages with similar access patterns. For this, the access logs of the University of Kelaniya’s Wi-Fi network will be utilized. The parameters for usage clustering would be the timestamp, web resource and category (education, social networking, gaming, etc.), whereas the parameters for page clustering would be category and temporal concepts.

In the online phase, the cluster that the current active user belongs to will first be identified, and k-NN will be applied within that cluster to recommend web resources. These techniques also provide the basis for comparative feedback against the top scorers in the same major. For personalized bandwidth allocation, a separate k-means clustering will be performed during the offline phase to identify heavy users. During the online phase, deprioritization will be applied if the current user belongs to the heavy-user cluster and there is heavy traffic on the network. Cross-validation will be used to evaluate the models.
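A minimal sketch of the recommendation step described above, i.e. user-based k-Nearest Neighbour collaborative filtering with a correlation-based similarity measure, is given below. The toy usage matrix, parameter names and the use of Pearson correlation over co-accessed resources are illustrative assumptions; in the proposed approach the neighbourhood would additionally be restricted to the active user's k-means usage cluster and blended with demographic filtering, neither of which is shown here.

import numpy as np

def pearson_sim(a, b):
    """Correlation-based similarity between two users' usage vectors,
    computed only over resources both users have accessed."""
    mask = (a > 0) & (b > 0)
    if mask.sum() < 2:
        return 0.0
    x, y = a[mask], b[mask]
    denom = x.std() * y.std()
    if denom == 0:
        return 0.0
    return float(((x - x.mean()) * (y - y.mean())).mean() / denom)

def recommend(user_idx, usage, k=3, top_n=5):
    """Recommend unseen resources to `user_idx` using its k nearest
    neighbours; rows of `usage` are users, columns are web resources."""
    target = usage[user_idx]
    sims = np.array([pearson_sim(target, other) if i != user_idx else -np.inf
                     for i, other in enumerate(usage)])
    neighbours = np.argsort(sims)[-k:]          # k most similar users
    scores = usage[neighbours].mean(axis=0)     # average neighbour usage
    scores[target > 0] = -np.inf                # exclude already-seen resources
    return np.argsort(scores)[::-1][:top_n]

# Toy usage matrix: 4 users x 6 web resources (e.g. access counts per session)
usage = np.array([[5, 0, 3, 0, 2, 0],
                  [4, 0, 3, 1, 0, 0],
                  [0, 2, 0, 4, 0, 5],
                  [5, 1, 4, 0, 0, 0]], dtype=float)
print(recommend(user_idx=0, usage=usage, k=2, top_n=2))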