PART OF SPEECH TAGGER FOR SINHALA LANGUAGE

Jayaweera, A.J.P.M.P.

Please use this identifier to cite or link to this item: http://repository.kln.ac.lk/handle/123456789/17513

Title:	PART OF SPEECH TAGGER FOR SINHALA LANGUAGE
Authors:	Jayaweera, A.J.P.M.P.
Issue Date:	2015
Citation:	Jayaweera, A.J.P.M.P. (2015). PART OF SPEECH TAGGER FOR SINHALA LANGUAGE. M.Phil. Thesis, University of Kelaniya.
Series/Report no.:	TH;1370
Abstract:	This dissertation presents a stochastic Part of Speech tagging method for the Sinhala language. Part of Speech (PoS) tagging is vital to any Natural Language Processing task that involves analyzing the construction, behavior, and dynamics of a language. This knowledge can be used in computational linguistic analysis and automation applications. The motivation behind the research was to fill the gaps that currently exist in Natural Language Processing (NLP) research on and analysis of the Sinhala language, and to give a push to computational linguistic analysis of Sinhala. Sinhala is a morphologically rich language in which words are inflected with various grammatical features, and tagging is essential for further analysis of the language. Our research is based on a statistical approach in which tagging is done by computing the tag sequence probability and the word-likelihood probability from the given corpus, so that the linguistic knowledge is extracted automatically from the annotated text. Our effort was mainly focused on designing an architecture for the tagger and developing the tagger. The implementation was based on a well-known stochastic model, the Hidden Markov Model (HMM). The distinction between open-class and closed-class word categories, together with syntactic features of the language, was used to predict the lexical categories of unknown words. The Simple Good-Turing algorithm and Witten-Bell discounting were used to resolve sparse data issues. The tagger was evaluated using the corpora and the tag set developed by the University of Colombo School of Computing (UCSC) in 2005 under the PAN Localization Project. The model was tested against 90,551 words and 2,754 sentences of a Sinhala text corpus, and the tagger reached over 90% accuracy, a considerable improvement over previous work reported in 2004 and 2013. In 2004, a Hidden Markov Model based Part of Speech tagger using a bigram model was proposed and reported only 60% accuracy, and in 2013 another Hidden Markov Model based approach was tried and reported around 62% accuracy. However, the tagger we implemented has shown an overall accuracy of more than 90%. A set of improvements is suggested in this dissertation, mainly in the area of handling unknown words. Even though these other studies were carried out for the Sinhala language, their taggers are not available as tools for further analysis of Sinhala. As an additional product of this work, we have therefore made the tagger that we implemented freely accessible to the public through an online web interface.
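The statistical approach described in the abstract, which combines a tag sequence probability with a word-likelihood probability estimated from an annotated corpus, can be illustrated with a minimal bigram HMM tagger decoded with the Viterbi algorithm. This is only a sketch under stated assumptions, not the thesis's implementation: the function names (`train_hmm`, `log_prob`, `viterbi`), the toy English corpus, and the tag names are hypothetical, and add-one smoothing is used as a simple stand-in for the Simple Good-Turing and Witten-Bell methods named in the abstract.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Count tag bigrams and word/tag pairs from an annotated corpus."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of words seen with it
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
        transitions[prev]["</s>"] += 1  # sentence-end marker
    return transitions, emissions, sorted(emissions), vocab

def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log probability (assumption: stands in for the
    Good-Turing / Witten-Bell smoothing used in the thesis)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + num_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for a list of words."""
    if not words:
        return []
    transitions, emissions, tags, vocab = model
    n_next = len(tags) + 1    # possible next states: every tag plus "</s>"
    n_words = len(vocab) + 1  # known words plus one slot for unknown words
    # best[i][t] = (log score of the best path ending in tag t at i, backpointer)
    best = [{t: (log_prob(transitions["<s>"], t, n_next)
                 + log_prob(emissions[t], words[0], n_words), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emit = log_prob(emissions[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], t, n_next) + emit, p)
                for p in tags)
        best.append(column)
    # Choose the final tag, including the transition to the end marker.
    last = max(tags, key=lambda t: best[-1][t][0]
               + log_prob(transitions[t], "</s>", n_next))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

# Hypothetical toy corpus; a real tagger would be trained on annotated Sinhala text.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_hmm(corpus)
print(viterbi(["the", "cat", "barks"], model))
```

On this toy corpus the call tags the sentence as DET, NOUN, VERB. An unseen word receives the same smoothed emission probability under every tag, so its tag is decided by the surrounding tag sequence alone; the open-class and closed-class distinction mentioned in the abstract would refine that step, but it is not modeled here.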
URI:	http://repository.kln.ac.lk/handle/123456789/17513
Appears in Collections:	MPhil.Theses MPhil / PhD Theses

Files in This Item:

File	Description	Size	Format
TH1370.pdf		484.29 kB	Adobe PDF	View/Open
