Sinhala Character Recognition using Tesseract OCR

Manisha, U.K.D.N.; Liyanage, S.R.

Please use this identifier to cite or link to this item: http://repository.kln.ac.lk/handle/123456789/18983

Title:	Sinhala Character Recognition using Tesseract OCR
Authors:	Manisha, U.K.D.N. Liyanage, S.R.
Keywords:	Optical Character Recognition Computer Vision Image Processing Image Segmentation
Issue Date:	2018
Publisher:	3rd International Conference on Advances in Computing and Technology (ICACT ‒ 2018), Faculty of Computing and Technology, University of Kelaniya, Sri Lanka.
Citation:	Manisha, U.K.D.N. and Liyanage, S.R. (2018). Sinhala Character Recognition using Tesseract OCR. 3rd International Conference on Advances in Computing and Technology (ICACT ‒ 2018), Faculty of Computing and Technology, University of Kelaniya, Sri Lanka. p12.
Abstract:	In Sri Lanka, there are many fields that uses Sinhala scripts, such as newspaper editors, writers, postal and application processes. In these fields there have only a scanned or printed copies of Sinhala script, where they have to enter them manually to a computerized system, which consumes much time and cost. The proposed method was consisted of two areas as image pre-processing and training the OCR classifier. In Image pre-processing, the scanned images were enhanced and binarized using image processing techniques such as grayscale conversion and binarization using local thresholding. In order to avoid distortions of scanned images such as water marks and highlights was removed through the grayscale conversion with color intensity changes. In the OCR training, the Tesseract OCR engine was used to create the Sinhala language data file and used the data file with a customized function to detect Sinhala characters in scanned documents. OCR engine was primarily used to create a language data file. First, pre-processed images were segmented (white letters in black background) using local adaptive thresholding where performing Otsu’s thresholding algorithm to separate the text from the background. Then page layout analysis was performed to identify non-text areas such as images, as well as multi-column text into columns. Then used detections of baselines and words by using blob analysis where each blob was sorted using the x-coordinate (left edge of the blob) as the sort key which enables to track the skew across the page. After the separation of each character, then labeled manually into Sinhala language characters. By using the Sinhala language data file into OCR function, it returns the recognized text, the recognition confidence, and the location of the text in the original image. By considering the recognition confidence of each word it is possible to control the accuracy of the system. The classifier was trained using 40 characters sets with 20 images from each character and tested using over 1000 characters (200 words) with variations of font sizes and achieved approximately 97% of accuracy. The elapsed time was less than 0.05 per a line with more than 20 words, which was a higher improvement than a manual data entering. Since the classifier can be retrained using testing images, it can be developed to achieve active learning.
URI:	http://repository.kln.ac.lk/handle/123456789/18983
Appears in Collections:	ICACT 2018

Files in This Item:

File	Description	Size	Format
12.pdf		280.59 kB	Adobe PDF	View/Open

Show full item record

DSpace JSPUI

DSpace preserves and enables easy and open access to all types of digital content including text, images, moving images, mpegs and data sets