Development of Lekha OCR

Event Date: 2016-03-31

Lekha OCR is a Malayalam Optical Character Recognizer (OCR) developed by SPACE with funding from ICFOSS. Its indigenous name, Lekha, means “writing”. The software converts scanned images of Malayalam text into an editable format.

Lekha OCR & the Climb to 85 per cent Accuracy

Lekha OCR was developed between June 2015 and March 2016. The team began by studying the features of Malayalam characters and identifying the Malayalam fonts commonly used by publishers such as DC Books and Malayala Manorama. From these fonts, they created samples of Malayalam letters. Initial accuracy was poor (40 per cent), so the team enlarged the dataset by adding more fonts. Accuracy then rose to 75 per cent, on par with Google Tesseract’s 75-76 per cent.

The team then looked for further improvements. By combining different features, they raised Lekha OCR’s accuracy in identifying text characters to 85 per cent, where it now stands. Phase 2 of the project is under way, with a focus on language properties, layout and interface.
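The accuracy figures quoted here refer to how many text characters are identified correctly. The article does not say how the team measured this, so the following is only an illustrative sketch of one common convention: character accuracy taken as 1 minus the edit distance between the OCR output and a reference transcription, divided by the reference length. The function names `edit_distance` and `character_accuracy` are hypothetical and are not part of Lekha OCR. The sketch also counts Unicode code points, which is a simplification for Malayalam, where vowel signs are separate code points.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1   # extra character in the OCR output
            deletion = previous[j] + 1       # reference character missing from the output
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Hypothetical example: the reference word and an OCR result with one wrong character.
    reference_text = "മലയാളം"
    ocr_output = "മലയാലം"
    print(f"character accuracy: {character_accuracy(reference_text, ocr_output):.2%}")
```

Under this convention, a score of 85 per cent would mean roughly 15 character-level errors (substitutions, insertions or deletions) per 100 reference characters.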
