Khmer Optical Character Recognition Initiative
Written by Administrator
Saturday, 24 October 2009 02:59
The Khmer OCR project started in July 2008. After full training, two engines of PLC's OCR system were created. They can recognize good-quality scanned documents in a plain layout (no tables, pictures, columns, bold, italic, or underlined text) printed in either Limon S1 or Limon R1 at size 22. The output is ASCII text, which can easily be converted into Khmer Unicode using the existing Conversion library, one of the libraries developed during Phase I of the PAN Localization project. The output can also be saved in two document formats: Microsoft Word 2003 (*.doc) and plain text (*.txt).
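As a rough illustration of the ASCII-to-Unicode step, the sketch below shows what a table-driven conversion might look like. It is not the Conversion library's actual code. The mapping table is a hypothetical placeholder, not the real Limon S1 or Limon R1 layout, and the function name `legacy_to_unicode` is invented for this example. The sketch also assumes that the legacy text stores left-side vowel signs before their consonant (visual order), whereas Unicode stores them after it.

```python
# Hypothetical legacy-code -> Unicode table. These assignments are
# placeholders for illustration only, NOT the real Limon S1/R1 layout.
LEGACY_TO_UNICODE = {
    "k": "\u1780",  # KHMER LETTER KA
    "x": "\u1781",  # KHMER LETTER KHA
    "a": "\u17B6",  # KHMER VOWEL SIGN AA
    "e": "\u17C1",  # KHMER VOWEL SIGN E (drawn to the left of its consonant)
}

# Vowel signs drawn to the left of the consonant. Assumption: the legacy
# text stores them before the consonant; Unicode wants them after it.
PRE_BASE_VOWELS = {"\u17C1", "\u17C2", "\u17C3"}


def is_khmer_consonant(ch):
    """True for code points in the Khmer consonant range U+1780..U+17A2."""
    return "\u1780" <= ch <= "\u17A2"


def legacy_to_unicode(text, table=LEGACY_TO_UNICODE):
    """Map each legacy character and move pre-base vowels after their consonant."""
    out = []
    pending = None  # a pre-base vowel waiting for the consonant that follows it
    for ch in text:
        mapped = table.get(ch, ch)  # unmapped characters pass through unchanged
        if mapped in PRE_BASE_VOWELS:
            if pending is not None:
                out.append(pending)  # no consonant arrived; emit it as-is
            pending = mapped
            continue
        if pending is not None and not is_khmer_consonant(mapped):
            out.append(pending)  # next character is not a consonant; flush first
            pending = None
        out.append(mapped)
        if pending is not None:
            out.append(pending)  # consonant first, then the vowel sign
            pending = None
    if pending is not None:
        out.append(pending)
    return "".join(out)
```

A real converter would also have to handle subscript consonants, ligatures, and stacked marks, which this sketch deliberately ignores.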
For Cambodian developers, OCR is a novel and challenging technology. It requires solid knowledge of digital image processing and strong programming skills. Moreover, the complexity of Khmer script calls for the direct involvement of native Khmer developers, because they understand how Khmer words are formed. Besides these challenges, OCR is an intricate project to develop within a short period, because it requires intensive analysis of the nature of the source documents. Working on old, degraded documents demands both rigorous study and advanced technology, since such documents contain noise (components that must be eliminated, or they will confuse the OCR application), unclear text, skewed lines, formatted words (bold, italic, underline, strikethrough), and so on.
Demand for such an application is now very high wherever digital content is compiled. One example is the creation of an e-library, which requires manual typing if OCR is unavailable. With Khmer OCR, the work speeds up and the cost falls. Khmer OCR is also very useful at the Extraordinary Chambers in the Courts of Cambodia (ECCC), for the judges and other related parties: imagine how difficult it is to find the name "Pol Pot" in piles of documents. Why not let an OCR application do the task by making all documents searchable? That saves a lot of valuable time. Furthermore, many government institutions are now considering digitizing their old documents, and Khmer OCR can help with that as well.









