Open source optical character recognition for historical research

Journal of Documentation, Volume 68 (5): 25 – Aug 31, 2012

Publisher: Emerald Publishing

Copyright: Copyright © 2012 Emerald Group Publishing Limited. All rights reserved.

ISSN: 0022-0418

DOI: 10.1108/00220411211256021

Abstract

Purpose – This paper evaluates open source OCR as a means of supporting research on material in small‐ to medium‐scale historical archives.

Design/methodology/approach – The authors developed a workflow engine, built on open source technologies, that makes it easy to customise the OCR process for historical materials. Commercial OCR often fails to deliver sufficient results here because its processing is optimised for large‐scale, commercially relevant collections. The approach presented here allows users to combine the most effective parts of different OCR tools.

Findings – The authors demonstrate their application and its flexibility and present two case studies that show how OCR can be embedded into wider digitally enabled historical research. The first case study produces high‐quality, research‐oriented digitisation outputs, utilising services the authors developed to link each digitisation image directly to its OCR output. The second case study demonstrates what becomes possible if OCR can be customised directly within a larger research infrastructure for history. In such a scenario, further semantics can easily be added to the workflow, significantly enhancing the research browsing experience.

Originality/value – There has been little work on the use of open source OCR technologies for historical research. This paper demonstrates that the authors' workflow approach allows users to combine commercial engines' ability to read a wider range of character sets with the flexibility of open source tools in terms of customisable pre‐processing and layout analysis. All this can be done without the need to develop dedicated code.
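The idea of a workflow that chains interchangeable OCR stages can be illustrated with a minimal sketch. This is not the authors' engine: the `Workflow` class and the `binarise`, `segment_lines` and `recognise` steps are hypothetical stand-ins, written in plain Python with toy logic, and are only meant to show how pre-processing, layout analysis and recognition steps from different tools might be swapped in and out of one pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A step receives the shared page state and returns an updated copy.
Step = Callable[[Dict], Dict]


@dataclass
class Workflow:
    """Hypothetical workflow: an ordered list of pluggable steps."""
    steps: List[Step] = field(default_factory=list)

    def add(self, step: Step) -> "Workflow":
        self.steps.append(step)
        return self  # allow chaining: Workflow().add(a).add(b)

    def run(self, page: Dict) -> Dict:
        for step in self.steps:
            page = step(page)
        return page


def binarise(page: Dict) -> Dict:
    # Toy pre-processing: grey values below 128 count as ink (1), others as background (0).
    pixels = [[1 if v < 128 else 0 for v in row] for row in page["pixels"]]
    return {**page, "pixels": pixels}


def segment_lines(page: Dict) -> Dict:
    # Toy layout analysis: every pixel row containing ink is treated as a text line.
    lines = [i for i, row in enumerate(page["pixels"]) if any(row)]
    return {**page, "lines": lines}


def recognise(page: Dict) -> Dict:
    # Toy recogniser: emits a placeholder string per detected line.
    # A real step would call an OCR engine here instead.
    return {**page, "text": [f"<line {i}>" for i in page["lines"]]}


# Assemble the pipeline; any step could be replaced by one wrapping another tool.
workflow = Workflow().add(binarise).add(segment_lines).add(recognise)
result = workflow.run({"pixels": [[255, 255], [0, 255], [255, 255], [10, 20]]})
print(result["text"])
```

Because each step only reads and extends a shared dictionary, replacing, say, the layout analysis of one tool with that of another amounts to passing a different function to `add`, which is the kind of recombination the abstract describes.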

Journal

Journal of Documentation, Emerald Publishing

Published: Aug 31, 2012

Keywords: Digital history; Optical character recognition; Open source; Workflows; Historical collections; Archives; Digital libraries
