A new generation of textual corpora: mining corpora from very large collections

Association for Computing Machinery — Jun 18, 2007


Datasource
Association for Computing Machinery
Copyright
Copyright © 2007 by ACM Inc.
ISBN
978-1-59593-644-8
doi
10.1145/1255175.1255247

Abstract

Gordon Stewart, Harvard University, Language Resource Center, Cambridge, MA 02138, stewart5@fas.harvard.edu
Gregory Crane, Tufts University, Perseus Project, Medford, MA 02155, gregory.crane@tufts.edu
Alison Babeu, Tufts University, Perseus Project, Medford, MA 02155

While digital libraries based on page images and automatically generated text have made possible massive projects such as the Million Book Library, Open Content Alliance, Google, and others, humanists still depend upon textual corpora expensively produced with labor-intensive methods such as double-keyboarding and manual correction. This paper reports the results from an analysis of OCR-generated text for classical Greek source texts. Classicists have depended upon specialized manual keyboarding that costs two or more times as much as keyboarding of English both for accuracy and because classical Greek OCR produced no usable results. We found that we could produce texts by OCR that, in some cases, approached the 99.95% professional data entry accuracy rate. In most cases, OCR-generated text yielded results that, by including the variant readings that digital corpora traditionally have left out, provide better recall and, we argue, can better serve many scholarly needs than the expensive corpora upon which classicists have relied for
