A Quantitative/Qualitative Approach to OCR Error Detection and Correction in Old Newspapers for Corpus-assisted Discourse Studies

Del Fante (first author); 2021

Abstract

The use of OCR software to convert printed characters into digital text is fundamental to diachronic approaches to Corpus-assisted Discourse Studies because it allows researchers to broaden their interests by making many texts available and analysable by computer. However, OCR software is not fully accurate, and the resulting error rate compromises its effectiveness. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. The proposed approach consists of three main steps: corpus creation, OCR error detection and correction, and application of the automatic rules. The rules are implemented in R using a “tidyverse” approach for better reproducibility of the experiments.
Corpus-assisted Discourse Studies, OCR detection, OCR correction
Files in this record:
paper5.pdf (open access)
Description: publisher's version
Type: Full text (publisher's version)
License: Creative Commons
Size: 395.77 kB
Format: Adobe PDF

Documents in SFERA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11392/2500963