Evaluation and refinement of an enhanced OCR-process for mass digitisation
Great expectations are placed on the capacity of heritage institutions to make their collections available in digital format. Data-driven research is becoming a key concept within the humanities and social sciences. The collections of digitised newspapers held by Kungliga biblioteket (KB, National Library of Sweden) can thus be regarded as unique cultural data sets, containing information that is rarely conveyed in other media types. The digital format makes it possible to explore these resources in ways not feasible in print.
As texts are no longer only read but also subjected to computer-based analysis, the demands on their reliability increase. Technologies for converting images to machine-readable text – OCR – play a fundamental part in making these resources available, but their effectiveness varies with the type of document being processed. This is evident in the digitisation of newspapers, where factors relating to production, layout and paper quality often impair the OCR output. In order to improve the machine-readable text, especially in relation to the digitisation of newspapers, KB initiated the development of an OCR-module in which key parameters can be adjusted according to the characteristics of the material being processed. The purpose of this project application is to carry out a formal evaluation of this OCR-module and to improve it through systematic text analyses, dictionaries and word lists, with the aim of implementing it in the mass digitisation process.
Final report
Purpose of the project
The purpose of the project was to fine-tune and evaluate a test platform for OCR-production (referred to as the OCR-module). The module was developed by Kungliga biblioteket (KB) in cooperation with the Norwegian software company Zissor and is based on two programs for OCR-production: ABBYY FineReader version 11.1.16 and Tesseract version 4.0. The underlying principle of the module is a word-level comparison of the results from the two programs. If the two results match, the chosen word is given a high degree of correctness. If the words do not match, they are processed further according to a pre-defined scheme in which different word candidates are compared and weighted; the final decision is the alternative with the highest score. The design of the OCR-module also enables adjustment and control of some key parameters in the post-capture part of the OCR-process (e.g. custom-made word lists, linguistic algorithms and refinements based on typographical features) in order to match specific features of the newspaper as a printed product in a historical perspective, where linguistic conventions and layout change over time.
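The comparison principle can be illustrated with a short sketch. The following Python fragment is a minimal illustration only: it assumes the two outputs have already been aligned token by token, and the scoring weights and toy lexicon are placeholders of our own, not the module's actual pre-defined scheme.

```python
# Minimal sketch of the word-level comparison principle. Assumes the two
# OCR outputs are already aligned token by token; weights are illustrative.

def score_candidate(word: str, lexicon: set) -> float:
    """Illustrative weighting: lexicon hits score high, plausible
    letter-only tokens score mid, everything else low."""
    if word.lower() in lexicon:
        return 1.0
    if word.isalpha():
        return 0.5
    return 0.1

def merge_outputs(abbyy_words, tesseract_words, lexicon):
    merged = []
    for a, t in zip(abbyy_words, tesseract_words):
        if a == t:
            # Matching results: accept with high assumed correctness.
            merged.append((a, 1.0))
        else:
            # Mismatch: weight the candidates and keep the best-scoring one.
            best = max((a, t), key=lambda w: score_candidate(w, lexicon))
            merged.append((best, score_candidate(best, lexicon)))
    return merged

lexicon = {"tidning", "stockholm", "nyheter"}
print(merge_outputs(["Tidning", "Sto9kholm"], ["Tidning", "Stockholm"], lexicon))
# -> [('Tidning', 1.0), ('Stockholm', 1.0)]
```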
The results of the project
To enable the assignment defined in the purpose of the project, we prepared a gold standard – an error-free digital reference material – consisting of 402 pages taken from newspapers spanning from 1818 to 2018, the period set for the project. First, we manually selected 201 digitised newspapers, one from each year in the period, carefully chosen to reflect typical variations in layout and typography. Second, two pages were chosen from each newspaper. Third, the image file of each page was segmented down to paragraph level, and each paragraph – 43,613 in total for the complete reference material – was marked with an ID number. Finally, all pages were delivered to an agency for manual transcription. This process consisted of a double-keying procedure in which each page was transcribed by two annotators. We also performed manual annotations on document and paragraph level: two undergraduate students were hired to classify all pages and paragraphs according to pre-defined attributes, and the results were recorded in an Excel document containing all paragraph IDs.
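The double-keying principle can be sketched as follows. This is an illustrative Python fragment, not the agency's actual workflow, and the paragraph-ID format shown is a hypothetical example.

```python
# Sketch of double-keying reconciliation: two independent transcriptions of
# the same paragraph are compared, and disagreements are flagged for manual
# resolution. The paragraph-ID convention here is illustrative only.
import difflib

def flag_disagreements(keying_a: str, keying_b: str):
    """Return the diff opcodes where the two transcriptions differ."""
    matcher = difflib.SequenceMatcher(None, keying_a, keying_b)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

paragraphs = {"1843_p1_seg017": ("Stockholms Dagblad", "Stockholms Dagblat")}
for pid, (a, b) in paragraphs.items():
    for tag, i1, i2, j1, j2 in flag_disagreements(a, b):
        print(f"{pid}: {tag} {a[i1:i2]!r} vs {b[j1:j2]!r}")
# -> 1843_p1_seg017: replace 'd' vs 't'
```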
Quantitative evaluation
We ran the first evaluation on the complete material to get a baseline score. This evaluation was run on each OCR system separately and on the OCR-module, with no external word lists. The results showed that, on the character level, the OCR-module performs better than each of the separate OCR systems; on the word level, however, we did not see an improvement compared to Tesseract.
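For reference, character error rate (CER) and word error rate (WER) against a gold standard are typically computed with edit distance, along the lines of the sketch below. The metric definitions follow common OCR-evaluation practice and are not necessarily identical to the project's exact tooling.

```python
# Standard dynamic-programming edit distance, used for both CER and WER.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character-level errors, normalised by reference length.
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    # Word-level errors: same distance, applied to token lists.
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

gold = "Stockholms Dagblad den 14 mars"
ocr = "Stoekholms Dagblad don 14 mars"
print(f"CER: {cer(gold, ocr):.3f}, WER: {wer(gold, ocr):.3f}")
```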
We then performed several systematic evaluations on specific time periods and examined the results for each period separately. First, we tried to improve upon our baseline by combining the OCR-module and the separate OCR systems with external word lists from three different time periods and with a word list of named entities. Our first approach was rather naive: simply comparing word strings (the lexical word against the erroneous word), which led to only minor improvement on the word level. Our second approach was to calculate a similarity score between the word strings and replace the erroneous word with the candidate with the lowest score, i.e. the smallest string distance. Although this approach was much more successful, it was computationally expensive, in the sense that processing the word combinations took a long time. The evaluation of the second approach showed that words printed in blackletter have, as expected, lower correctness than ordinary text, and the proportion of words not found in the dictionaries was also higher for these words.
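The second approach can be sketched as follows, here using difflib's built-in similarity ratio as a stand-in for the project's scoring (a higher ratio corresponds to a smaller string distance); the period word list is a toy example. The brute-force scan over the whole word list also illustrates why this approach becomes computationally expensive at scale.

```python
# Illustrative sketch: replace a word not found in the period word list with
# the closest dictionary entry by string similarity. Toy word list only.
import difflib

def correct_word(word: str, wordlist: list, cutoff: float = 0.8) -> str:
    if word.lower() in (w.lower() for w in wordlist):
        return word  # already a known word, leave unchanged
    # get_close_matches ranks candidates by similarity ratio.
    matches = difflib.get_close_matches(word, wordlist, n=1, cutoff=cutoff)
    return matches[0] if matches else word

wordlist_1800s = ["tidning", "hvilken", "konung", "rikets"]
print(correct_word("hvi1ken", wordlist_1800s))  # -> "hvilken"
```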
We found that the OCR-module offers better opportunities for assessing the correctness of the processed text, because it measures how many words on a given page were interpreted identically by both programs. This indication is not dependent on the OCR programs' own confidence assessments and can therefore be said to constitute a more useful measure of the text's fidelity to the processed source document.
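As a sketch of this indicator, the agreement share of a page can be computed directly from the two outputs; the token-by-token alignment assumed here is a simplification of the module's actual matching.

```python
# Agreement-based quality indicator: the share of words on a page that both
# OCR programs read identically. Assumes simple token-by-token alignment.

def agreement_ratio(abbyy_words: list, tesseract_words: list) -> float:
    pairs = list(zip(abbyy_words, tesseract_words))
    if not pairs:
        return 0.0
    return sum(a == t for a, t in pairs) / len(pairs)

page_abbyy = ["Kungliga", "biblioteket", "i", "Stoekholm"]
page_tesseract = ["Kungliga", "biblioteket", "i", "Stockholm"]
print(f"agreement: {agreement_ratio(page_abbyy, page_tesseract):.0%}")  # 75%
```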
Qualitative evaluation
Linguistic and typographic analysis of the data was done manually, by examining how layout, paper characteristics (e.g. degradation and discolouration) and printing quality affected the correctness of the OCR production. The segmentation of the reference material down to paragraph level allowed for a high level of detail in this part of the investigation. One factor that affects the correctness of OCR-processing of newspapers is that the text is in most cases arranged in columns. The margins between columns can be quite narrow and irregular, and the OCR programs therefore sometimes treat lines of text as related although they belong to adjacent columns. The analysis showed that the majority of newspapers from 1818 to 1837 contain only one column, with at most three columns (10% of the material). Of the newspapers from 1838 to 1857, 60% contain three columns. From 1858 to 1997 there is more variation. Overall, the most common number of columns is three, occurring in 25% of the whole material. On the paragraph level, the majority of paragraphs contain only one column, but in some newspapers we find segments containing several tables or lists formatted as individual columns within a single paragraph. The printing quality is low in the majority of newspapers but improves from around 2000. Photographic images are virtually absent from the material up until 1898; from 1938 onward, almost all newspapers contain photographic images.
Skewed paragraphs – where the alignment of text lines and columns deviates horizontally or vertically, affecting the correctness of the OCR – are most frequent between 1818 and 1857, where they occur in approximately 13% of the material. From 1858 onward, skewed paragraphs affect only 1% of the material. Throughout the material, lists and tables occur in approximately 1% of the segments, a typographical feature that also affects the quality of OCR processing. Images comprise around 2% of the paragraphs up to 1937; from 1938 we find images in 5% of the paragraphs on average. In many instances, images and graphic elements are treated as text in the OCR process, adding yet another source of error.
Conclusions
The project has demonstrated the possible benefits of applying a system for OCR-processing in which a word-level comparison between the results of two different OCR programs is used as a method for improving the result.
Our analysis showed that the OCR programs' use of external dictionaries is difficult to assess, because commercial software manufacturers are, for understandable reasons, reluctant to share documentation on the principles they apply when choosing the correct word. There is room for further investigation into the use of custom-made word lists and authority records to support OCR-processing of material from specific genres or historical periods.
The project further generated a structured understanding of factors in the source material (e.g. the physical characteristics of the newspaper, the occurrence of images, layout, poor printing quality) that might affect the correctness of the OCR-production.
Finally, it was demonstrated that the OCR-module offers a more reliable measure of the correctness of the text, as it is based on a word-level comparison between the outputs of the two OCR programs. If both results are the same, there are grounds for assuming a higher level of correctness than when the results differ. The study demonstrated that the individual confidence values of the OCR programs are not a fully reliable indication of quality. The principle investigated in this project could therefore be used as a kind of "content declaration" for machine-readable texts, either on page level or for an entire document.
The project has generated freely available reference material, consisting of manually transcribed and annotated resources, that can be used in further analysis, training and improvement of Swedish OCR models. The approach, methods and results have been disseminated throughout the project. The project has generated nine publications, of which five were presented at international conferences, two are Master's thesis projects and one is currently under review.
The OCR-module is not yet implemented in the mass-digitisation process at KB, as further adjustments are needed to the technical infrastructure for digital production. We have established contact with libraries in the Nordic countries – including the National Library of Norway, the National Library of Finland, the National and University Library of Iceland, and the Royal Danish Library – with which we are working on an infrastructure research initiative. Within the framework of the cooperation between KB and Språkbanken Text/Swe-Clarin, we will continue our experiments and analysis with the goal of improving OCR-production for the Swedish language, and the results will be disseminated through the CLARIN ERIC network.
New research issues that have been generated through the project
Apart from the results discussed above, the project also generated several new questions that would benefit from further research.
1. What role do the strengths and weaknesses of the various OCR programs play during the conversion process?
2. What lexical resources, apart from those used in the project, might help improve the OCR results?
3. Can the OCR correctness be improved by strengthening the relation between the qualitative and the quantitative analysis?
4. Which methods are useful for separating typographical information from non-informative elements in text-based documents with complex layout?
5. Which methods and benchmarks should we use to support the assessment of the accuracy of an OCR-processed text?