1: Making Documents Digital
Scanned documents become more usable through a process called “extraction,” in which computer software performs Optical Character Recognition (OCR). The software analyzes the image of the text and attempts to recognize characters and words, which it then stores as a separate layer of text derived from the scanned image. This layer can be thought of as being superimposed on the image: while the human eye sees a picture of the word “Uzbekistan,” the computer uses the invisible text layer to recognize “Uzbekistan” as a series of characters that form a word.
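One way to picture the relationship between the image and its text layer is as a small data structure: the page keeps a reference to the scanned image, plus a list of recognized words and where each one sits on the page. The following is only an illustrative sketch in Python. The names OcrWord, OcrPage, and locate, the file name, and the coordinates are all hypothetical, and real OCR software stores this information in its own formats.

```python
from dataclasses import dataclass, field


@dataclass
class OcrWord:
    """One recognized word and its (hypothetical) position on the image, in pixels."""
    text: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class OcrPage:
    """A scanned image plus the invisible text layer recognized from it."""
    image_path: str
    words: list = field(default_factory=list)

    def text(self):
        # The plain text a computer can search, copy, or index.
        return " ".join(word.text for word in self.words)

    def locate(self, query):
        # Find where a word sits on the image (case-insensitive, exact match).
        return [w for w in self.words if w.text.lower() == query.lower()]


# Made-up example page: the image shows the words, the layer stores them as characters.
page = OcrPage(
    image_path="scan_0001.png",
    words=[
        OcrWord("Republic", 40, 100, 180, 32),
        OcrWord("of", 230, 100, 40, 32),
        OcrWord("Uzbekistan", 290, 100, 220, 32),
    ],
)

print(page.text())
print(page.locate("uzbekistan"))
```

Because each word carries its position, a viewer could highlight a search hit on top of the picture of the page, which is what makes the layer feel superimposed on the image.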
This process is largely automated, although the output can be checked and corrected by a human reader. The result of a fully automated process is commonly known as “dirty OCR” because it often contains typos and errors (for instance, the letter combination “rn” might be transcribed as “m”). The initial OCR text is usually around 70 to 80 percent accurate, so the image and the “dirty OCR” are often reviewed together by a human reader to improve accuracy. It is important to recognize that a search within OCR’d documents is not necessarily comprehensive: if a word was not properly recognized by the process, a computer search will not find it, even though a human reader would recognize the word on the page.
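The search limitation can be shown with a short Python sketch. The sample sentence is made up, with “modern” misread as “modem” in the way described above, and the helper names find_exact and find_fuzzy are hypothetical. Approximate matching with Python's standard difflib module is only one possible way to tolerate OCR errors; it is an assumption of this sketch, not part of the OCR process itself.

```python
import difflib
import re


def find_exact(text, query):
    """Return True if the query appears verbatim (ignoring case) in the text."""
    return query.lower() in text.lower()


def find_fuzzy(text, query, cutoff=0.7):
    """Return words in the text that closely resemble the query.

    The cutoff of 0.7 is an arbitrary choice for this sketch; lower values
    catch more OCR errors but also return more false matches.
    """
    words = re.findall(r"[A-Za-z]+", text.lower())
    return difflib.get_close_matches(query.lower(), words, n=5, cutoff=cutoff)


# Hypothetical "dirty OCR" output: the page image says "modern",
# but the text layer recorded "modem" ("rn" read as "m").
ocr_text = "The modem state of Uzbekistan borders Kazakhstan."

print(find_exact(ocr_text, "modern"))      # exact search misses the word
print(find_fuzzy(ocr_text, "modern"))      # approximate search suggests a candidate
print(find_exact(ocr_text, "Uzbekistan"))  # correctly recognized words are found
```

An approximate match only suggests candidates. A human reader still has to check the image to confirm what the page actually says.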