1: Making Documents Digital
Documents become more usable through a process called “extraction,” in which computer software performs Optical Character Recognition (OCR). This process scans the image of text and attempts to recognize characters and words, which it then stores as a separate layer of text based on the scanned image. This layer can be thought of as being superimposed on the image – while the human eye understands a picture of the word “Uzbekistan,” the computer uses the translated, invisible layer to recognize “Uzbekistan” as a series of characters that form a word.
This process is largely automated, although the output can be checked and corrected by a human reader. The output of a fully automated process is commonly known as “dirty OCR” because it often contains many typos and errors (the letter combination “rn” might be transcribed as “m,” for instance). The initial OCR text is usually around 70 to 80 percent accurate and often needs human review, in which a reader compares the image against the “dirty OCR” to improve its accuracy. It is important to recognize that a search within OCR’d documents is not necessarily comprehensive: if a word was not properly recognized by the process, a computer search will not find it, even though a human reader would recognize the word in the image.
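The search problem described above can be illustrated with a minimal sketch. The sample sentence, the function names, and the 0.7 similarity cutoff are all assumptions made for illustration; the sketch uses approximate string matching from Python's standard library, which is only one possible way to work around OCR errors, not a description of how any particular archive's search works.

```python
import difflib
import re


def words_in(ocr_text):
    """Split OCR text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z]+", ocr_text.lower())


def exact_search(term, ocr_text):
    """Return True only if the term appears exactly in the OCR text layer."""
    return term.lower() in words_in(ocr_text)


def fuzzy_search(term, ocr_text, cutoff=0.7):
    """Return words in the OCR layer that are similar to the term.

    The cutoff is a hypothetical threshold; a lower value finds more
    misrecognized words but also more false matches.
    """
    return difflib.get_close_matches(
        term.lower(), words_in(ocr_text), n=3, cutoff=cutoff
    )


# Hypothetical "dirty OCR" output: the image says "modern",
# but "rn" was transcribed as "m".
dirty_ocr = "The modem state of Uzbekistan"

print(exact_search("modern", dirty_ocr))  # the exact search misses the word
print(fuzzy_search("modern", dirty_ocr))  # the fuzzy search surfaces "modem"
```

A human reader looking at the page image would see “modern” immediately; the exact search cannot, because it only sees the text layer.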
April 8, 2011: Digitization & Archives
Host: Glen Worthey
I really liked Glen Worthey’s seminar. What interested me most was the enormous potential we have to address questions in the humanities using hundreds of thousands of digitized books, shifting from working with a single well-defined object, “a book,” to a collection of objects, “a corpus of books.” A corpus also lets us quantify the usage of words: we can ask about the relevance of certain words, word patterns, and word uses across different periods of time or languages. This is an invaluable tool that reveals “writing tendencies” at different times in human history. As noted during the seminar, Google’s Ngram Viewer provides this service using millions of digitized books in seven languages (and several categories of English: American, British, Fiction, etc.). Google’s Ngram Viewer is the first giant step towards using digitized archives to investigate the use of words throughout human history.
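To make the idea of quantifying word usage across a corpus concrete, here is a small sketch of the kind of counting involved. The tiny corpus, the function names, and the choice to report relative frequency per year are assumptions for illustration only; this is not how Google’s Ngram Viewer is implemented.

```python
import re
from collections import Counter


def ngrams(text, n=1):
    """Return all n-word sequences (n-grams) in a text, lowercased."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(corpus_by_year, phrase):
    """For each year, return the share of n-grams that match the phrase."""
    n = len(phrase.split())
    result = {}
    for year, texts in sorted(corpus_by_year.items()):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text, n))
        total = sum(counts.values())
        result[year] = counts[phrase.lower()] / total if total else 0.0
    return result


# Hypothetical miniature corpus: a few "books" grouped by publication year.
corpus = {
    1900: ["The telegraph office was open", "A telegraph arrived"],
    2000: ["The email arrived", "Send an email by email"],
}

for word in ("telegraph", "email"):
    print(word, relative_frequency(corpus, word))
```

With hundreds of thousands of books instead of four sentences, the same per-year ratios become the curves that show a word rising or falling in use over time.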