Tooling Up for Digital Humanities

1: Making Documents Digital

As archives increasingly go digital, what do humanities scholars need to know in order to optimize their use of these new resources?

First and foremost, the format of those resources matters. A digital archive represents the work and decisions of many librarians and archivists, and understanding those decisions and the forms they take can help researchers access and manipulate data more effectively. Some kinds of digital analysis, for instance, can only be performed on data stored in particular formats. Although scholars need not understand the digital archiving process in great depth, they should have a basic grasp of how a particular online archive is structured.

The first step in the archival process of transforming a source from an analog (non-digital) format to an electronic one is usually scanning the document to create a digitized image of the text (a PDF, for instance). This initial imprint is like taking a photograph, even if the subject is a page of text. A great deal of archival material is available in this “images of words” format. Although useful for someone who wants to read the document, the image alone does little to help researchers find, access, or manipulate that text.

Documents become more usable through a process called “extraction,” in which computer software performs Optical Character Recognition (OCR). The software analyzes the image of the text and attempts to recognize characters and words, which it then stores as a separate text layer tied to the scanned image. This layer can be thought of as superimposed on the image – while the human eye understands a picture of the word “Uzbekistan,” the computer uses the invisible, translated layer to recognize “Uzbekistan” as a series of characters that form a word.
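
To make the extraction step concrete, the sketch below shows what a single OCR pass might look like in Python, using the open-source Tesseract engine through the pytesseract wrapper. The library choice and filename are illustrative assumptions; an archive’s actual pipeline may rely on entirely different software.

    from PIL import Image        # Pillow, for loading the scanned page image
    import pytesseract           # wrapper around the Tesseract OCR engine

    # Load the digitized "photograph" of the page (hypothetical filename).
    page_image = Image.open("scanned_page.png")

    # Ask the OCR engine to recognize characters and words in the image and
    # return them as plain text: the machine-readable layer behind the image.
    recognized_text = pytesseract.image_to_string(page_image)

    # Inspect the beginning of the (possibly "dirty") OCR output.
    print(recognized_text[:500])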

This process is largely automated, although the software’s output can be checked and supplemented by a human reader. The output of a fully automated pass is commonly known as “dirty OCR” because it often contains typos and errors (letter combinations like “rn” might be transcribed as “m,” for instance). The initial OCR transcription is usually only around 70 or 80 percent accurate and often needs human review. It is therefore important to recognize that searches within OCR’d documents are not necessarily comprehensive: if a word was not properly recognized by the process, a computer search will not find it, even though a human reader would recognize it correctly on the page image. From here, the image and the “dirty OCR” can be reviewed by a human reader to improve the accuracy of the transcription.
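
To see why searches over dirty OCR are not comprehensive, the short sketch below contrasts an exact search with an approximate one built on Python’s standard-library difflib module; the sample words and the “rn” misread as “m” are invented for illustration.

    import difflib

    # A few words as a dirty OCR layer might store them; note "rn" misread as "m".
    dirty_ocr_words = ["Uzbekistan", "govemment", "archive", "Stanford"]

    query = "government"

    # An exact search misses the word because the OCR layer stored "govemment".
    exact_hits = [w for w in dirty_ocr_words if w.lower() == query.lower()]
    print(exact_hits)    # -> []

    # Approximate matching tolerates small recognition errors and recovers it.
    fuzzy_hits = difflib.get_close_matches(query, dirty_ocr_words, n=3, cutoff=0.8)
    print(fuzzy_hits)    # -> ['govemment']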

Let’s dive right in! Next: 2: Metadata and Text Markup
Comments
  • Cuauhtémoc García-García:

    April 8, 2011: Digitization & Archives
    Host: Glen Worthey
    I really liked Glen Worthey’s seminar. What interested me the most from his talk was the enormous potential we have to address questions in the humanities using hundreds of thousands of digitized books, shifting from working with a well-defined object, “a book,” to a collection of objects, “a corpus of books.” In addition, it allows us to quantify the usage of words: using a corpus of books, we can ask about the relevance of certain words, word patterns, and word uses across different periods of time or languages. This is an invaluable tool that reveals “writing tendencies” at different moments in human history. As mentioned during the seminar, Google’s Ngram Viewer provides this service using millions of digitized books in seven languages (and several categories of English: American, British, Fiction, etc.). Google’s Ngram Viewer is the first giant step toward using digitized archives to investigate the use of words in human history.

    June 3, 2011 at 6:04 pm
