Tooling Up for Digital Humanities

  • Home
  • Workshop Series
  • About
  • Virtual You
    • 1: Virtual You
    • 2: Keeping a Finger on the Pulse
    • 3: Building Community
    • 4: Further Reading
    • 5: Discussion
  • Digitization
    • 1: Making Documents Digital
    • 2: Metadata and Text Markup
    • 3: Further Reading
    • 4: Discussion
  • Text Analysis
    • 1: The Text Deluge
    • 2: A Brief History
    • 3: Stylometry
    • 4: Content-Based Analysis
    • 5: Metadata Analysis
    • 6: Conclusion
    • 7: Further Reading
    • 8: Discussion
  • Spatial Analysis
    • 1: The Spatial Turn
    • 2: Spatial History Lab
    • 3: Geographic Information Systems
    • 4: Further Reading
    • 5: Discussion
  • Databases
    • 1: The Basics
    • 2: Managing Your Bibliography
    • 3: Cloud Computing
    • 4: Organizing Images
    • 5: Further Reading
    • 6: Discussion
  • Pedagogy
    • 1: In the Classroom
    • 2: Student Collaboration
    • 3: Debating Pedagogical Efficacy
    • 4: Further Reading
    • 5: Discussion
  • Data Visualization
    • 1: Introduction
    • 2: Getting Started
    • 3: For Analysis and Understanding
    • 4: For Communication and Storytelling
    • 5: Visualizations and Accountability
    • 6: Recommended Reading/Viewing
    • 7: Discussion
  • Discussion

2: Metadata and Text Markup

Metadata comprises the “data about data” – information associated with archival material that lists key attributes, such as its author, date, publisher, or general subject. Metadata is far from new – the back of a book’s title page listing publication information is a straightforward example of metadata in non-digital form. Attaching this “data about data” to archival material is one of the most crucial steps in making that material findable and, consequently, usable by humanities scholars.

Metadata often falls under the broader process of text markup, whereby additional information is grafted onto the “raw” text of a document. Text can be marked up in different ways, but the most common in the archival world is the Extensible Markup Language (XML). XML is a standardized set of rules for attaching information to text in order to make it machine-readable; like any language, it has its own syntax and conventions. XML works largely by wrapping chunks of text (words, sentences, paragraphs, etc.) in tags that describe the content between them. Tags can also be nested within one another for greater flexibility. Consider the example below:

<painting>
 <caption>This is Raphael's "Foligno" Madonna, painted in
 <date>1511</date>–<date>1512</date>.
 </caption>
</painting>

Everything between the “painting” opening tag (<painting>) and the “painting” closing tag (</painting>) has to do with, unsurprisingly, a painting. The <caption> opening and closing tags mark the enclosed text as a caption, and the <date> tags specify which words in the caption refer to dates.
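To see what “readable by machines” means in practice, the following Python sketch parses the sample record above with the standard library’s xml.etree.ElementTree module and retrieves exactly the text wrapped in <date> tags; it is a minimal illustration of tag-based retrieval, not a full archival workflow.

```python
import xml.etree.ElementTree as ET

# The sample record from above, stored as a string.
xml_text = """
<painting>
  <caption>This is Raphael's "Foligno" Madonna, painted in
  <date>1511</date>–<date>1512</date>.
  </caption>
</painting>
"""

root = ET.fromstring(xml_text)

# Because the years are tagged, a program can retrieve them directly
# instead of guessing which numbers in the caption are dates.
dates = [d.text for d in root.iter("date")]
print(dates)  # ['1511', '1512']
```

The same approach scales up: a search engine indexing thousands of marked-up documents can pull every <date> (or <author>, or <title>) without any guesswork about what the surrounding prose means.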

An archive with well-structured markup allows users to search for specific terms (author, title, subject) within and across many different documents. When you perform a search in an archival database, it might return results based on the metadata contained in the header of each text. In the Early Americas Digital Archive, for example, users can search by genre (prose, poetry, drama), format (chronicle, diary, etc.), mode (satire, pastoral, etc.), historical period (in 50-year intervals), and geographic location (New England, New Spain, Virginia, etc.), among other criteria. A user can search for “Georgic” “Poetry” about the “Caribbean” published “1750-1800.” These searches are built on the framework of metadata and text markup. The plain text of a document is inert: without marking it up, a computer would not be able to locate and retrieve this additional information.
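As a sketch of how such faceted searching works underneath, the following Python snippet filters a small set of catalog records by their metadata fields. The records and field names are invented for illustration, loosely modeled on the Early Americas Digital Archive facets described above.

```python
# Hypothetical catalog records; the facet names (genre, mode, region,
# period) are modeled on the archive's search options, but these
# particular records are invented for this example.
records = [
    {"title": "Poem A", "genre": "Poetry", "mode": "Georgic",
     "region": "Caribbean", "period": "1750-1800"},
    {"title": "Diary B", "genre": "Prose", "mode": "Pastoral",
     "region": "New England", "period": "1700-1750"},
]

def search(records, **facets):
    """Return records whose metadata matches every requested facet."""
    return [r for r in records
            if all(r.get(key) == value for key, value in facets.items())]

hits = search(records, genre="Poetry", mode="Georgic",
              region="Caribbean", period="1750-1800")
print([r["title"] for r in hits])  # ['Poem A']
```

The search never inspects the documents’ full text at all: it matches only against the metadata attached to each record, which is why careful markup is what makes material findable.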
Metadata and text markup have traditionally been generated by human labor, in the form of decisions made by archivists about how to categorize and describe a source. These decisions are often more complex than they first appear: for instance, should Charlotte Brontë’s Jane Eyre be classified as a Bildungsroman or a late Gothic novel? Older items will often come with existing information and classifications, but these still require modification. Glen Worthey, head of Stanford’s Humanities Digital Information Service, reminds us that “We interpret and map older forms of data into newer forms, but we not only need to map the old data but also put it into a common form so that the information works in a database.”

More specifically, in order for markup to be effective across institutions and archives, there need to be standards for how text is marked up. One of the major groups developing and maintaining such standards for digitized texts is the Text Encoding Initiative (TEI), a consortium of academic institutions, research groups, and individual scholars from around the world. Among the databases that use TEI standards are the Perseus Project, the Women Writers Project, the Early Americas Digital Archive, and the SWORD Project. The Online Archive of California (OAC) is another database consortium that coordinates metadata standards and provides free public access to detailed descriptions of primary resource collections maintained by more than 150 contributing institutions. By standardizing rules and syntax, initiatives such as TEI and OAC help ensure that searches are interoperable from one system to the next.
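For a sense of what such a standard looks like, here is a simplified sketch of a TEI document header. The element names (teiHeader, fileDesc, titleStmt, publicationStmt, sourceDesc) follow the TEI Guidelines, though a real header would carry far more detail, and the publication statement here is an invented placeholder.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Jane Eyre</title>
      <author>Charlotte Brontë</author>
    </titleStmt>
    <publicationStmt>
      <p>Hypothetical electronic edition, for illustration only.</p>
    </publicationStmt>
    <sourceDesc>
      <p>First edition, London: Smith, Elder &amp; Co., 1847.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>
```

Because every TEI-encoded text carries a header of this shape, a search system can look in the same place for the title, author, and source description of any document in any TEI-conformant archive.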

It is important to remember that the storage of information is not neutral. As Dan Cohen argues, “Scholars who structure historical documents with markup languages such as XML make choices—often quite good choices, but choices none the less—about which elements of a document are most important.” Initiatives such as Google Books are attempting to automate the markup process in order to enhance millions of digitized sources. This automation brings its own set of problems, as subtle distinctions can be lost. As Worthey says, “It’s dangerous for a humanities scholar to entrust too much to a programmer or mathematician.” The process of digitizing documents is not entirely a “sausage factory,” however. With a basic understanding of how online archives are created and organized, scholars will have a better sense of what they are actually looking at and of the quality of their sources. As more and more information goes digital, grasping the structure behind it will become increasingly critical for scholarship and research.

