2: Metadata and Text Markup
Metadata often falls under a broader process of text markup, whereby additional information is grafted onto the “raw” text of a document. There are different ways in which text can be marked up, but the most common in the archival world is Extensible Markup Language (XML). XML is a standardized set of rules for attaching information to text in order to make it readable by machines; like any language, it has its own syntax and conventions. XML works largely by wrapping chunks of text (words, sentences, paragraphs, etc.) in tags that describe what is between them. Tags can also be nested within one another for greater flexibility. Take the example below, in which two <date> tags sit inside a <caption>, which in turn sits inside a <painting>:
<painting> <caption>This is Raphael's "Foligno" Madonna, painted in <date>1511</date>–<date>1512</date>. </caption> </painting>
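As a rough illustration of what "readable by machines" means here, the sketch below parses the sample record with Python's standard-library xml.etree.ElementTree module. The choice of Python and the variable names (record, dates, caption) are assumptions made for illustration only; archives use many different tools for this.

```python
import xml.etree.ElementTree as ET

# The sample record from the text, stored as a string (illustrative name).
record = (
    '<painting> <caption>This is Raphael\'s "Foligno" Madonna, painted in '
    '<date>1511</date>–<date>1512</date>. </caption> </painting>'
)

# Parse the marked-up text into a tree of nested elements.
root = ET.fromstring(record)

# The outermost tag says what kind of object is being described.
print(root.tag)  # painting

# The nested <date> tags can be collected without interpreting the sentence.
dates = [d.text for d in root.iter("date")]
print(dates)  # ['1511', '1512']

# itertext() drops the tags and recovers the "raw" caption text.
caption = "".join(root.find("caption").itertext()).strip()
print(caption)
```

Because the dates are tagged, a program can find them directly instead of guessing which numbers in the sentence are years. The same caption without markup would give the program nothing to work with except plain characters.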
For markup to be effective across institutions and archives, there need to be standards for how text should be marked up. One of the major groups working to develop and maintain standards for digitized texts is the Text Encoding Initiative (TEI). TEI is a consortium of academic institutions, research groups, and individual scholars from around the world. Among the databases that use TEI standards are the Perseus Project, the Women Writers Project, the Early Americas Digital Archive, and the SWORD Project. The Online Archive of California (OAC) is another database consortium that coordinates metadata standards and provides free public access to detailed descriptions of primary resource collections maintained by more than 150 contributing institutions. By standardizing rules and syntax, initiatives such as TEI and OAC try to ensure that a search works the same way from one system to the next.
It is important to remember that the storage of information is not neutral. As Dan Cohen argues, “Scholars who structure historical documents with markup languages such as XML make choices—often quite good choices, but choices none the less—about which elements of a document are most important.” Many initiatives, such as Google Books, are attempting to automate the markup process in order to enhance millions of digitized sources. Automation leads to its own set of problems, as subtle distinctions can be lost. As Worthey says, “It’s dangerous for a humanities scholar to entrust too much to a programmer or mathematician.” The process of digitizing documents is not entirely a “sausage factory,” however. With a basic understanding of how online archives are created and organized, scholars will have a better sense of what they’re actually looking at and of the quality of their sources. As more and more information goes digital, grasping the structure behind this information will become increasingly critical for scholarship and research.