4: Content-Based Analysis
While stylistics examines how a text was written, other forms of text mining attempt to discover patterns in what was written – the content of a text.
Finding particular words or phrases is one of the oldest forms of computational text analysis. Most humanists perform this kind of analysis all the time – every time they search for a book in an online library catalog or for a word within a database of primary sources. These searches can often be made much more effective by customizing the retrieval process: basic skills such as mastering Boolean search operators or subscribing to a library’s RSS feeds can markedly improve humanistic research.
The most fundamental form of content-based text analysis is counting words: how often do certain words or phrases appear in a text? Google recently released its Ngrams tool, a search interface that allows users to track the relative frequency of phrases (up to five words in length) across hundreds of years and more than 5 million digitized books written in multiple languages.
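The counting behind such a tool is straightforward to sketch. The snippet below is an illustrative sketch only, not Google’s implementation; the toy corpus and year labels are invented. It extracts word sequences of length n (“n-grams”) and reports each phrase’s relative frequency, which is what lets frequencies be compared across corpora of different sizes:

```python
from collections import Counter

def ngram_frequencies(text, n=2):
    """Count n-grams (word sequences of length n) in a text and
    return each one's relative frequency (count / total n-grams)."""
    words = text.lower().split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# A toy corpus keyed by year; a real study would draw on millions of books.
corpus = {
    1800: "the army marched and the army camped",
    1900: "the factory opened and the city grew",
}
for year, text in corpus.items():
    freqs = ngram_frequencies(text, n=2)
    print(year, freqs.get("the army", 0.0))
```

Dividing by the total number of n-grams, rather than reporting raw counts, is what makes a phrase’s trajectory comparable across years in which very different amounts of text were published.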
Topic modeling is another method of digital analysis that allows historians to work with large amounts of textual data. It employs techniques from computational linguistics to identify clusters of words that tend to appear together in similar contexts across a corpus.
Historian Sharon Block, for instance, used topic modeling to analyze early American newspapers. Block performed quantitative analysis of over seventy years of the Pennsylvania Gazette, which included approximately 82,000 articles and advertisements. Block’s topic categories were not preselected before running the topic model. Rather, the computer used the Gazette’s contents to determine topics organically, based on the words that appeared in similar contexts within the text. For example, the program grouped words such as “general,” “officer,” “enemy,” “army,” and “troop” together – clearly a topic relating to the military. These topics could then be tracked across the pages of the Gazette to see which topics received more or less attention in various years.
Block’s method is “bottom-up”: it looks only at where and when certain words tend to appear, disregarding any consideration of the words’ meanings. Articles that contain a certain cluster of words are grouped together, even though the program has no sense of what those words actually mean. The words “army” and “troop” may as well have been “circuit” and “bread” – all the program registers is how they appear in the text.
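This bottom-up logic can be illustrated with a crude co-occurrence sketch. It is not the statistical machinery of a real topic model, and the toy “articles” below are invented; the point is only that words end up grouped by where they appear together, never by what they mean:

```python
from collections import defaultdict
from itertools import combinations

# Invented toy articles; a real corpus like the Pennsylvania Gazette
# holds tens of thousands of documents.
articles = {
    "a1": "general officer army troop enemy",
    "a2": "army troop general enemy march",
    "a3": "wheat bread market price flour",
    "a4": "market price bread wheat trade",
}

# Count how often each pair of distinct words shares an article.
cooccur = defaultdict(int)
for text in articles.values():
    for w1, w2 in combinations(sorted(set(text.split())), 2):
        cooccur[(w1, w2)] += 1

# Greedily merge words whose pairs co-occur in at least two articles:
# a crude stand-in for the statistical clustering a real topic model does.
groups = []
for (w1, w2), n in cooccur.items():
    if n < 2:
        continue
    for g in groups:
        if w1 in g or w2 in g:
            g.update((w1, w2))
            break
    else:
        groups.append({w1, w2})

print(groups)  # military words cluster apart from marketplace words
```

Nothing in the program knows that one cluster is “military” and the other “commercial”; it is the historian who supplies that interpretation after the fact, just as Block labeled her computed topics.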
In contrast, a “top-down” mode of text analysis is one in which the scholar supplies more rigid categories or search terms. For instance, a researcher might select the words “city,” “alley,” “factory,” and “tenement” to define an “urban” topic. Once the terms are chosen, certain programs can not only retrieve results but also show where those words occur in greater concentration.
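A top-down pass of this kind amounts to counting hits against a researcher-supplied term list. The sketch below is illustrative only; the documents and their names are invented. The density of “urban” terms shows which documents concentrate on the chosen topic:

```python
# The researcher, not the program, defines the category.
urban_terms = {"city", "alley", "factory", "tenement"}

documents = {
    "editorial_1850": "the city grew and each alley filled with workers",
    "farm_report": "the wheat harvest was strong this year",
    "tenement_piece": "a factory beside the tenement darkened the alley",
}

def term_density(text, terms):
    """Fraction of a document's words that belong to the chosen category."""
    words = text.lower().split()
    hits = sum(1 for w in words if w in terms)
    return hits / len(words)

for name, text in documents.items():
    print(f"{name}: {term_density(text, urban_terms):.2f}")
```

Here the categories are fixed in advance, so the program can only confirm or refute the researcher’s framing – the mirror image of the bottom-up approach, where the categories themselves emerge from the text.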
Topic modeling allows historians to measure topic trends for a particular source. Was there a Great Awakening? What were its temporal and regional dimensions? Text analysis of various newspapers could potentially measure the frequency of “religious” topics across time and place and offer evidence to help answer such questions.