A Day In the Life

A foray into text encoding by Daniel Paul Rivera


Historical financial records (HFRs [pronounced like “heifers”] for the abbrev-obsessed) constitute a large portion of archives in libraries around the world. Yet, plentiful as these documents are, many historians decline to parse and transcribe them, owing to a confluence of factors that includes an aversion to arduous work, economic illiteracy, and misgivings about the documents’ ultimate utility. I find this state of affairs in scholarship unfortunate. There is a wealth of information latent in numbers recorded in ledgers and on receipts: information about what a person values and where her priorities lie. Indeed, embedded in the most detailed and extensive of HFRs are data about the desires and mores of entire peoples living in particular locations during specific eras. It is just this type of data I seek to collate and encode from the HFR that is Edward Hitchcock’s day book, which contains transactional entries that span decades of his life. But first…

What is encoding?

In all texts, from primary source documents like account books to the latest John Green novel you should be embarrassed to admit having read, there is a wealth of information. This information is more nuanced than surface-level “readings” of words in sentences. When you run a search on a web page, for example, you are limited to matching exact strings of characters. Consider a basic text search for “bloviating ignoramus” in an online Washington Post article about Donald Trump. Such a search yields 3 results, and the search function in your browser treats each appearance of the donnish insult identically to the others.
Were the article a much larger corpus and “bloviating ignoramus” a more popular term, searching in like manner for a specific instance would yield results that contain a great deal of noise. (A search for the name Hamlet in Shakespeare’s eponymous play illustrates this point.) There are, however, other markers beyond the characters themselves that, if processed, could help winnow out the irrelevant results: markers such as source, speaker, and style, among other less explicit forms. It is this deeper-level information that encoding (or markup or tagging, all synonymous terms) captures.
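
To make the contrast concrete, here is a minimal sketch in Python, using only the standard library. The marked-up fragment is my own assumption for illustration: the `sp` (speech) and `l` (line) element names and the `who` attribute loosely follow TEI drama conventions, and the three lines are excerpted from Hamlet rather than taken from any published encoding of the play.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each <sp> (speech) records its speaker in a "who" attribute.
SCENE = """
<text>
  <sp who="King"><l>But now, my cousin Hamlet, and my son</l></sp>
  <sp who="Queen"><l>Good Hamlet, cast thy nighted colour off</l></sp>
  <sp who="Hamlet"><l>This is I, Hamlet the Dane.</l></sp>
</text>
"""


def plain_search(xml_text, term):
    """Return every line containing the term, ignoring who speaks it."""
    root = ET.fromstring(xml_text)
    return [line.text for line in root.iter("l") if term in line.text]


def search_by_speaker(xml_text, term, speaker):
    """Return only the matching lines spoken by the given speaker."""
    root = ET.fromstring(xml_text)
    hits = []
    for speech in root.iter("sp"):
        if speech.get("who") == speaker:
            hits.extend(line.text for line in speech.iter("l") if term in line.text)
    return hits


# The character-level search cannot tell the three speakers apart.
print(len(plain_search(SCENE, "Hamlet")))
# The markup-aware search keeps only the line Hamlet himself speaks.
print(search_by_speaker(SCENE, "Hamlet", "Hamlet"))
```

The first search returns all three lines; the second returns only the one in which Hamlet names himself. The winnowing comes entirely from the speaker information the markup adds, not from the characters of the text.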

Laurence Olivier as Hamlet


Before they can be studied with the aid of machines, texts must be encoded in a machine-readable form. Methods for this transcription are called, generically, “text encoding schemes”; such schemes must provide mechanisms for representing the characters of the text and its logical and physical structure … ancillary information achieved by analysis or interpretation [may be also added] …

Michael Sperberg-McQueen, Text Encoding and Enrichment. In The Humanities Computing Yearbook 1989–90, ed. Ian Lancashire (Oxford: Oxford University Press, 1991)

Fossil Poetry

As the difficult-to-miss blurb floating on the marsala background just above this section states, texts must be encoded in computer-speak before they can actually be used by computers, not to mention by other humans who, likely enough, conduct digital humanities-ish research. This process of transforming (alchemizing?) a text into a machine-readable format is called markup, a term that presumably derives from an editor’s method of literally marking up a text while proofreading it for errors and preparing it for eventual publication. According to the online OED, “[m]arkup may be used both to format the appearance of a text and to facilitate searching and other operations.” Indeed, markup has superficial and complex uses, both of which may work in unison to produce a user-navigable construction of texts analyzed through encoding.
The purpose of encoding within a text is to provide information which will assist a computer program to perform functions on that text (Hockey 2000).

There are a number of different ways to mark up a text. A systematized conglomerate of textual markups is called a markup language. The language you are most familiar with, in all likelihood, is HTML (the “ML” standing for markup language)—if not intimately as a web designer might be, then at least en passant, as a consumer of Internet content like this very web page, which is the consequence of a text composed of these words having been marked up in HTML to look as they do to you now.


Aside from HTML, there are other markup languages used for other ends. In addition to these languages, there are metalanguages: languages used to define other languages. XML is a metalanguage, and the one most pertinent to my project.

<en><cod>N08/PL</cod><hw>Ruby</hw><def>An interpreted <xr>object-oriented programming language</xr> that includes features of both <xr>imperative</xr> and <xr>functional languages</xr>. Created by Yukihiro Matsumoto and released in 1995, it was influenced by such languages as <xr>Perl</xr>, <xr>Smalltalk</xr>, <xr>Eiffel</xr>, <xr>Ada</xr>, and <xr>Lisp</xr>. Its combination of elegance and apparent simplicity with underlying power has gained Ruby an steadily increasing popularity, especially through the <he>Ruby On Rails</he> web application <xr>framework</xr>.</def></en>

—a “simple” example of XML data from A Dictionary of Computing
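
For the curious, here is a rough sketch of what a program can do with such an entry once it is marked up. The entry is abridged from the example above, and my reading of the tag names (`hw` as headword, `def` as definition, `xr` as cross-reference) is an assumption, since the dictionary's own schema is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the dictionary entry; only the first sentence of <def> is kept.
ENTRY = (
    "<en><cod>N08/PL</cod><hw>Ruby</hw><def>An interpreted "
    "<xr>object-oriented programming language</xr> that includes features of both "
    "<xr>imperative</xr> and <xr>functional languages</xr>.</def></en>"
)

entry = ET.fromstring(ENTRY)

# The headword is the text of the <hw> child (assumed meaning: "headword").
headword = entry.findtext("hw")

# Every <xr> marks a term that (presumably) has its own entry elsewhere.
cross_refs = [xr.text for xr in entry.iter("xr")]

# itertext() walks the mixed content of <def>, yielding the prose without the tags.
definition = "".join(entry.find("def").itertext())

print(headword)
print(cross_refs)
print(definition)
```

None of this is possible with a plain string of characters: the program can list the cross-referenced terms only because someone took the trouble to tag them.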

The Dress of [Digital] Thought

Extensible Markup Language (XML) descends from Standard Generalized Markup Language (SGML). Their difference, remarkable as it is despite their shared components, is here de minimis. All I wish to impart is that XML is the metalanguage numerous developers use to define their own idiosyncratic markup languages, aiming to structure content in a specific format, primarily for the Internet. “It is used in many fields, academic and commercial, for documents, data files, configuration information, temporary and long-term storage, for transmitting information locally or remotely, for almost everything, including word processing documents, installation scripts, news articles, product information, and web pages” (Cummings 454). One organization that uses XML is the Text Encoding Initiative (TEI), a consortium that works to determine and maintain detailed guidelines for textual markup.

After eventually deciding that I wanted to “do” something with the HFRs in the [mostly] digitized Edward and Orra White Hitchcock Papers Collection, I searched the internet for other digital humanities projects that feature HFRs and discovered an article in the Journal of Digital Humanities about the NEH-funded project Encoding Historical Financial Records. Understanding the potential of HFRs for substantial use in academic research, Kathryn Tomasek, Associate Professor of History at Wheaton College, set out to continue a conversation that began in 2011 on standardizing a system, based on the TEI’s guidelines, for marking up HFRs like accounting documents. Many of these documents conform loosely to the system of double-entry bookkeeping that the famed Medici family of Italy famously employed, which is strikingly the same system (in essence) accountants adhere to today.

This system of accounting does not easily translate into a metalanguage like XML, for reasons Professor Tomasek details in her published article. Suffice it to say, the TEI fails to capture the complexity of some of the most common documents found in the archives of the world’s libraries. This is unsurprising, however, considering that the initial goal of the TEI was to create guidelines for more conventional archival texts (think correspondence, diaries, poetry, etc.). Regardless, one main advantage of XML is the X of it: extensibility. Unlike HTML, XML does not have a fixed set of tags (the markers that divide content into its individual, logical, structural components). So, while documentation on standardizing markup for HFRs using TEI is sparse (to put it diplomatically), it is by no means fixed. It can be enlarged and explicated by those researchers who wish to make knowledge of the data latent in HFRs. The challenge then becomes a matter of translation and agreement: of turning debits and credits; commodities; borrowing and lending agents; units, prices, and unit prices; and the rest of the elements of balance sheets, income statements, and statements of cash flow compiled from account books, day books, and ledgers (simply, of untying the Gordian knot that is recorded financial transactions) into a manageable, machine-readable (that is, XML-friendly) form that a body of TEI-associated individuals with a shared intent can work out and settle on.

A casual Medici get-together

A definition is the enclosing a wilderness of idea within a wall of words.
Samuel Butler

After defining the terms of my subject like the well-behaved philosophy major that I am, I can now begin writing about and showing you what my process looked like. It was simple, really, transmogrifying some of this:

An account book detailing the daily expenses and income of the Hitchcock household between 1828 and 1864. Some of the entries are in a hand other than Hitchcock's, possibly that of his wife Orra. Entries for 1864 are in Edward Hitchcock Jr.'s hand.

into this:

sample markup


in order to perhaps look like this:



To break it down for you, dear reader, I first examined Edward Hitchcock’s day book line by line and character by character, transcribing what I could make out (calligraphy, unfortunately, was not one of Hitchcock’s many talents) into a Microsoft Excel spreadsheet. I organized my sheet into three columns, which contained, respectively, the date a transaction took place, qualitative information about the transaction, and the amount of said transaction. My organization corresponds with Hitchcock’s own. This was the data collection part of my project.

After reaching my limit of straining my eyes and furrowing my brow, I proceeded to put to use what I had recorded by working in the XML editor oXygen, which just happens to be “the best XML editor available” (according to its website). As XML requires, I began breaking down the transcription into its logical components. That means I first had to define what I was working with (a table) and what that itself consists of (rows made up of cells). Even further, I defined “roles” for the cells, which differed in their nature (data vs. label, for example). This process took place for each integrant of my transcription, and will be detailed in a soon-to-be-published page. It is enough to say that I marked up everything in detail, including the exact characters Hitchcock crossed out to signify that a debt had been paid, as well as information I supplied whenever Hitchcock’s hand was illegible or he had simply omitted, say, the word “pounds” after 6.5. This was surprisingly enjoyable, in much the same way solving a jigsaw puzzle is.
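
For readers who would rather see than be told, the following is a simplified sketch of that breakdown in Python, not a record of my actual oXygen workflow. The rows are invented for illustration and are not transcribed from the day book. The element names (`table`, `row`, `cell`, `supplied`) and the `role` values (`label`, `data`) follow my understanding of the TEI guidelines, and the TEI namespace and header are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical spreadsheet rows (date, entry, amount); invented for illustration,
# not transcribed from Hitchcock's day book.
ROWS = [
    ("1830-03-02", "Flour", "1.25"),
    ("1830-03-04", "Butter 6.5", "0.81"),
]


def build_table(rows):
    """Turn (date, entry, amount) tuples into a TEI-style <table> element."""
    table = ET.Element("table")
    # Header row: every cell is a label rather than data.
    header = ET.SubElement(table, "row", role="label")
    for name in ("Date", "Entry", "Amount"):
        ET.SubElement(header, "cell", role="label").text = name
    # One data row per transaction, one cell per spreadsheet column.
    for row_values in rows:
        row = ET.SubElement(table, "row", role="data")
        for value in row_values:
            ET.SubElement(row, "cell", role="data").text = value
    return table


def supply(cell, word):
    """Record an editorial addition by appending `word` inside <supplied>."""
    cell.text = (cell.text or "") + " "
    ET.SubElement(cell, "supplied").text = word


table = build_table(ROWS)

# The second transaction omits the unit after 6.5, so the editor supplies it.
butter_cell = table.findall("row")[2].findall("cell")[1]
supply(butter_cell, "pounds")

print(ET.tostring(table, encoding="unicode"))
```

The point of wrapping “pounds” in `supplied` is that the editor's guess stays visibly separate from what Hitchcock actually wrote. A crossed-out entry could be wrapped in TEI's `del` element in the same spirit.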

There will be more to come; this story needs an ending…


Cummings, James. “The Text Encoding Initiative and the Study of Literature.” A Companion to Digital Literary Studies. Oxford: Blackwell, 2008. 451-476.

Gold, Matthew K. Debates in the Digital Humanities. University of Minnesota Press, 2012.

Hockey, Susan M. Electronic Texts in the Humanities: Principles and Practice. Oxford University Press, 2000.

Schreibman, Susan, Ray Siemens, and John Unsworth, eds. A Companion to Digital Humanities. Oxford: Blackwell, 2004. http://www.digitalhumanities.org/companion/