Plotting paratexts 01: extracting dates

Before I move on to the main subject of this blog, and lose those of you who do not share our Paratextomania, here is something you might find useful if you wish to analyze the TCP corpora. As historians, one of the first things we want to know about any piece of information, be it the use of a word, a text or a trend, is its date. For the Early Modern TCP annotated texts, this means extracting the date from the <DATE> tag in the header. The path to its location in the header is:

‘FILEDESC/SOURCEDESC/BIBLFULL/PUBLICATIONSTMT/DATE’ (not to be confused with the <DATE> of the transcription, which sits in the file description).
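For readers who want to try this themselves, here is a minimal sketch of reading that path with Python's standard library. It is not the code from my scripts: the HEADER wrapper element, the sample content and the function name source_date are assumptions made for illustration, and real TCP files may need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# The path quoted above, searched for anywhere below the root element.
DATE_PATH = ".//FILEDESC/SOURCEDESC/BIBLFULL/PUBLICATIONSTMT/DATE"

# Invented miniature header, for illustration only (not a real TCP record).
SAMPLE = """<HEADER>
  <FILEDESC>
    <PUBLICATIONSTMT><DATE>2003-01</DATE></PUBLICATIONSTMT>
    <SOURCEDESC>
      <BIBLFULL>
        <PUBLICATIONSTMT><DATE>1642?</DATE></PUBLICATIONSTMT>
      </BIBLFULL>
    </SOURCEDESC>
  </FILEDESC>
</HEADER>"""


def source_date(xml_text):
    """Return the raw text of the source publication date, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find(DATE_PATH)
    if node is None or node.text is None:
        return None
    return node.text.strip()


# Prints the date of the source, not the transcription date that sits
# directly under FILEDESC/PUBLICATIONSTMT.
print(source_date(SAMPLE))
```

Because the full path is spelled out, the transcription's <DATE> is skipped even though it appears first in the header.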

In this GitHub folder you will find the scripts I used to extract the information, along with their results. One word of caution to those of you with programming knowledge and experience: you may be appalled by what you find in my GitHub. I am not a programmer, and my scripts are a patchwork of my own struggles with the code and some help from my friends (thank you Shay Rojanski and Harel Cain!), who bear no responsibility for any of my mistakes or inelegances. I will be most grateful for any corrections and improvements, and even more so if you allow me to share them!

Dateinfo.py parses the files and calls two functions from another script, Get_date_and_decade.py. The getyear function partly cleans the dates of text and punctuation marks, and recognizes cases where the date is given as a range, with alternatives, with a question mark, or is unclear. The results are given in two text files (TSV) and an Excel sheet, ordered by:

file number / title / year as it appears in the text / year extracted / decade. For example:

[Image: Screen Shot 2015-01-01 at 10.22.29 AM]
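The kind of cleaning and classification described above can be sketched roughly as follows. This is a minimal illustration and not the actual getyear function from the repository: the names classify_date and decade, the category labels, the treatment of "ca." as questionable, and the choice to report the first year of a range or of a set of alternatives are all assumptions.

```python
import re

# A four-digit year that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")
# Ranges such as 1701-05 or 1701-1705.
RANGE = re.compile(r"(?<!\d)\d{4}\s*-\s*\d{2,4}(?!\d)")
# Markers of doubt: a question mark, "circa" or "ca." (an assumed list).
QUESTION = re.compile(r"\?|\bcirca\b|\bca\.", re.IGNORECASE)


def classify_date(raw):
    """Return (year, kind) for a raw date string.

    kind is 'exact', 'range', 'alternatives', 'questionable' or 'unclear';
    year is the first four-digit year found, or None if there is none.
    """
    text = raw.strip()
    years = YEAR.findall(text)
    if not years:
        return None, "unclear"
    year = int(years[0])
    if QUESTION.search(text):
        return year, "questionable"
    if RANGE.search(text):
        return year, "range"
    if len(years) > 1:
        return year, "alternatives"
    # A single year, possibly surrounded by words or punctuation.
    return year, "exact"


def decade(year):
    """Round a year down to the start of its decade."""
    return year - year % 10


for raw in ["1642.", "1701-05", "1641 or 1642", "1642?", "16--"]:
    year, kind = classify_date(raw)
    print(raw, year, kind, decade(year) if year is not None else None)
```

The order of the checks matters: a string such as "1701-1705" contains two years, so the range test has to run before the test for alternatives.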

In fact, if one is only interested in the dates of certain titles, or even in the distribution of the corpus’ texts over time, there is no need to go to this trouble: the metadata on all TCP texts has been made available on the TCP GitHub (thank you Sebastian Rahtz, James Cummings and Joe Wicentowski!). EEBO-TCP (including phase 2) was also nicely visualized on the Early Modern Print text mining platform. Gleaning years and decades from the texts is useful mainly if there are other variables in the corpus that we wish to date, and that are neither in the metadata nor extractable from the words of the raw texts (in which case the various interfaces for EEBO-TCP may be of use).

Another reason to take our information from the TCP files is if we wish to see the dates as they appear in the annotated texts, unnormalized. As we can see below, more than 10% of the texts have dates given as a range, marked as questionable, or offered with possible alternatives:

[Image: Screen Shot 2015-01-02 at 8.04.19 PM]

The difference between the corpora is minor: a larger percentage of the 18th-century books were given a date range, in the form dddd-dd. This mainly serves to remind us that differences between corpora should always be treated with suspicion, as they might merely reflect changes in annotation policies, or habits.

[Image: Screen Shot 2015-01-02 at 8.08.36 PM]

[Image: Screen Shot 2015-01-01 at 10.15.38 AM]

What do we do with more than a tenth of the files, which cannot be dated exactly and with certainty? Some may choose to handle them differently, normalize them, or use the normalized dates in the metadata tables. I chose to take them out of my corpus and proceed only with the 90% of the corpus that has exact, certain dates.
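As a rough sketch of that last choice, the snippet below keeps only the rows of a TSV whose raw date is a single four-digit year, and counts them per decade. The column names, the sample rows and the rule used for "exact" are invented for illustration; they are not taken from my actual result files.

```python
import csv
import io
import re
from collections import Counter

# Invented rows in the column order described above; not real TCP records.
SAMPLE_TSV = (
    "file\ttitle\traw_date\tyear\tdecade\n"
    "X0001\tTitle A\t1642.\t1642\t1640\n"
    "X0002\tTitle B\t1642?\t1642\t1640\n"
    "X0003\tTitle C\t1701-05\t1701\t1700\n"
    "X0004\tTitle D\t1655\t1655\t1650\n"
)


def is_exact(raw_date):
    """True when the raw date is one four-digit year, optionally with a full stop."""
    return re.fullmatch(r"\s*\d{4}\.?\s*", raw_date) is not None


def exact_rows(tsv_text):
    """Return the rows (as dicts) whose raw date passes is_exact."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if is_exact(row["raw_date"])]


rows = exact_rows(SAMPLE_TSV)
per_decade = Counter(int(row["decade"]) for row in rows)
print(len(rows), dict(per_decade))
```

Filtering on the raw date, rather than on the extracted year, is what keeps questionable dates and ranges out of the subset, since those rows still carry a usable-looking year.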

About Sinai

Post-doctoral fellow at the Polonsky Academy, Jerusalem. Interested in text mining the language of dedications, prefaces, letters to the reader and other (mainly, but not only, Early Modern) kinds of paratext, and more generally, in what the digital humanities may hold for the study of paratext.