Digitize a List of Names and Dates

For the digitization of the letters of Arthur Schnitzler, the publishers provided a list of all the recipients and dates of the letters. To match the correct date to each letter, we now want to digitize this list and turn it into a small TEI list, which is a relatively simple process using the transcription software Transkribus and the Oxygen XML Editor. This is a step-by-step guide for beginners.


Step 1: Layout Analysis and Text Recognition

Transkribus is free software that offers many useful tools for layout analysis and text recognition. Sign up and download it at https://transkribus.eu/Transkribus/. Once it is installed, you can log in and use the “Import document(s)” button to load a scan of the pages you want to digitize into Transkribus. Click on the document to open the page view.

By clicking on the tab “Tools” we can view our options for layout analysis and text recognition. Abbyy FineReader OCR will do both in one run, while CITlab will do them separately. As our list is laid out in three columns, which usually get read as one, we will stick with CITlab so we can modify the layout before we run the transcription.

So let’s run Layout Analysis on the document. When the columns are too close together the layout analysis might view them as one text region. We can separate them manually with the tools in the toolbar left of the page view. The tools also allow us to add lines in case the layout analysis skipped any. We can also delete text regions and lines we don’t want to transcribe such as the page number. (If you need a full citation for your document make sure you keep the page numbers. For this list we don’t need them.) Also check if the recognized lines cover all the symbols on the paper, as brackets sometimes get left out.

Once the layout is correct, we continue with text recognition. We can choose from different text recognition models, each trained for a specific type of print or handwriting. It is best to try out a few and see which one gives the most accurate result. After the recognition has run, we have a transcription of the document that we can work with. So let’s export the transcription as a TEI document.


Step 2: Correcting

OCR isn’t perfect, as you will notice as soon as you work with it. Sometimes only one character is consistently misread, which can be fixed using “search and replace”. Other times manual correction is hard to avoid. In the case of this list, the characters ‘2’ and ‘3’ were constantly misread as a variety of other symbols, so there was no efficient way around manual correction. It’s always good to check for ‘1’ being misread as ‘I’ or ‘]’ being misread as ‘1’.

When there isn’t one fixed date available for a letter, the editors like to write something like “Anfang August 1897” (“beginning of August 1897”), which we would like to avoid. So we can now take the time to replace any written-out months like “August” with the corresponding number, to make sure these dates are included in the following steps. We might also replace words indicating the beginning, middle or end of a month with the numbers “1”, “15” and “30” (unless the month is February, which has no 30th) and add a bracket or question mark as an indicator that this date is not precise.
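If the list is long, this replacement could also be scripted instead of done by hand. The following is only a rough Python sketch, not part of the original workflow: the spelling of the month names, the words “Mitte” and “Ende”, and the choice of “28” for the end of February are assumptions, and the function name is made up for illustration.

```python
import re

# German month names as they might appear in the list (assumed spelling).
MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5, "Juni": 6,
    "Juli": 7, "August": 8, "September": 9, "Oktober": 10, "November": 11,
    "Dezember": 12,
}
# Assumed mapping of vague indicators (beginning, middle, end) to a day.
VAGUE_DAYS = {"Anfang": 1, "Mitte": 15, "Ende": 30}

PATTERN = re.compile(
    r"(?:(Anfang|Mitte|Ende)\s+)?(%s)\s+(\d{4})" % "|".join(MONTHS)
)


def normalize_vague_date(text):
    """Replace written-out months and vague day indicators with numbers."""

    def repl(match):
        vague, month_name, year = match.groups()
        month = MONTHS[month_name]
        if vague is None:
            # Only month and year are known: keep two components.
            return f"{month}. {year}"
        day = VAGUE_DAYS[vague]
        if month == 2 and day == 30:
            day = 28  # assumption: February has no 30th, so use the 28th
        # Brackets mark the date as imprecise.
        return f"[{day}. {month}. {year}]"

    return PATTERN.sub(repl, text)


print(normalize_vague_date("Anfang August 1897"))
```

Dates that are already numeric are left untouched, so the function can be run over every line of the list.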


Step 3: Structure and Tags

Next we want to separate dates and names, as our document currently does not distinguish between them: every line of the list is a plain <l> element.

Using a regular expression we can identify dates as any element containing a four-digit number. In some cases the day and the rest of the date were split up, as in <l> 6. </l> <l> 4. 1896 </l>. We will fix these later; for now we mark any element that contains only a single number with a tag of our choice, so we can find it again. (You might have to work outside the TEI schema for a while, or use a tag that TEI allows.)

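We can assume that all remaining elements are names and tag them as such. The relevant part of the XSL template could look as follows:

```
    <xsl:choose>
        <!-- Opening of xsl:choose and this date branch are reconstructed from
             the description above: a four-digit number (the year) marks a date -->
        <xsl:when test="matches(., '\d{4}')">
            <xsl:element name="date">
                <xsl:value-of select="."/>
            </xsl:element>
        </xsl:when>

        <!-- A single token containing a digit is a day split off from its date.
             The one-argument tokenize() requires XPath 3.1 -->
        <xsl:when test="not(tokenize(.)[2]) and matches(., '\d+')">
            <xsl:element name="day">
                <xsl:value-of select="."/>
            </xsl:element>
        </xsl:when>

        <!-- Everything else is assumed to be a name -->
        <xsl:otherwise>
            <xsl:element name="name">
                <xsl:value-of select="."/>
            </xsl:element>
        </xsl:otherwise>
    </xsl:choose>
```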

Next we want to remove any elements related to the page or page number, so that all names and dates are on the same level and share the same parent node. Oxygen provides a tool for this under “Tools” -> “XML refactoring” -> “unpack element”. We can then turn our series of names and dates into an actual list by adding a “list” tag at its beginning and end and wrapping every name, together with the dates that follow it, in an “item” element. Again the easiest way is to use “search and replace” to insert a closing and an opening “item” tag in front of every opening name tag and then manually correct the ones at the beginning and end of the document. The result is a single list in which each item holds a name followed by its dates.
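For readers who prefer a script over “search and replace”, the grouping could be sketched in Python as below. This is an illustration under assumptions, not the method used for the Schnitzler list: the sample names are invented, the elements carry no namespace (a real TEI file would use the TEI namespace), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def group_into_items(flat_root):
    """Wrap each <name> and the elements that follow it in an <item>."""
    result = ET.Element("list")
    item = None
    for child in flat_root:
        # Every <name> starts a new item; so does the very first element.
        if child.tag == "name" or item is None:
            item = ET.SubElement(result, "item")
        item.append(child)
    return result


# Invented sample data: a flat series of names and dates.
flat = ET.fromstring(
    "<body>"
    "<name>Muster, Anna</name><date>6. 4. 1896</date><date>8. 1897</date>"
    "<name>Beispiel, Karl</name><date>1. 5. 1898</date>"
    "</body>"
)
tei_list = group_into_items(flat)
print(ET.tostring(tei_list, encoding="unicode"))
```

Because each name opens a new item, no manual correction at the beginning and end of the list is needed in this variant.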


Step 4: Correcting

This is again a good point to check for potential errors. We can check for dates that were tagged as names (because of OCR errors, a different layout in the original, etc.) or names that are not followed by dates. These are just two examples; using XPath queries you can check for any instances that don’t follow your intended layout. The earlier errors are corrected, the less trouble they cause later on.
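The two checks mentioned above can be sketched in a few lines of Python as an alternative to running XPath queries in Oxygen. The element names follow the list structure described earlier; the function name, the sample data and the missing namespace handling are assumptions made for this illustration.

```python
import re
import xml.etree.ElementTree as ET


def find_problems(tei_list):
    """Return a list of messages for items that break the intended layout."""
    problems = []
    for pos, item in enumerate(tei_list.findall("item"), start=1):
        names = item.findall("name")
        dates = item.findall("date")
        if len(names) != 1:
            problems.append(f"item {pos}: expected one name, found {len(names)}")
        if not dates:
            problems.append(f"item {pos}: name without any date")
        for name in names:
            # A digit inside a name hints at a date that was tagged as a name.
            if re.search(r"\d", name.text or ""):
                problems.append(f"item {pos}: name contains digits")
    return problems


# Invented sample: the second item is a date wrongly tagged as a name.
sample = ET.fromstring(
    "<list>"
    "<item><name>Muster, Anna</name><date>6. 4. 1896</date></item>"
    "<item><name>4. 1896</name></item>"
    "</list>"
)
for message in find_problems(sample):
    print(message)
```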

Now we also want to fix our “lost days”, as I like to call them. Looking through all the occurrences, we can see whether there is a pattern to them, such as the missing day always being in the line above the rest of the date. In our list this did not always hold, and in some places the error piled up in a way that made it nearly impossible to fix with a simple XSLT transformation. Depending on the size of your document you can try to work out a rule-based solution, but with this relatively short list manual correction seemed to be the easiest way.


Step 5: Adding ISO-Dates

Once all our dates consist of day, month and year, as far as these are provided, we need to transform them into ISO dates in the format “YYYY-MM-DD”. Before we do that, we want to strip everything from the date that is not part of the date itself. Brackets and question marks indicating the uncertainty of a date are the first to go. We write a simple XSLT transformation that adds the attribute “cert” and sets it to “low” if any of these symbols are present, and then removes them from the date.

```
<xsl:template match="tei:date">
    <xsl:copy>
        <xsl:if test="contains(., '?') or contains(., '[') or contains(., ']')">
            <xsl:attribute name="cert">
                <xsl:text>low</xsl:text>
            </xsl:attribute>
        </xsl:if>

        <xsl:value-of select="replace(., '(\?|\[|\])', '')"/>
    </xsl:copy>
</xsl:template>
```

Once the brackets and question marks are removed, the only remaining symbols are the dots between day, month and year. We replace them with white space to make sure all components stay separated. Using the tokenize() function in XSL we can now add an attribute called “when” and build the ISO date from the components. If a date consists of only two components (in this case the month and the year), we can instead use the attributes “notBefore” and “notAfter” to narrow down the possible time period. This would be from the first of the known month to the first of the following month. A special case is “Sommer 1897” (“summer 1897”), where we need to set a reasonable start and end date ourselves. This is what the code might look like:

```
<xsl:template match="tei:date">
    <xsl:copy>
        <!-- Keep the certainty marker added in the previous transformation -->
        <xsl:if test="./@cert">
            <xsl:attribute name="cert">
                <xsl:value-of select="./@cert"/>
            </xsl:attribute>
        </xsl:if>

        <!-- Three components (day, month, year): build a single ISO date -->
        <xsl:if test="tokenize(.)[3]">
            <xsl:attribute name="when">
                <xsl:value-of select="tokenize(.)[3]"/>
                <xsl:text>-</xsl:text>
                <xsl:if test="string-length(tokenize(.)[2]) &lt; 2">
                    <xsl:text>0</xsl:text>
                </xsl:if>
                <xsl:value-of select="tokenize(.)[2]"/>
                <xsl:text>-</xsl:text>
                <xsl:if test="string-length(tokenize(.)[1]) &lt; 2">
                    <xsl:text>0</xsl:text>
                </xsl:if>
                <xsl:value-of select="tokenize(.)[1]"/>
            </xsl:attribute>

            <xsl:value-of select="tokenize(.)[1]"/>
            <xsl:text>. </xsl:text>
            <xsl:value-of select="tokenize(.)[2]"/>
            <xsl:text>. </xsl:text>
            <xsl:value-of select="tokenize(.)[3]"/>
        </xsl:if>

        <!-- Two components (month or "Sommer", then year): give a time span -->
        <xsl:if test="not(tokenize(.)[3])">
            <xsl:attribute name="notBefore">
                <xsl:value-of select="tokenize(.)[2]"/>
                <xsl:text>-</xsl:text>
                <xsl:choose>
                    <xsl:when test="tokenize(.)[1] = 'Sommer'">
                        <xsl:text>06</xsl:text>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:if test="string-length(tokenize(.)[1]) &lt; 2">
                            <xsl:text>0</xsl:text>
                        </xsl:if>
                        <xsl:value-of select="tokenize(.)[1]"/>
                    </xsl:otherwise>
                </xsl:choose>
                <xsl:text>-01</xsl:text>
            </xsl:attribute>

            <xsl:attribute name="notAfter">
                <!-- December rolls over into January of the following year -->
                <xsl:choose>
                    <xsl:when test="tokenize(.)[1] = '12'">
                        <xsl:value-of select="number(tokenize(.)[2]) + 1"/>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:value-of select="tokenize(.)[2]"/>
                    </xsl:otherwise>
                </xsl:choose>

                <xsl:text>-</xsl:text>
                <xsl:choose>
                    <xsl:when test="tokenize(.)[1] = 'Sommer'">
                        <xsl:text>09</xsl:text>
                    </xsl:when>
                    <xsl:when test="tokenize(.)[1] = '12'">
                        <xsl:text>01</xsl:text>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:if test="number(tokenize(.)[1]) &lt; 9">
                            <xsl:text>0</xsl:text>
                        </xsl:if>
                        <xsl:value-of select="number(tokenize(.)[1]) + 1"/>
                    </xsl:otherwise>
                </xsl:choose>
                <xsl:text>-01</xsl:text>
            </xsl:attribute>
            <xsl:value-of select="tokenize(.)[1]"/>
            <xsl:text>. </xsl:text>
            <xsl:value-of select="tokenize(.)[2]"/>
        </xsl:if>
    </xsl:copy>
</xsl:template>
```
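To spot-check the attributes the transformation produces, the same rules can be restated compactly in Python. This is a sketch for verification only and mirrors the logic described above: the function name is made up, and the summer span from 1 June to 1 September is the same assumption the template makes.

```python
def iso_attributes(date_text):
    """Return the TEI date attributes for 'D. M. YYYY', 'M. YYYY' or 'Sommer YYYY'."""
    # Replace the dots with white space so the components stay separated.
    parts = date_text.replace(".", " ").split()

    if len(parts) == 3:
        day, month, year = parts
        return {"when": f"{year}-{int(month):02d}-{int(day):02d}"}

    if len(parts) == 2:
        first, year = parts
        if first == "Sommer":
            # Assumed span for "summer": 1 June to 1 September.
            return {"notBefore": f"{year}-06-01", "notAfter": f"{year}-09-01"}
        month = int(first)
        if month == 12:
            # December rolls over into January of the following year.
            return {"notBefore": f"{year}-12-01",
                    "notAfter": f"{int(year) + 1}-01-01"}
        return {"notBefore": f"{year}-{month:02d}-01",
                "notAfter": f"{year}-{month + 1:02d}-01"}

    raise ValueError(f"unexpected date format: {date_text!r}")


print(iso_attributes("6. 4. 1896"))
print(iso_attributes("12. 1897"))
```

Dates with any other number of components raise an error here, which makes leftovers from the earlier steps easy to find.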

Step 6: Correcting

As always, errors find their way into the document. Any leftover symbols that have been overlooked up to this point, such as additional dots or brackets, as well as other transformation errors, now need correction. You can also check whether there are any empty elements or chronological errors within an “item” element. Once that is done, our list is complete. In order to match it to the names in the actual letters, we want to switch the order from “‘last name’, ‘first name’” to “‘first name’ ‘last name’”. The tokenize() function and the comma can be used for this purpose.

```
<xsl:template match="tei:name">
    <xsl:copy>
        <xsl:for-each select="tokenize(.)">
            <xsl:choose>
                <xsl:when test="contains(.,',')"></xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="."/>
                    <xsl:text> </xsl:text>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each>
        <xsl:value-of select="replace(tokenize(.)[1],',','')"/>
    </xsl:copy>
</xsl:template>
```
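The same swap can be sketched in Python by splitting at the comma instead of at white space. This is an alternative illustration rather than a translation of the template; the function name and the sample names are invented.

```python
def first_name_first(name):
    """Turn 'last name, first name' into 'first name last name'."""
    last, comma, first = name.partition(",")
    if not comma:
        # No comma (e.g. an institution or newspaper): leave the name as it is.
        return name.strip()
    return f"{first.strip()} {last.strip()}"


print(first_name_first("Muster, Anna"))
print(first_name_first("von Beispiel, Karl"))
```

Splitting at the comma keeps last names that consist of several words together, and names without a comma pass through unchanged, which reduces the amount of manual correction afterwards.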

Now you will only need to correct the names of institutions or newspapers that do not follow the previous layout. The document is then ready to be matched to the actual letters, a process that is described separately.

