Digitize a Collection of Letters using Transkribus and XSLT

I spent the last two weeks planning and preparing the digitization of two collections of letters written by Arthur Schnitzler and originally published in print. In this blog post I would like to take you along and guide you through the steps, including some tips on how to avoid annoying mistakes. We will be using the transcription program Transkribus and the Oxygen XML Editor. All files are available on GitHub in case you want to take a closer look.


Getting started with Transkribus

Transkribus is a simple-looking but very useful transcription software. You can register with a new account here and download it. Once installed you can simply log in and start working. By clicking on the field below “Collections” you can easily switch between different projects or start a new collection. My supervisor Martin had already prepared a collection, to which he had uploaded a scan of all the letters.

Once the documents are uploaded we can start to use Transkribus’ tools, which you can find in the tab called “Tools”. They include layout analysis, text recognition, computing the accuracy of a model of text recognition and other tools, one of which is P2PaLA, “a layout analysis tool that recognizes structure types on region level (...) based on pre-trained models.” (Transkribus: P2PaLA, 25.07.2019)

What does that mean for our document? First we need to establish some type of layout, so the OCR can do its job. Our options are running the CITLab Advanced layout analysis tool, running the ABBYY FineReader OCR, which does both layout analysis and transcription at once, or using the tools in the small sidebar left of the document (which are great for manual corrections). Using a P2PaLA model for layout recognition gives you the advantage of having your text regions tagged with structure tags such as “header”, “caption” or “footnote” (the tags depend on the model you use). These tags can make the digitization a lot easier, as you will be able to address these structure types in the transcribed document you export from Transkribus. As none of the models currently available provides a sufficient tag set for our letters, training a new model seems to be the best approach to digitize 1500 pages.

This means that we need a manually tagged training set for the model. So I started by running ABBYY FineReader on a selection of letters, as its layout recognition is very good and needed only a few corrections.

In the tab called “Layout” you get an overview of the structure of the page. By right-clicking on a text region in your document you are able to assign a structure type. In the tab “Metadata” you can customize these structure types to your needs. I based my structure types mainly on the TEI tags the final TEI document is supposed to have. By correcting and tagging the text regions we have created a training set for the P2PaLA model as well as a test set we can use to work out our next steps.

Once the training set is tagged we can run some HTR on the set. Just experiment with the available models to see which one gives you the most accurate transcription. While waiting for a response from the Transkribus team regarding the training of the layout recognition model, we can now export our test set to work out our next steps.


XSL: From Chaos to a neatly structured Document

Let’s open the files using the Oxygen XML editor. Before we continue we need to combine these individual files into one file. This link might be helpful, if you’re stuck. Once all our files are merged we can take a closer look. The XML file might throw an error message as the predefined schema doesn’t allow the element “TranskribusMetadata”, so you’ll want to remove that one first.
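If you would rather do the merge with XSLT itself, it could be sketched roughly like this. Treat it as a sketch under assumptions, not as the way the files were actually merged: the folder name "page" is a placeholder, the "?select=*.xml" query is a Saxon-specific convention for reading a directory as a collection, and sorting by file name only gives the printed order if the export names sort that way. Because only the “Page” elements are copied, the per-file metadata (including “TranskribusMetadata”) is left behind.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:page="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">

    <xsl:output method="xml" indent="yes"/>

    <!-- Assumption: the exported PAGE files sit in a folder "page" next to this stylesheet.
         "?select=*.xml" is Saxon's syntax for reading a directory as a collection. -->
    <xsl:param name="source" select="'page?select=*.xml'"/>

    <!-- Run with the initial template (Saxon: -it), no source document needed -->
    <xsl:template name="xsl:initial-template">
        <xsl:element name="PcGts"
            namespace="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
            <!-- Sort by file URI so the pages come out in a stable order -->
            <xsl:for-each select="collection($source)">
                <xsl:sort select="document-uri(.)"/>
                <!-- Keep only the Page element of each exported file -->
                <xsl:copy-of select="page:PcGts/page:Page"/>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>
```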

While the structure tags are giving the exported document a good structure, it may not look like it when you first open the document.

There are many elements we don’t need. Also, the structure tags added in Transkribus are not elements of their own but are stored inside the attribute “custom” of the element “TextRegion”. So the first step to get started with XSLT is writing a little transformation to clean up and rename the elements in our document. It might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:page="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns="http://www.tei-c.org/ns/1.0"
    version="3.0">

    <!-- Copy everything that no other template handles -->
    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Drop the PAGE elements we don't need -->
    <xsl:template match="page:ReadingOrder | page:PrintSpace | page:Metadata | page:SeparatorRegion"/>

    <!-- Turn a text region tagged "title" into a title element holding its text -->
    <xsl:template match="page:TextRegion[contains(@custom,'title')]">
        <xsl:element name="title">
            <xsl:value-of select="./page:TextEquiv/page:Unicode"/>
        </xsl:element>
    </xsl:template>

</xsl:stylesheet>
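Writing one template per structure type gets repetitive, so a more generic variant may be worth a try. The sketch below assumes that the “custom” attribute looks something like "readingOrder {index:0;} structure {type:title;}" and that every structure type is also a valid element name; both are assumptions you should verify against your own export.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:page="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
    xmlns="http://www.tei-c.org/ns/1.0">

    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Any TextRegion whose custom attribute carries a structure type -->
    <xsl:template match="page:TextRegion[matches(@custom, 'structure\s*\{type:')]">
        <!-- Pull the type name out of the attribute value
             (assumed form: structure {type:NAME;}) -->
        <xsl:variable name="type"
            select="replace(@custom, '^.*structure\s*\{type:\s*([^;}\s]+).*$', '$1')"/>
        <!-- Name the new element after the structure type -->
        <xsl:element name="{$type}">
            <xsl:value-of select="page:TextEquiv/page:Unicode"/>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>
```

Regions without a structure tag are not matched and simply fall through to the shallow copy, which makes them easy to spot afterwards.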

Now the document looks a little nicer and it will be easier to address all the remaining elements. (This is also a good time to check the document for unwanted elements that might not have been tagged correctly by the layout analysis.) The elements of the letters are, however, still tied to the page they were printed on. We want to remove the element “Page” so that all our letter components are on the same level. What we want to keep, though, is the page number, so we can properly cite each letter. I tagged the page numbers beforehand as part of the structure tags in Transkribus, so the element “Page” can simply be unpacked.
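Unpacking “Page” takes only a few lines. This is a minimal sketch, assuming the page elements are still in the PAGE namespace at this point; note that the attributes of “Page” (such as the image file name) are dropped along with the wrapper.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:page="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">

    <!-- Copy everything that has no more specific rule -->
    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Drop the Page wrapper but keep processing its children,
         so the contents of all pages end up as siblings -->
    <xsl:template match="page:Page">
        <xsl:apply-templates/>
    </xsl:template>
</xsl:stylesheet>
```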

Now we can try to separate the letters into individual elements. But where does a letter start and where does it end?

It sounds trivial at first but it isn’t always as easy as you might expect. A letter might not always have a closer or a dateline. In this case we get to thank the editors of the original publication for putting a title above every letter. We can now assume that a letter starts with the title and ends with the last element before the next title. So we type up a little XSL transformation, though a few challenges present themselves. In order to get all the paragraphs up to the next title I relied on recursion, as a plain xsl:for-each cannot stop at a specific point.

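<!-- Outputs $par and the paragraphs after it for as long as they belong to $firstTitle -->
<xsl:template name="getP" xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:param name="par"/>
    <xsl:param name="firstTitle"/>

    <!-- Stop when there is no paragraph left or the nearest title before it is a different one.
         "is" compares node identity, so two letters with the same title text are not confused. -->
    <xsl:if test="$par/preceding-sibling::title[1] is $firstTitle">
        <xsl:value-of select="$par"/>
        <!-- Recurse with the next paragraph -->
        <xsl:call-template name="getP">
            <xsl:with-param name="par" select="$par/following-sibling::p[1]"/>
            <xsl:with-param name="firstTitle" select="$firstTitle"/>
        </xsl:call-template>
    </xsl:if>
</xsl:template>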
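The recursion above is the approach I used. As an alternative, XSLT 2.0 and later also offer xsl:for-each-group with group-starting-with, which does the same kind of splitting without recursion. The following is only a sketch of that different technique: the wrapper element "letters" and the div with type "letter" are hypothetical names chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Assumption: after the clean-up, titles and paragraphs are siblings
         inside a single wrapper element, here called "letters" (hypothetical) -->
    <xsl:template match="letters">
        <xsl:copy>
            <!-- Every title opens a new group; the group runs up to the next title -->
            <xsl:for-each-group select="*" group-starting-with="title">
                <div type="letter">
                    <xsl:copy-of select="current-group()"/>
                </div>
            </xsl:for-each-group>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```

Anything that comes before the first title ends up in a group of its own without a title, so that first group deserves a quick look.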

The page numbers get tricky, too. In most cases a letter ends on the same page as the next one begins. In some cases, however, a new letter starts on a new page. How can we tell the difference? We have to check the first preceding sibling of the title. In the case of our letters the printed publication has a header on each page, so a title that directly follows a header starts a new page. Another point of reference could be the page number, which is usually placed at the top or bottom of a page.

This rearrangement is also a good time to introduce the general outline of a TEI document, which consists of a header and the text. The header may remain empty for now, but by creating the element now we can easily address it later on. The header doesn’t stay completely empty though, as any information that should not be in the text but in the header can already be moved there for later modification.
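The check on the preceding sibling could be sketched as follows. The element name "header" for the running header and the marker attribute "startsPage" are hypothetical and only serve to illustrate the test; the attribute is not part of TEI and would be replaced by whatever your final markup needs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Hypothetical names: "header" is the running header of a printed page,
         "title" the heading the editors placed above each letter -->
    <xsl:template match="title">
        <xsl:copy>
            <!-- If the element directly before the title is a running header,
                 nothing else precedes the letter on its page -->
            <xsl:attribute name="startsPage"
                select="exists(preceding-sibling::*[1][self::header])"/>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```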


XSL: Adding Information from a different Document

With the body of our letters being more or less completed, we now need to fill in some information for the header. By writing an XSL template matching our header element we can easily add the structural elements that TEI requires, such as the bibliographical information or the description of the correspondence. Most of it is already known or can be extracted from the document. The addressee, for example, can be taken from the title of each letter using the tokenize() function. Only the dates present a problem, as their format changes from letter to letter and isn’t always complete. We can, however, rely on the resources made available by the editors, who included a list of all recipients and dates of the letters in their edition. By digitizing it and bringing it into a clean format, we can use this list to match the letters based on the name of the recipient and from there pick the correct date off the list. The digitization of this list is documented here.
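To give an idea of the tokenize() step, here is a minimal sketch. It assumes that each letter is already a “TEI” element with an empty “teiHeader”, and that the printed title consists of one leading word followed by the name of the addressee; your titles may well need a different rule. The “fileDesc” part of the header is left out.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Assumption: the title reads "one leading word + name of the addressee" -->
    <xsl:template match="teiHeader">
        <!-- Split the first title of this letter into words -->
        <xsl:variable name="words"
            select="tokenize(normalize-space((ancestor::TEI//title)[1]), ' ')"/>
        <xsl:copy>
            <profileDesc>
                <correspDesc>
                    <correspAction type="received">
                        <!-- Everything after the first word is taken as the name;
                             xsl:value-of joins the words with single spaces -->
                        <persName>
                            <xsl:value-of select="subsequence($words, 2)"/>
                        </persName>
                    </correspAction>
                </correspDesc>
            </profileDesc>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```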

Provided the list is now available in a digital format, we need to check the following things: Are all the letters in chronological order? Are the names spelled correctly by the OCR? Otherwise matching the dates to the letters will be a lot more difficult. We can then write an XSL transformation that could look like this:

<!-- The digitized list of letters; the key finds the list item for a given name -->
<xsl:param name="addr" select="document('Briefverzeichnis1-v5-final.xml')"/>
<xsl:key name="name-of" match="tei:item" use="tei:name"/>

<xsl:template match="correspDate">

    <!-- Recipient of the current letter -->
    <xsl:variable name="name" select="ancestor::correspAction/following-sibling::correspAction/child::persName"/>
    <!-- Position of this letter among all letters to the same recipient -->
    <xsl:variable name="letterCount">
        <xsl:number count="tei:TEI[descendant::correspAction[@type='received']/child::persName=$name]" from="PcGts" level="any"/>
    </xsl:variable>
    <!-- The date in the same position in the list entry for this recipient -->
    <xsl:variable name="listedDate" select="key('name-of', $name, $addr)/tei:date[number($letterCount)]"/>

    <xsl:element name="date">
        <xsl:attribute name="when">
            <xsl:value-of select="$listedDate/@when"/>
        </xsl:attribute>
        <xsl:value-of select="$listedDate"/>
    </xsl:element>
</xsl:template>

Now we can check if there are any empty “date” elements where no match was found and correct any potential mistakes.
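Instead of scrolling through the document, this check can also be left to a small stylesheet. The sketch below assumes the header structure used in the transformation above (a “date” somewhere inside a “correspAction”, with the recipient in the following “correspAction”) and simply prints one line per letter without a usable date.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

    <xsl:output method="text"/>

    <xsl:template match="/">
        <!-- A date counts as unmatched when @when is missing or empty -->
        <xsl:for-each select="//correspAction//date[not(normalize-space(@when))]">
            <xsl:text>No date found for letter to: </xsl:text>
            <xsl:value-of select="ancestor::correspAction/following-sibling::correspAction/persName"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```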


XSL: File Separation and CMIF File

Now that every letter has been provided with all the necessary information (for now), we can separate the letters into individual files. The XSL for this could look as follows:

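<!-- xsl:result-document requires an XSLT 2.0 (or later) processor such as Saxon -->
<xsl:template match="TEI" xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <!-- Position of this letter among its TEI siblings, used in the file name -->
    <xsl:variable name="current">
        <xsl:number/>
    </xsl:variable>

    <!-- Write the letter to its own file: file_1-output.xml, file_2-output.xml, ... -->
    <xsl:result-document method="xml" href="file_{$current}-output.xml">
        <xsl:copy-of select="."/>
    </xsl:result-document>

</xsl:template>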

One thing that can also be done once all information on the correspondence is filled in is the preparation of a CMIF file. CMIF stands for Correspondence Metadata Interchange Format and consists mainly of the “correspDesc” elements of our collection of letters. This CMIF file can be added to correspSearch, where correspondences are linked and searchable by date, place and person. For a more detailed description click here. This can, however, wait until the rest of the project is finished.
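Collecting the “correspDesc” elements is straightforward as long as all letters are still in one document. The following is a rough skeleton only, not a complete or validated CMIF file: the “fileDesc” part, which CMIF needs for title, editor, licence and the source edition, is left out and has to be filled in by hand.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="/">
        <TEI>
            <teiHeader>
                <!-- fileDesc (title, editor, publisher, licence, source edition)
                     is left out of this sketch -->
                <profileDesc>
                    <!-- One correspDesc per letter, in document order -->
                    <xsl:copy-of select="//correspDesc"/>
                </profileDesc>
            </teiHeader>
            <text>
                <body>
                    <p/>
                </body>
            </text>
        </TEI>
    </xsl:template>
</xsl:stylesheet>
```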


Overview: The Project Files and Status Update

As of July 30, 2019 the layout analysis model is not yet trained. While the team of Transkribus has agreed to “have a look at it”, we are still waiting for further information. If the training of the model works as well as we hope, it would speed up the digitization process immensely, as we would no longer have to start with a flat transcription but would be provided with structure tags from the beginning. In four relatively quick steps we can go from a transcription to a basic TEI document.

All project files up to this point are available on GitHub. The following list gives a short description of the XSL files used in the process:

