What is the PAGE XML format?

The ACDH OCR editor (page-editor.acdh.oeaw.ac.at) uses a format called PAGE, an XML format used for the manual creation of OCR (Optical Character Recognition) training data (so-called "ground truth").[1]

Building blocks of a PAGE document

An XML file that follows the PAGE schema represents the content of one page of a document, organized into the following levels:

Other items available in the editor:

* Tables
* Groups of other elements
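
To make the structure of a PAGE document more concrete, here is a minimal sketch in Python that parses a small, hand-written PAGE file and lists its text lines and words. The element names (`PcGts`, `Page`, `TextRegion`, `TextLine`, `Word`, `TextEquiv`, `Unicode`, `Coords`) come from the PAGE schema; the sample document itself, its ids, the image file name and the choice of the 2013-07-15 namespace are assumptions made for illustration. Files produced by the editor follow the OPF variant and may differ in detail.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2013-07-15 PAGE schema release.
# Assumption: your files may declare a different release; adjust as needed.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS = {"pc": PAGE_NS}

# Hand-written sample document (hypothetical ids, file name and coordinates).
SAMPLE = f"""<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="{PAGE_NS}">
  <Metadata>
    <Creator>example</Creator>
    <Created>2020-01-01T00:00:00</Created>
    <LastChange>2020-01-01T00:00:00</LastChange>
  </Metadata>
  <Page imageFilename="page_0001.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1_l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <Word id="r1_l1_w1">
          <Coords points="100,100 300,100 300,200 100,200"/>
          <TextEquiv><Unicode>Hello</Unicode></TextEquiv>
        </Word>
        <Word id="r1_l1_w2">
          <Coords points="320,100 600,100 600,200 320,200"/>
          <TextEquiv><Unicode>world</Unicode></TextEquiv>
        </Word>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def read_lines(xml_bytes):
    """Return (region id, line id, line text, word texts) for each text line."""
    root = ET.fromstring(xml_bytes)
    page = root.find("pc:Page", NS)
    result = []
    for region in page.findall("pc:TextRegion", NS):
        for line in region.findall("pc:TextLine", NS):
            # Transcription of each word that is a direct child of this line.
            words = [
                word.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
                for word in line.findall("pc:Word", NS)
            ]
            # Transcription attached to the line itself (not to its words).
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
            result.append((region.get("id"), line.get("id"), text, words))
    return result


if __name__ == "__main__":
    for region_id, line_id, text, words in read_lines(SAMPLE.encode("utf-8")):
        print(region_id, line_id, repr(text), words)
```

The sketch walks the nesting from regions to lines to words. Each level carries its own `Coords` polygon and may carry its own `TextEquiv` transcription, which is why the line text and the word texts are read separately.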


* https://www.primaresearch.org/www/assets/papers/ICPR2010PletschacherPAGE.pdf
* https://github.com/PRImA-Research-Lab/PAGE-XML
* https://github.com/mauvilsa/nw-page-editor (the implementation we are using)

[1] Strictly speaking, the nw-page-editor application validates against a slightly modified version of PAGE XML called the [omni:us Pages Format (OPF)](http://htmlpreview.github.io/?https://github.com/omni-us/pageformat/blob/fe1ca7f589dfd8aa9cc9abcc4a336e9afddd8aba/pagecontent_omnius.html).

