What is the PAGE XML format?
Posted by dschopper
on March 23, 2020, 11:36 a.m. (last update March 24, 2020, 9:36 a.m.)
status: draft
The ACDH OCR editor (page-editor.acdh.oeaw.ac.at) uses a format called PAGE, an XML format used for manually creating OCR (Optical Character Recognition) training data, so-called "ground truth".[1]
Building blocks of a PAGE document
An XML file following the PAGE schema represents the content of one page of a document and contains the following nested levels:
- Page: the physical page corresponding to the actual size of the paper (irrespective of the printed space on it – in German „Satzspiegel“)
- TextRegion: an area on the page containing lines of text
- TextLine: a line of consecutive characters; the TextLines together make up a TextRegion
- Word: a word in a TextLine (including adjacent punctuation marks)
- Glyph: a single character; the Glyphs together make up a Word
Other items available in the editor:
- Tables
- Groups of other elements
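To make the nesting of these levels more concrete, here is a minimal sketch in Python that walks a tiny, hand-written PAGE document from Page down to Glyph. Everything in the sample is an assumption for illustration: the file name, the ids, the coordinates, and the choice of the 2019-07-15 namespace (other PAGE versions and the OPF variant use different namespace URIs). The sample also leaves out the Metadata element, so it would not pass schema validation as it stands. The helper name `outline` is hypothetical and not part of any editor or library.

```python
import xml.etree.ElementTree as ET

# Namespace of one published PAGE schema version (an assumption here;
# adjust it to whatever namespace your files actually declare).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hand-written sample: file name, ids and coordinates are invented,
# and the Metadata element required by the schema is omitted.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1_l1">
        <Coords points="100,100 900,100 900,180 100,180"/>
        <Word id="r1_l1_w1">
          <Coords points="100,100 220,100 220,180 100,180"/>
          <Glyph id="r1_l1_w1_g1">
            <Coords points="100,100 160,100 160,180 100,180"/>
            <TextEquiv><Unicode>H</Unicode></TextEquiv>
          </Glyph>
          <Glyph id="r1_l1_w1_g2">
            <Coords points="160,100 220,100 220,180 160,180"/>
            <TextEquiv><Unicode>i</Unicode></TextEquiv>
          </Glyph>
          <TextEquiv><Unicode>Hi</Unicode></TextEquiv>
        </Word>
        <TextEquiv><Unicode>Hi</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def outline(xml_text):
    """Return (level, label) pairs in document order, from Page down to Glyph."""
    root = ET.fromstring(xml_text)
    page = root.find("pc:Page", NS)
    rows = [("Page", page.get("imageFilename"))]
    for region in page.findall("pc:TextRegion", NS):
        rows.append(("TextRegion", region.get("id")))
        for line in region.findall("pc:TextLine", NS):
            rows.append(("TextLine", line.get("id")))
            for word in line.findall("pc:Word", NS):
                rows.append(("Word", word.get("id")))
                for glyph in word.findall("pc:Glyph", NS):
                    # The transcription of a glyph sits in TextEquiv/Unicode.
                    text = glyph.findtext(
                        "pc:TextEquiv/pc:Unicode", default="", namespaces=NS
                    )
                    rows.append(("Glyph", text))
    return rows


if __name__ == "__main__":
    for level, label in outline(SAMPLE):
        print(f"{level}: {label}")
```

Each level carries its own Coords polygon, and TextLine, Word and Glyph can each carry a transcription in TextEquiv/Unicode, which is why the same text "Hi" appears at several levels of the sample.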
References
- https://www.primaresearch.org/www/assets/papers/ICPR2010PletschacherPAGE.pdf
- https://github.com/PRImA-Research-Lab/PAGE-XML
- https://github.com/mauvilsa/nw-page-editor (the implementation we are using)
[1] Strictly speaking, the nw-page-editor application validates against a slightly modified version of PAGE XML, called [omni:us Pages Format (OPF)](http://htmlpreview.github.io/?https://github.com/omni-us/pageformat/blob/fe1ca7f589dfd8aa9cc9abcc4a336e9afddd8aba/pagecontent_omnius.html).