Step 1: Clean up regions

Depending on the origin of your data, the OCR engine might have wrongly identified some parts of the page as text regions. Frequently, these are shadows in the binding, horizontal separator lines in the text, stains on the paper, etc. Such regions usually contain arbitrary characters that make the whole page hard to read, so it is important to remove them from the page at all levels (TextRegions, Lines, Words, etc.).

Follow these steps to clean up such noise on the page:

Continue this procedure until there are no superfluous regions left on the page.

Save your edits by clicking the "Save" button in the menu, then move on to the next page.
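
If you have many pages, it may help to pre-screen them for likely noise before cleaning them by hand. The following is only a minimal sketch, not a feature of the tool: it assumes your pages are available as PAGE XML files with `TextRegion`, `TextLine` and `TextEquiv/Unicode` elements (the text above does not name a file format), and the function names and the 0.5 threshold are hypothetical choices made for illustration. It flags a region as noise when its transcription is empty or consists mostly of non-alphanumeric characters, and removes the region together with its nested lines and words.

```python
import xml.etree.ElementTree as ET


def looks_like_noise(text, min_alnum_ratio=0.5):
    """Hypothetical heuristic: True if the text is empty or mostly symbols."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True
    alnum = sum(1 for c in chars if c.isalnum())
    return alnum / len(chars) < min_alnum_ratio


def remove_noise_regions(page_xml):
    """Drop suspicious TextRegions; return (cleaned XML string, removed ids)."""
    root = ET.fromstring(page_xml)
    # Reuse whatever namespace the root element carries (PAGE versions differ).
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""

    def q(name):
        return f"{{{ns}}}{name}" if ns else name

    removed = []
    for page in root.iter(q("Page")):
        for region in list(page.findall(q("TextRegion"))):
            # Join the line-level transcriptions of this region.
            parts = []
            for line in region.findall(q("TextLine")):
                uni = line.find(f"{q('TextEquiv')}/{q('Unicode')}")
                if uni is not None and uni.text:
                    parts.append(uni.text)
            if looks_like_noise(" ".join(parts)):
                # Removing the region also removes its nested lines and words.
                page.remove(region)
                removed.append(region.get("id"))
    return ET.tostring(root, encoding="unicode"), removed
```

A heuristic like this will misjudge some regions (for example, a genuine line that contains only punctuation), so it is safer to review the list of removed ids against the page image before saving anything.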
