This article refers to the new version of the template editor for PDFs using OCR. Check out this page if you are looking to extract tables from emails and text documents.
This tutorial assumes that you already know how to create OCR templates and will demonstrate the use of Labels.
How to teach Parseur where to start and stop a table?
Table fields have a variable number of rows, so you need to help Parseur understand where to start and, especially, where to stop.
We are going to use Parseur's Dynamic OCR capabilities with Labels for this. Check out the article in the link if you need more information about labels. We will create two relative labels: one for the start of the table and one for the end.
In the example below, the table field is relative to the Table Header label for the start and the Subtotal label for the end:
Step-by-step instructions to extract a table from a PDF
To create a table field using dynamic OCR:
Draw a box over the table you want to extract.
Move or resize the box using the handles as appropriate. Make sure to cover the full table.
Click on the New Table Field button.
The preview becomes available at the bottom of the screen.
Click in the table at the position where you want to split columns.
Resize the columns to ensure their width will accommodate any length of text.
The preview updates with the new columns.
Name the columns by clicking on their names in the table preview or creating new names. Alternatively, if your selection contains the table header, you can select the "Table headers included in the selection" option to name your columns automatically.
Create a label above the table to identify the start of the table and assign it to the "Start relative to" field. Make sure the label will be present on every document and at the same distance from the start of the table as in the current sample.
Create a label below the table to identify the end of the table and assign it to the "End relative to" field. As with the start label, it must be present on every document and at the same distance from the end of the table as in the current sample.
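To make the idea behind the steps above more concrete, here is a minimal Python sketch of how a table can be bounded by two labels and split into columns. It is only an illustration of the concept, not Parseur's actual implementation: the `Word` class, the function names, the single-word labels, and the coordinate values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page (hypothetical units)
    y: float  # top edge of the word on the page (hypothetical units)


def find_label_y(words, label_text):
    """Return the y position of the first word matching the label, or None."""
    for word in words:
        if word.text == label_text:
            return word.y
    return None


def extract_table(words, start_label, end_label, column_starts):
    """Group the words found between two labels into rows and columns."""
    start_y = find_label_y(words, start_label)
    end_y = find_label_y(words, end_label)
    if start_y is None or end_y is None:
        return []  # a missing label means the table cannot be located

    rows = {}
    for word in words:
        # Keep only the words strictly between the two labels.
        if not (start_y < word.y < end_y):
            continue
        # The column is the last boundary at or to the left of the word.
        column = sum(1 for boundary in column_starts if word.x >= boundary) - 1
        if column < 0:
            continue
        cells = rows.setdefault(word.y, [""] * len(column_starts))
        cells[column] = (cells[column] + " " + word.text).strip()
    return [rows[y] for y in sorted(rows)]
```

Because the table limits are computed from the label positions rather than from fixed coordinates, a document with more rows simply pushes the end label further down and the extracted table grows with it. This is also why both labels need to appear on every document.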
How to handle tables spanning several pages?
Sometimes your PDFs have a header or footer that gets in the way of multipage tables.
Check the example below:
The "Thank you for your business!" message in the page footer gets extracted as part of the table. Not good.
Fortunately, this is something easy to fix:
Move your mouse over the footer line with the red dots.
Click to grab the line.
Move it above the footer text section and release the mouse button.
Do the same with the header line if needed.
Now that we have moved the red footer line above the footer text, that text is excluded from the parsed data. Great!
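The effect of moving the header and footer lines can be pictured with a short Python sketch. This is an assumption-laden illustration rather than Parseur's own code: the function name, the `(y, text)` page representation, and the coordinate values are hypothetical.

```python
def table_lines_across_pages(pages, header_y, footer_y):
    """Concatenate the lines of every page that sit between the two limits.

    `pages` is a list of pages, each page being a list of (y, text) tuples.
    Lines at or above `header_y`, or at or below `footer_y`, are ignored.
    """
    kept = []
    for page in pages:
        for y, text in sorted(page):
            if header_y < y < footer_y:
                kept.append(text)
    return kept
```

With a footer limit below the "Thank you for your business!" line, that message ends up in the table on every page. Moving the limit above it leaves only the item rows, joined across pages.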
How to merge rows?
Sometimes table rows span several lines, as in the example below:
This order includes 2 items, with the first item spanning 3 lines. However, by default Parseur creates one row per line in the parsed data. Not good.
Parseur includes some row-merge options to tune the result:
Scroll down to the Row Merge options in the right menu.
Select the column to base row-merge on. Rows in other columns will be merged until new text appears in the chosen column. In our example, we want to create a new row every time we see a new value in the quantity column, so we select the quantity column.
Optionally, change the vertical alignment of the cells (default is Top).
Now that we have asked to merge rows based on the "quantity" column, the parsed data is properly formatted, with only 2 rows in the result. Great!
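The row-merge rule described above can be sketched in a few lines of Python. This is a simplified illustration, not Parseur's implementation: the `merge_rows` function and the sample data are hypothetical, and the sketch only models the default Top alignment, where the value in the chosen column sits on the first line of each row.

```python
def merge_rows(rows, key_column):
    """Merge continuation lines into the previous row.

    A new row starts whenever the key column contains text. Lines whose
    key column is empty are appended, cell by cell, to the row above.
    """
    merged = []
    for row in rows:
        if row[key_column].strip() or not merged:
            merged.append(list(row))
        else:
            previous = merged[-1]
            for index, cell in enumerate(row):
                if cell.strip():
                    previous[index] = (previous[index] + " " + cell.strip()).strip()
    return merged
```

For instance, four extracted lines where only the first and last have a quantity would collapse into two rows, with the description text of the continuation lines joined onto the first item.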