Extract PDF tables with OCR

How to set up your OCR template to extract data from PDF tables


This article refers to the new version of the template editor for PDFs using OCR. Check out this page if you want to extract tables from emails and text documents instead.

This tutorial assumes that you already know how to create OCR templates and will demonstrate the use of Labels.

How to teach Parseur where to start and stop a table?

Table fields have a variable number of rows, so you need to help Parseur understand where the table starts and, especially, where it stops.

We are going to use Parseur's Dynamic OCR capabilities with Labels for this. Check out the article in the link if you need more information about labels. We will create two relative labels: one for the start of the table and one for the end.
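As a rough mental model of why two labels solve the variable-row problem, here is a minimal Python sketch. It is not Parseur's implementation: the `Line` type, the `rows_between_labels` function, and the sample text are all hypothetical, and it simply assumes every line of a page comes with a vertical position.

```python
from dataclasses import dataclass


@dataclass
class Line:
    y: float   # hypothetical vertical position; larger means lower on the page
    text: str


def rows_between_labels(lines, start_label, end_label):
    """Return the lines located strictly between the two labels."""
    # Find the start label first, then the first end label below it.
    start_y = next((l.y for l in lines if start_label in l.text), None)
    if start_y is None:
        return []
    end_y = next(
        (l.y for l in lines if end_label in l.text and l.y > start_y), None
    )
    if end_y is None:
        return []
    return [l for l in lines if start_y < l.y < end_y]


# Two sample documents with a different number of rows (made-up data).
short_doc = [
    Line(10, "Invoice 1"),
    Line(20, "Description Qty Price"),
    Line(30, "Widget 2 10.00"),
    Line(40, "Subtotal 20.00"),
]
long_doc = [
    Line(10, "Invoice 2"),
    Line(20, "Description Qty Price"),
    Line(30, "Widget 2 10.00"),
    Line(40, "Gadget 1 5.00"),
    Line(50, "Gizmo 3 7.50"),
    Line(60, "Subtotal 47.50"),
]

short_rows = rows_between_labels(short_doc, "Description", "Subtotal")
long_rows = rows_between_labels(long_doc, "Description", "Subtotal")
```

Because the table is bounded by the labels rather than by a fixed box, the same two labels yield one row for the first document and three for the second.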

In the example below, the table field starts relative to the Table Header label and ends relative to the Subtotal label:

Step by step instructions to extract a table from a PDF

To create a table field using dynamic OCR:

  1. Draw a box over the table you want to extract.

  2. Move or resize the box using the handles as appropriate. Make sure to cover the full table.

  3. Click on the New Table Field button.

  4. The preview becomes available at the bottom of the screen.

  5. Click in the table at the position where you want to split columns.

  6. Resize the columns so that their width will accommodate any length of text.

  7. The preview updates with the new columns.

  8. Name the columns by clicking on their names in the table preview or by creating new names. Alternatively, if your selection contains the table header, you can select the "Table headers included in the selection" option to name your columns automatically.

  9. Create a label above the table to identify the start of the table and assign it to the "Start relative to" field. Make sure the label will be present on every document and at the same distance from the start or end of the table as in the current sample.

  10. Create a label below the table to identify the end of the table and assign it to the "End relative to" field.

Note: Creating a table field inside another table field isn't currently supported.
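Steps 5 to 7 amount to assigning each piece of text to a column based on its horizontal position. The Python sketch below illustrates that idea only. The function name, the coordinates, and the column names are hypothetical, and Parseur's editor does this for you without any code.

```python
from bisect import bisect_right


def split_into_columns(words, split_positions, names):
    """Group the words of one row into named columns.

    words: list of (x, text) pairs, x being the left edge of the word.
    split_positions: sorted x coordinates of the column separators.
    names: one name per column, so len(split_positions) + 1 names.
    """
    cells = {name: [] for name in names}
    for x, text in sorted(words):
        # bisect_right counts how many separators sit at or left of x,
        # which is the index of the column the word falls into.
        index = bisect_right(split_positions, x)
        cells[names[index]].append(text)
    return {name: " ".join(parts) for name, parts in cells.items()}


# Made-up row: two separators produce three columns.
row = [(50, "Blue"), (90, "widget"), (310, "2"), (420, "10.00")]
columns = split_into_columns(row, [300, 400], ["description", "quantity", "price"])
```

This also shows why step 6 matters: if a separator sat at x = 80 instead of 300, a longer description would spill into the next column.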

How to handle tables spanning several pages?

Sometimes your PDFs have a header or footer that can get in the way of multipage tables.

Check the example below:

Table with footer text included

The "Thank you for your business!" message in the page footer gets extracted as part of the table. Not good.

Fortunately, this is easy to fix:

  1. Move your mouse over the footer line with the red dots.

  2. Click to grab the line.

  3. Move it above the footer text section and release the mouse button.

  4. Do the same with the header line if needed.

Table with footer margin

Now that we moved the red footer line above the footer text, it is excluded from the parsed data. Great!
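Conceptually, the header and footer lines act as a vertical filter applied to every page. Here is a minimal Python sketch of that filter, assuming hypothetical y coordinates measured from the top of the page. The function name and the values are illustrative, not taken from Parseur.

```python
def drop_margins(lines, header_y, footer_y):
    """Keep only the lines located between the header and footer limits.

    lines: list of (y, text) pairs, y measured from the top of the page.
    """
    return [(y, text) for y, text in lines if header_y <= y <= footer_y]


# Made-up page: a company name in the header and a message in the footer.
page = [
    (40, "ACME Corp"),
    (120, "Widget 2 10.00"),
    (700, "Gadget 1 5.00"),
    (760, "Thank you for your business!"),
]
kept = drop_margins(page, header_y=60, footer_y=740)
```

Moving the footer line above the footer text corresponds to lowering `footer_y` until the unwanted message falls outside the kept range.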

How to define the row size? How to merge rows?

Sometimes, table rows span several lines, as in the example below:

This order includes 2 items, with the first item spanning 3 lines. However, by default Parseur creates one row per line in the parsed data. Not good.

Parseur includes some row-merge options to tune the result:

  • Scroll down to the Row Merge options on the right menu

  • Select the column to base row-merge on. Rows in other columns will be merged until new text appears in the chosen column. In our example, we want to create a new row every time we see a new value in the quantity column, so we select the quantity column.

  • Optionally, change the vertical alignment of the cells (default is Top)

Now that we have asked to merge rows based on the "quantity" column, the parsed data is properly formatted, with only 2 rows in the result. Great!
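The merge rule described above can be sketched in a few lines of Python. This is an illustration of the logic, not Parseur's code: `merge_rows` and the sample data are hypothetical, and the sketch assumes the value in the chosen column sits on the first line of each item.

```python
def merge_rows(rows, key_column):
    """Merge consecutive rows until new text appears in key_column.

    rows: list of dicts mapping column names to strings, one dict per line.
    """
    merged = []
    for row in rows:
        if row.get(key_column, "").strip() or not merged:
            # New text in the key column (or very first line): start a new row.
            merged.append(dict(row))
        else:
            # Continuation line: append its non-empty cells to the current row.
            last = merged[-1]
            for column, value in row.items():
                if value.strip():
                    last[column] = (last.get(column, "") + " " + value).strip()
    return merged


# Made-up order: two items, the first one spanning three lines.
lines = [
    {"quantity": "2", "description": "Blue widget", "price": "10.00"},
    {"quantity": "", "description": "with extended", "price": ""},
    {"quantity": "", "description": "warranty", "price": ""},
    {"quantity": "1", "description": "Gadget", "price": "5.00"},
]
result = merge_rows(lines, "quantity")
```

With "quantity" as the key column, the four lines collapse into two rows, and the first row's description becomes "Blue widget with extended warranty".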
