Extract PDF tables with OCR

How to set up your OCR template to extract data from PDF tables

Updated over 11 months ago

This article refers to the new version of the template editor for PDFs using OCR. Check out this page if you are looking to extract tables from emails and text documents instead.

This tutorial assumes that you already know how to create OCR templates and will demonstrate the use of Labels.

How to teach Parseur where to start and stop a table?

Table fields have a variable number of rows, so you need to help Parseur understand where the table starts and, especially, where it stops.

For this, we are going to use Parseur's Dynamic OCR capabilities with Labels. Check out the article at the link if you need more information about labels. We will create two relative labels, one for the start of the table and one for the end.

For example, in the sample below, the table field starts relative to the Table Header label and ends relative to the Subtotal label:
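
Parseur is configured in its editor, not in code, so the following Python sketch is only a mental model of why two labels are enough to handle a variable number of rows. Everything in it (the `rows_between_labels` function, the `(y, text)` line format, and the sample invoice lines) is a hypothetical illustration, not Parseur's API or internal logic.

```python
from typing import List, Tuple

# Hypothetical representation of OCR output: (vertical position, line text).
Line = Tuple[float, str]


def rows_between_labels(lines: List[Line], start_label: str, end_label: str) -> List[str]:
    """Return the text of every line located strictly between the two labels."""
    # Position of the first line containing the start label.
    start_y = next(y for y, text in lines if start_label in text)
    # Position of the first line containing the end label below the start label.
    end_y = next(y for y, text in lines if end_label in text and y > start_y)
    # Keep whatever sits between the two labels, however many rows that is.
    return [text for y, text in sorted(lines) if start_y < y < end_y]


# Assumed sample document, loosely modeled on an invoice.
lines = [
    (100.0, "Invoice #1234"),
    (140.0, "Qty Description Price"),
    (160.0, "2 Widget 10.00"),
    (180.0, "1 Gadget 25.00"),
    (200.0, "Subtotal 45.00"),
    (220.0, "Thank you for your business!"),
]

table_rows = rows_between_labels(lines, "Qty Description Price", "Subtotal")
print(table_rows)
```

Because the end of the table is tied to the position of the Subtotal line rather than to a fixed height, a document with more item rows still yields a complete table in this model.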

Step by step instructions to extract a table from a PDF

To create a table field using dynamic OCR:

  1. Draw a box over the table you want to extract.

  2. Move or resize the box using the handles as appropriate. Make sure to cover the full table.

  3. Click on the New Table button.

  4. The preview becomes available at the bottom of the screen.

  5. Click in the table at the position where you want to split columns.

  6. Resize the columns to ensure their width will accommodate any length of text.

  7. The preview updates to include the new columns.

  8. Name the columns by clicking on their names in the table preview or by creating new names. Alternatively, if your selection contains the table header, you can click the Table headers included in the selection option to automatically name your columns.

  9. Create a label above the table to identify the start of the table and assign it to the "Start relative to" field. Make sure the label will be present on every document and at the same distance from the start or end of the table as in the current sample.

  10. Create a label below the table to identify the end of the table and assign it to the "End relative to" field.

Note: Creating table fields inside a table field isn't currently supported.
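
To see why steps 5 and 6 ask for column boundaries wide enough for any text, here is a minimal Python sketch of fixed-position column splitting. It is an assumption-laden illustration: `split_into_columns`, the `(x, text)` word format, and the sample coordinates are hypothetical and do not describe Parseur's actual implementation.

```python
from bisect import bisect_right
from typing import List, Tuple


def split_into_columns(words: List[Tuple[float, str]], splits: List[float]) -> List[str]:
    """Assign each word of one row to a column based on its horizontal position.

    words:  (x position, text) pairs for a single table row (hypothetical format).
    splits: sorted x positions of the column boundaries, like the positions
            clicked in step 5.
    """
    cells: List[List[str]] = [[] for _ in range(len(splits) + 1)]
    for x, text in sorted(words):
        # bisect_right counts how many boundaries lie at or to the left of the word.
        cells[bisect_right(splits, x)].append(text)
    return [" ".join(cell) for cell in cells]


# Two boundaries produce three columns.
splits = [100.0, 350.0]
row = [(50.0, "2"), (120.0, "Blue"), (160.0, "Widget"), (400.0, "10.00")]
cells = split_into_columns(row, splits)
print(cells)
```

In this model, a description long enough to reach past x = 350 would have its last words assigned to the price column, which is the kind of problem a generously sized column avoids.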

How to handle tables spanning several pages?

Sometimes your PDFs have a header or footer that can get in the way of multipage tables.

Check out the example below:

Table with footer text included

The "Thank you for your business!" message in the page footer gets extracted as part of the table. Not good.

Fortunately, this is easy to fix:

  1. Move your mouse over the footer line with the red dots.

  2. Click to grab the line.

  3. Move it above the footer text section and release the mouse button.

  4. Do the same with the header line as well, if needed.

Table with footer margin

Now that we have moved the red footer line above the footer text, that text is excluded from the parsed data. Great!
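
Conceptually, the header and footer lines act as a vertical filter applied to every page. The Python sketch below illustrates that idea only; `drop_margins`, the `(page, y, text)` format, and the coordinates are hypothetical and are not taken from Parseur.

```python
from typing import List, Tuple

# Hypothetical OCR output: (page number, vertical position, line text).
PageLine = Tuple[int, float, str]


def drop_margins(lines: List[PageLine], header_y: float, footer_y: float) -> List[PageLine]:
    """Keep only the lines located between the header line and the footer line."""
    return [
        (page, y, text)
        for page, y, text in lines
        if header_y < y < footer_y
    ]


# Assumed two-page document with the same header and footer on each page.
lines = [
    (1, 40.0, "ACME Corp"),
    (1, 300.0, "2 Widget 10.00"),
    (1, 700.0, "Thank you for your business!"),
    (2, 40.0, "ACME Corp"),
    (2, 100.0, "1 Gadget 25.00"),
    (2, 700.0, "Thank you for your business!"),
]

# Footer line placed above the footer text, header line placed below the header text.
kept = [text for _, _, text in drop_margins(lines, header_y=60.0, footer_y=680.0)]
print(kept)
```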

How to define the row size? How to merge rows?

Sometimes, table rows span several lines, as in the example below:

This order includes 2 items, with the first item spanning 3 lines. However, by default, Parseur creates one row per line in the parsed data. Not good.

Parseur includes a Group-by option to tune the result:

  • Scroll down to the Group-by options on the right menu.

  • Select the column to group rows on. Rows in other columns will be merged until new text appears in the chosen column. In our example, we want to create a new row every time we see a new value in the quantity column, so we select the quantity column.

  • Optionally, change the vertical alignment of the cells (default is Top).

Now that we have asked to group rows based on the "quantity" column, the parsed data is properly formatted, with only 2 rows in the result. Great!
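
The merging rule described above can be sketched in a few lines of Python. This is a simplified model that assumes Top alignment (the grouping value sits on the first line of each row); `group_rows` and the sample data are hypothetical and are not Parseur's own code.

```python
from typing import Dict, List


def group_rows(rows: List[Dict[str, str]], key: str) -> List[Dict[str, str]]:
    """Merge consecutive lines until new text appears in the `key` column."""
    grouped: List[Dict[str, List[str]]] = []
    for row in rows:
        if row.get(key, "").strip() or not grouped:
            # New text in the grouping column (or very first line): start a new row.
            grouped.append({col: [text] if text.strip() else [] for col, text in row.items()})
        else:
            # Continuation line: append its non-empty cells to the current row.
            for col, text in row.items():
                if text.strip():
                    grouped[-1].setdefault(col, []).append(text)
    return [{col: " ".join(parts) for col, parts in g.items()} for g in grouped]


# Assumed raw output: one entry per printed line, the first item spans 3 lines.
raw = [
    {"quantity": "2", "description": "Blue widget", "price": "10.00"},
    {"quantity": "", "description": "with extended", "price": ""},
    {"quantity": "", "description": "warranty", "price": ""},
    {"quantity": "1", "description": "Gadget", "price": "25.00"},
]

merged = group_rows(raw, key="quantity")
print(merged)
```

With the quantity column as the key, the four printed lines collapse into two rows, and the three description fragments of the first item are joined into a single cell.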

How to handle complex tables or tables with varying column width?

OCR table fields only work with simple "rectangular" tables that have fixed-width columns. If you want to extract data from complex tables or lists, you could try using the AI parsing engine instead.
