In this article, we'll see how to set up Parseur to extract tables from emails and documents and parse the results into structured data.
Extracting tables from emails is very useful, for example, when you want to parse shopping orders coming into your mailbox. There are typically varying numbers of items in each email, and you don't want to create a template for each different number of ordered items. For each ordered item, you want to extract several data points, such as item name, quantity, and price.
Notes:
this article assumes you're already familiar with Parseur basics. If not, click here to get started.
this article is focused on extracting tables from emails and text documents. For PDFs, check out our extract tables from PDFs with OCR article.
Prefer a video?
A typical example: an amazon order
For this article, let's assume we want to extract details about book orders on Amazon. Whenever we order on Amazon, Amazon sends a confirmation email containing order details. An Amazon confirmation email looks like this:
We want to create a template that can extract order details, including for each item the book title and the book price. We want to create a single template that will extract all data, irrespective of the number of books ordered.
Here is what you get after processing this email in Parseur:
Reference:
123-4567890-1234567
Date:
Sunday, October 7
Customer Address:
Jane Doe FAIRVIEW HEIGHTS VAN HOUTEN HILL LONDON, ZZ9 ABC United Kingdom
Items[0].Title:
Fear and Loathing in Las Vegas
Items[0].Price:
£12.99
Items[1].Title:
The Tempest
Items[1].Price:
£9.99
Items[2].Title:
The Hitchhiker's Guide to the Galaxy
Items[2].Price:
£14.99
Total:
£37.97
And here is the same result, viewed in JSON format:
Step 1: Create a table field
In the template editor, select the whole table you want to capture.
Click New Table to create a new field for that selection.
The table option panel opens on the right-hand side, and the table gets highlighted in green.
Note: for best results, make sure the document you use for the template contains at least 2 rows (i.e., in our example, 2 books).
Note: Creating table fields inside a Table field isn't currently supported
Step 2: Set table name and options
Use the Table option panel on the right-hand side to customize your field:
Name: Enter a name for the table field
Output format: leave as is ("Table")
Input format: In most cases, leave it as is ("Vertical HTML Table"). Parseur will assume your table rows are laid out vertically, top to bottom. If your table rows are laid out horizontally from left to right, select "Horizontal HTML Table". If your table is Text only, choose "Vertical Text Table" (or "Horizontal Text Table" respectively).
Table Header: If the first row of your table selection contains the column names and Parseur was able to correctly identify columns in the result preview, tick the box "Table headers included in selection". Otherwise, go to Step 3 below to teach Parseur how to identify your columns.
Remove empty columns: By default, Parseur will remove empty columns. Uncheck the box if you prefer to keep empty columns. Keeping empty columns can be useful when your table has optional column fields that are not showing in the document used to create the template but can appear in other documents.
Step 3: Create table columns
At the bottom of your document, Parseur shows you a live preview of the table parsing.
You have several ways to customize Parseur table parsing to your needs.
Option 1: Use default parsing
If the table preview shows the data as you want it, great!
To name the columns:
Click on the column name
Select the column name from the list or type a new one
Option 2: Customize table parsing and columns
If Parseur doesn't get the table parsed the way you want it, you can improve the parsing result by telling Parseur which data you want to extract.
In this example, we are going to create column fields to capture the information we want to extract. And then, we're going to point those columns across multiple rows until the preview result is what we want.
To do this:
Select a piece of text you want to extract on the first row of the table
Click New Table Column
Enter a name for that new column and hit Enter (or click the tick button)
Repeat steps 1. to 3. for each column field you want to extract.
Now we need to teach Parseur how to split rows:
Select similar fields in the following rows and assign them to the existing field.
Do this until the result preview suits what you need.
When the table preview at the bottom of the document shows the desired result, you are done!
Option 3 (advanced): Further customize table parsing with separators
If you need to do some advanced table parsing, you can set your own table row and cell separators.
Parseur uses regular expressions to split the table into rows and the rows into cells.
You will find the regular expressions generated by Parseur in the Advanced Options panel, under Row Separators and Cell Separators. Tick the Manually set separators box to customize the separators.
The separators are regular expressions that Parseur will use to slice the table into cells.
Examples of using separators:
To slice the table into rows based on the HTML
</tr>
attribute, enter</tr>
in row separatorTo slice the table into rows based on a new line, enter \n in the row separator
To slice the rows into cells based on either
</th>
or</td>
, enter</th>|</td>
in cell separator (the pipe character|
means "or" in regexps)To slice the rows into cells for every 2 or more spaces (common for text-based tables), enter
\s{2,}
in cell separator (\s
means any space character and{2,}
means the previous pattern needs to be matched 2 or more times)
You can get very far with regular expressions (including using lookaheads and lookbehinds). Check out the following link for the full reference: https://docs.python.org/3.6/library/re.html
Step 4 (optional): Group and merge table rows
Parseur can group consecutive rows based on a specific column. For example, let's say you have the following table displayed in the table preview:
You want the food option rows to be merged with the food item above them.
To do so:
While editing your table field, scroll down to the Group-by options panel in the right menu
Select the column you want to group rows by. This is the column that tells Parseur to start a new row when it finds text in it. In our example, this is either the quantity or price column:
Table preview refreshes and shows you the table properly formatted:
By default, Parseur assumes text in rows is aligned to the top. If needed, change the Text alignment value to Middle or Bottom to fit your row's vertical alignment.
Step 5 (optional): Filter table rows
Parseur can filter out unwanted rows. For example, you may want to keep rows that only match a specific value in a certain column. You can do that by setting constraints on column fields.
To filter table rows:
Click on the edit button of the column field you want to set a constraint on
Open the Advanced options panel
In the constraint option, enter what the column needs to match in order to be kept. The constraint input accepts regular expressions.
To setup multi-column constraints, save this field and edit other column constraints.
Examples:
To filter on rows with John Smith in the name column, enter John Smith in the constraint input of the name column.
To filter on rows where column ID ends with ABC, enter ABC$ in the constraint input of the ID column
Note: By default, if the filter returns no row, Parseur will set the document to "New Template Needed". If you want to allow tables with empty rows, check the Allow empty tables box when editing the table field.
Step 6: Save your changes
Click the Save Field button when you're done editing the table field.
Then, continue editing the template by adding new fields and finally click Create Template when completed.
Step 7: Export your table data
You have several strategies available when it comes to exporting your table data. This section lists the different strategies depending on the type of export.
Download and Google Sheet
There are 2 options to export your table data:
The default export keeps the "1 document per sheet row" convention: table field rows will be added as columns.
The table export uses the "1 table field item per sheet row" convention: each table field item will be on its own sheet row, and other fields extracted from the document will be repeated across rows.
Zapier
Send the table directly through Zapier, each row in this table will turn into a row (or item) in the target application (for example, a row in a Google Spreadsheet). Non-table fields in the document are repeated for each row. This is the trigger you will most likely need.
There are other possible triggers you can use for tables with Zapier. Check out this article for more information about using Zapier.
Webhooks
There are 3 options to export your table fields via webhooks:
Document processed: this is the default option. JSON payload is a deep key/value object with table fields' values sent as an array (equivalent to the JSON screen capture at the beginning of this article)
Document processed (flattened): JSON payload is a flat key/value object with keys named as in the default table export (e.g. "Items[1].Title")
Table item processed: JSON payload is an array of table item objects where other fields are repeated across table items (similar to the table export for Google Sheets).
Check out this article for more information about using Webhooks.
Microsoft Power Automate
Check out this article for more information about using MS Power Automate.