This article details the pros and cons of each parsing engine available in Parseur and helps you choose which one to use.
3 parsing engines
Parseur has three parsing engines available to you:
an AI-based engine that works with all types of documents
a template-based OCR engine that is specifically made for parsing PDFs
a template-based text engine designed for parsing emails and text documents, although it actually works for all types of documents
You can check which engines are available by looking at the upper right corner of the screen when viewing a document:
✅ means the engine is supported for this type of document
❌ means the engine is not supported for this type of document
⏸️ means the engine is not enabled in this mailbox. You can enable it in the mailbox settings.
In the example below, we have an email sitting in a Parseur mailbox with AI disabled, so only the text parsing engine is available:
You can hover over each badge to get more information.
AI-based parsing engine
How it works
For this engine to work, it needs to be enabled at mailbox creation or in your mailbox settings
With the AI engine, use field names to describe the data you want to extract
The AI will take each field name and try to find and extract the most appropriate related piece of text in the document, if any
Pros of the AI engine
It supports parsing all types of documents: PDF, email, HTML, and other text documents
It can extract data from any kind of layout
It is very easy to use: simply name the fields you want to extract
It can extract data from complex table layouts
It can extract data from emails forwarded from different email clients
Cons of the AI engine
The document size is currently limited to around 10 pages. The actual limit can be much higher or much lower, depending on text density and language. In particular, expect fewer pages to be supported when parsing documents written in a non-Latin alphabet.
Although the number of fields isn't limited, the quality of the results tends to go down as more fields are added to the list
Although all languages are supported, the AI performs best on English documents
Finding the best name for a field can take some trial and error
Because AI is probabilistic by nature, it is impossible to guarantee that every document will be parsed correctly every time
When a document isn't parsed correctly, there is no debugger to tell you what went wrong
There is currently no way to tell if a field is mandatory
Use the AI engine if...
the AI gives you correct results based on the field names you created
your documents come in many different layouts
you don't need to parse long documents
you don't necessarily need 100% accuracy for all of your documents, or you have a process in place to monitor the quality of the extracted data. Monitoring quality can be done either manually (e.g. random spot checks, human-in-the-loop) or automated with business logic rules (in Parseur's post-processing module or down the line in another tool)
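The automated monitoring mentioned above can be as simple as a few business rules applied to the extracted fields. The following is a minimal sketch in plain Python, assuming the parsed result reaches your script as a dictionary of strings. The function name `validate`, the field names, and the patterns are all hypothetical and are not part of any Parseur API.

```python
import re

def validate(parsed, required, patterns):
    """Return a list of problems found; an empty list means the document passed."""
    problems = []
    # Rule 1: fields your process treats as mandatory must be present and non-empty.
    for name in required:
        if not parsed.get(name):
            problems.append(f"missing required field: {name}")
    # Rule 2: fields that are present must match the format you expect.
    for name, pattern in patterns.items():
        value = parsed.get(name)
        if value and not re.fullmatch(pattern, value):
            problems.append(f"unexpected format for {name}: {value!r}")
    return problems

# Hypothetical extraction result: the total contains the letter "O" instead of a zero,
# and the due date came back empty.
sample = {"invoice_number": "INV-1042", "total": "12O.50", "due_date": ""}

issues = validate(
    sample,
    required=["invoice_number", "total", "due_date"],
    patterns={"invoice_number": r"INV-\d+", "total": r"\d+\.\d{2}"},
)
for issue in issues:
    print(issue)
```

Documents that return a non-empty list can be routed to a manual review queue, which combines the automated and human-in-the-loop approaches.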
Tips and more information
Check out our article about using the AI parsing engine
OCR template engine for PDFs
How it works
Create one or more templates based on sample PDFs
To create a template, draw rectangles over the pieces of text you want to extract
Create one template per document layout
When there is more than one template in a mailbox, Parseur will automatically pick the best one
Pros of the OCR engine
It can extract data from documents of any number of pages
It can extract data from as many fields as you require
It can extract data from documents in any language and alphabet
You can define for each field whether it is mandatory or optional
You can use labels to extract fields that move horizontally or vertically in the document
You can use labels to extract tables with a varying number of rows
You can add several samples to the same template to test field positioning on a variety of samples
Template parsing is deterministic. When something doesn't work, you can use the debugger to see why Parseur couldn't apply the template to a specific document
Cons of the OCR engine
You need to create one template per document layout
It only supports PDFs
It only supports extracting data from simple tables with regular columns and rows
Use the OCR template engine if...
you need a high level of data quality and accuracy
you need to extract data from a reasonable number of different PDF layouts
the tables you want to extract data from are simple or, if they are complex, you are able to rework the data down the line using post-processing or an external tool
Tips and more information
How to use labels and dynamic OCR
How to extract tables from PDFs
Text template engine for emails and text documents
How it works
Create one or more templates based on sample emails or text documents
Select the pieces of text you want to extract
Create one template per document layout
When there is more than one template in a mailbox, Parseur will automatically pick the best one
Pros of the text engine
It can extract data from documents of any number of pages
It can extract data from as many fields as you require
It can extract data from documents in any language and alphabet
It can extract data from tables, including complex ones
Template parsing is deterministic. When something doesn't work, you can use the debugger to see why Parseur couldn't apply the template to a specific document
Cons of the text engine
You need to create one template per document layout
All fields in a text template are mandatory. If you have optional fields, you will need to create several templates
When parsing emails, the text engine is sensitive to which email client was used to forward them. It is therefore very important to send your emails to Parseur using automated forward rules.
Extracting data from complex tables is possible but may require you to play with regular expressions for row and cell separators
When adding fields, Parseur will automatically determine the starting and stopping delimiters to locate that field in the document. While that works in most situations, it sometimes needs adjustment, especially if the text around a field changes from one document to the next. See the tips section below for more information.
You cannot add several samples to a template to test it. To test a template, you can use the debugger or reprocess documents
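To illustrate the ideas of starting and stopping delimiters and of row and cell separators mentioned in the list above, here is a small Python sketch. It is not Parseur's implementation. The helpers `extract_between` and `split_table`, the sample email, and the separator patterns are assumptions chosen for illustration only.

```python
import re
from typing import Optional

def extract_between(text: str, start: str, stop: str) -> Optional[str]:
    """Return the text between the first `start` and the next `stop`, or None if not found."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(stop)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def split_table(block: str, row_sep: str = r"\n+", cell_sep: str = r"\s{2,}|\t"):
    """Split a text block into rows, then each row into cells, using regex separators."""
    rows = [row for row in re.split(row_sep, block.strip()) if row.strip()]
    return [re.split(cell_sep, row.strip()) for row in rows]

email = (
    "Order number: A-77\n"
    "Customer: Jane Doe\n"
    "Items:\n"
    "Widget  2  9.99\n"
    "Gadget  1  24.50\n"
    "Total: 44.48\n"
)

order_number = extract_between(email, "Order number:", "\n")
items = split_table(extract_between(email, "Items:", "Total:"))

# If the text around the field changes, the delimiter no longer matches.
missing = extract_between(email, "Order no:", "\n")

print(order_number)
print(items)
print(missing)
```

The last call shows why delimiters sometimes need adjusting: when the label before a field differs from one document to the next, a delimiter tied to the exact label finds nothing.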
Use the text template engine if...
you need a high level of data quality and accuracy
you need to extract data from a reasonable number of different email layouts
you are able to send the emails directly to Parseur or set up automated forward rules
Tips and more information
How to extract tables from emails
Our recommendation: use all of them at once!
Parseur's little secret weapon is that it lets you use all three parsing engines inside the same mailbox!
How Parseur prioritizes engines
Parseur will first try to find a matching text template.
If none are found, it will search for a matching OCR template.
If none are found, it will use the AI engine.
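The priority order described above is a fallback chain: each engine is tried in turn, and the first one that produces a result wins. The sketch below models that idea in Python. It is only an illustration of the ordering, not Parseur's code. The three engine functions are hypothetical stand-ins that return a dictionary when they match and `None` otherwise.

```python
from typing import Callable, Optional

Result = dict
Engine = Callable[[str], Optional[Result]]

def text_template(doc: str) -> Optional[Result]:
    # Stand-in: "matches" only when the document contains the label it was built on.
    if "Order:" in doc:
        return {"order": doc.split("Order:")[1].strip()}
    return None

def ocr_template(doc: str) -> Optional[Result]:
    # Stand-in: this sample mailbox has no matching OCR template.
    return None

def ai_engine(doc: str) -> Optional[Result]:
    # Stand-in: the AI engine attempts every document it receives.
    return {"text": doc.strip()}

# Order matters: text templates first, then OCR templates, then AI.
ENGINES = [("text", text_template), ("ocr", ocr_template), ("ai", ai_engine)]

def parse_with_fallback(doc: str, engines=ENGINES):
    """Try each engine in priority order and return (engine name, result) for the first match."""
    for name, engine in engines:
        result = engine(doc)
        if result is not None:
            return name, result
    return None, None

print(parse_with_fallback("Order: 42"))
print(parse_with_fallback("A free-form note"))
```

With this ordering, deterministic templates handle the layouts you have already set up, and the AI engine only picks up documents that no template matches.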