AI vs template parsing: pros and cons

Understanding the differences, benefits and disadvantages between the AI parsing engine and the text and OCR template engines.

This article details the pros and cons of each parsing engine available in Parseur and guides you through which one to use.

3 parsing engines

Parseur has three parsing engines available to you:

  • an AI-based engine that works with all types of documents

  • a template-based OCR engine that is specifically made for parsing PDFs

  • a template-based text engine made for parsing emails and text documents, although it also works with all other types of documents.

You can check which engines are available by looking at the upper right corner of the screen when viewing a document:

  • ✅ means the engine is supported for this type of document

  • ❌ means the engine is not supported for this type of document

  • ⏸️ means the engine is not enabled in this mailbox. You can enable it in the mailbox settings.

In the example below, we have an email sitting in a Parseur mailbox with AI disabled, so only the text parsing engine is available:

You can hover over each badge to get more information.

AI-based parsing engine

How it works

  1. For this engine to work, it needs to be enabled at mailbox creation or in your mailbox settings

  2. With the AI engine, use field names to describe the data you want to extract

  3. The AI will take each field name and try to find and extract the most appropriate related piece of text in the document, if any

Pros of the AI engine

  • It supports parsing all types of documents: PDF, email, HTML, and other text documents

  • It can extract data from any kind of layout

  • It is very easy to use: simply name the fields you want to extract

  • It can extract data from complex table layouts

  • It can extract data from emails forwarded from different email clients

Cons of the AI engine

  • The document size is currently limited to around 10 pages. The actual limit can be much higher or much lower, depending on the density of the text and its language. In particular, expect fewer pages to be supported if your documents use a non-Latin alphabet.

  • Although the number of fields isn't limited, the quality of the results tends to go down as more fields are added to the list

  • Although all languages are supported, the AI performs best on English documents

  • Finding the best name for a field can take some trial and error

  • Because AI is probabilistic by nature, there is no way to guarantee that every document will be parsed correctly every time

  • When a document isn't parsed correctly, there is no debugger to tell you what went wrong

  • There is currently no way to mark a field as mandatory

Use the AI engine if...

  • the AI gives you correct results based on the field names you created

  • your documents come in many different layouts

  • you don't need to parse long documents

  • you don't necessarily need 100% accuracy for all of your documents, or you have a process in place to monitor the quality of the extracted data. Monitoring quality can be done either manually (e.g. random spot checks, human-in-the-loop) or automated with business logic rules (in Parseur's post-processing module or down the line in another tool)
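
The automated option can be as simple as a handful of rules run on each parsed result. Here is a minimal sketch in Python, assuming the extracted data arrives as a dictionary of strings. The field names (invoice_number, total, date) and the check_invoice helper are hypothetical and only serve the illustration; they are not part of Parseur.

```python
import re
from datetime import date
from typing import Any


def check_invoice(parsed: dict[str, Any]) -> list[str]:
    """Return the list of problems found in one parsed document (empty if none)."""
    problems: list[str] = []

    # Rule 1: fields that our own process treats as mandatory must be non-empty.
    for name in ("invoice_number", "total", "date"):
        value = parsed.get(name)
        if value is None or not str(value).strip():
            problems.append(f"missing field: {name}")

    # Rule 2: the total must be a plain amount such as "120" or "120.50".
    total = str(parsed.get("total") or "").strip()
    if total and not re.fullmatch(r"\d+(\.\d{1,2})?", total):
        problems.append(f"total is not a plain amount: {total!r}")

    # Rule 3: the date must be a valid ISO date such as "2024-03-01".
    raw_date = str(parsed.get("date") or "").strip()
    if raw_date:
        try:
            date.fromisoformat(raw_date)
        except ValueError:
            problems.append(f"date is not a valid ISO date: {raw_date!r}")

    return problems


# Hypothetical parsed result, as it might come out of any engine.
sample = {"invoice_number": "INV-1", "total": "120.50", "date": "2024-03-01"}
issues = check_invoice(sample)
print("OK" if not issues else issues)
```

Documents that come back with a non-empty list of problems can then be routed to a person for review, which combines the automated rules with a human-in-the-loop step.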

Tips and more information

OCR template engine for PDFs

How it works

  1. Create one or more templates based on sample PDFs

  2. To create a template, draw rectangles over the pieces of text you want to extract

  3. Create one template per document layout

  4. When there is more than one template in a mailbox, Parseur will automatically pick the best one

Pros of the OCR engine

  • It can extract data from documents of any number of pages

  • It can extract data from as many fields as you require

  • It can extract data from documents in any language and alphabet

  • You can define for each field whether it is mandatory or optional

  • You can use labels to extract fields that move horizontally or vertically in the document

  • You can use labels to extract tables with a varying number of rows

  • You can add several samples to the same template to test field positioning on a variety of samples

  • Template parsing is deterministic. When something doesn't work, you can use the debugger to see why Parseur couldn't apply the template to a specific document

Cons of the OCR engine

  • You need to create one template per document layout

  • It only supports PDFs

  • It only supports extracting data from simple tables with regular columns and rows

Use the OCR template engine if...

  • you need a high level of data quality and accuracy

  • you need to extract data from a reasonable number of different PDF layouts

  • the tables you want to extract data from are simple, or if they are complex, you are able to rework the data down the line using post-processing or an external tool

Tips and more information

Text template engine for emails and text documents

How it works

  1. Create one or more templates based on sample emails or text documents

  2. Select the pieces of text you want to extract

  3. Create one template per document layout

  4. When there is more than one template in a mailbox, Parseur will automatically pick the best one

Pros of the text engine

  • It can extract data from documents of any number of pages

  • It can extract data from as many fields as you require

  • It can extract data from documents in any language and alphabet

  • It can extract data from tables, including complex ones

  • Template parsing is deterministic. When something doesn't work, you can use the debugger to see why Parseur couldn't apply the template to a specific document

Cons of the text engine

  • You need to create one template per document layout

  • All fields in a text template are mandatory. If you have optional fields, you will need to create several templates

  • When parsing emails, the text engine is sensitive to which email client was used to forward them. It is therefore very important to send your emails to Parseur using automated forward rules.

  • Extracting data from complex tables is possible but may require you to play with regular expressions for row and cell separators

  • When adding fields, Parseur will automatically determine the starting and stopping delimiters to locate that field in the document. While that works in most situations, it sometimes needs adjustment, especially if the text around a field changes from one document to the next. See the tips section below for more information.

  • You cannot add several samples to a template to test it. To test a template, you can use the debugger or reprocess documents
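
To make the ideas of delimiters and separators more concrete, here is a rough sketch in Python of delimiter-based extraction and of splitting a table with regular expressions. It illustrates the general technique only and is not Parseur's actual implementation. The helper names (extract_between, split_table) and the sample email are made up for the example.

```python
import re
from typing import Optional


def extract_between(text: str, start: str, stop: str) -> Optional[str]:
    """Return the text between the first `start` and the next `stop`, or None."""
    # The delimiters are literal text, so they are escaped before use.
    pattern = re.escape(start) + r"(.*?)" + re.escape(stop)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def split_table(block: str, row_sep: str, cell_sep: str) -> list[list[str]]:
    """Split a block of text into rows, then cells, using regex separators."""
    rows = [row for row in re.split(row_sep, block.strip()) if row.strip()]
    return [[cell.strip() for cell in re.split(cell_sep, row.strip())] for row in rows]


email = "Order: 1234\nCustomer: Jane Doe\nTotal: 44.48\n"

# Works while the text before the field stays the same...
print(extract_between(email, "Customer:", "\n"))
# ...but finds nothing if another sender writes "Client:" instead.
print(extract_between(email.replace("Customer:", "Client:"), "Customer:", "\n"))

# Rows are separated by line breaks, cells by a pipe with optional spaces.
table = "Widget | 2 | 9.99\nGadget | 1 | 24.50"
print(split_table(table, r"\n+", r"\s*\|\s*"))
```

The second call shows why delimiters sometimes need adjusting: as soon as the text surrounding a field changes from one document to the next, a delimiter that is too specific stops matching.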

Use the text template engine if...

  • you need a high level of data quality and accuracy

  • you need to extract data from a reasonable number of different email layouts

  • you are able to send the emails directly to Parseur or set up automated forward rules

Tips and more information

Our recommendation: use all of them at once!

Parseur's little secret weapon is that it lets you use all three parsing engines inside the same mailbox!

How Parseur prioritizes engines

  1. Parseur will first try to find a matching text template.

  2. If none are found, it will search for a matching OCR template.

  3. If none are found, it will fall back to the AI engine.
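
The priority order above is a classic fallback chain. The sketch below shows the pattern in Python under simplifying assumptions: each engine is modelled as a function that returns the extracted data, or None when it has no matching template. The three toy engines and the parse_with_fallback helper are hypothetical stand-ins, not Parseur code.

```python
from typing import Callable, Optional

# An engine takes the document text and returns data, or None if nothing matched.
Engine = Callable[[str], Optional[dict]]


def text_template(document: str) -> Optional[dict]:
    # Toy rule: this "template" only matches documents starting with "Order:".
    if document.startswith("Order:"):
        return {"order": document.split(":", 1)[1].strip()}
    return None


def ocr_template(document: str) -> Optional[dict]:
    # No OCR template matches anything in this toy example.
    return None


def ai_engine(document: str) -> Optional[dict]:
    # Last resort: always returns something.
    return {"summary": document[:20]}


# Engines listed in priority order: text template, OCR template, then AI.
ENGINES: list[tuple[str, Engine]] = [
    ("text template", text_template),
    ("OCR template", ocr_template),
    ("AI", ai_engine),
]


def parse_with_fallback(
    document: str, engines: list[tuple[str, Engine]]
) -> tuple[Optional[str], Optional[dict]]:
    """Try each engine in order and return the first one that produces a result."""
    for name, engine in engines:
        result = engine(document)
        if result is not None:
            return name, result
    return None, None


print(parse_with_fallback("Order: 42", ENGINES))
print(parse_with_fallback("Hello, please find my invoice attached", ENGINES))
```

In this model, adding a template for a layout that the AI handles poorly is enough to take that layout over, since templates are tried before the AI engine.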
