Parseur supports extracting text from text-based PDF attachments and other plain-text documents.

How does PDF conversion work?

Note: we are currently beta testing a powerful new version of our parsing engine for PDFs that also works with scanned PDFs (read our announcement). Contact us on the chat to be included in the beta!

There are a couple of limitations when working with PDFs:

  • PDFs need to be text-based (i.e. not scanned)

  • PDFs need to be without password protection

Parseur does not support parsing scanned PDF documents (i.e. Parseur doesn't do OCR). If this feature is important to you, let us know by upvoting the feature request here.

PDFs will be converted to a plain-text document.

By default, Parseur will preserve the layout of the document. You can change that setting (see below).

Tips for parsing PDFs and other plain-text documents with layout

To preserve the layout, converted plain-text documents use space characters to separate different blocks on the same line. The number of spaces can vary from one document to the next.

When creating fields in a template from PDFs and plain-text documents with layout, it is recommended to include some of the spaces surrounding the text you want to capture.

This makes Parseur more reliable when the number of spaces around blocks of text changes, because Parseur uses delimiters around fields to locate them in a document (see that article for more information about how Parseur works).
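To illustrate why capturing the surrounding spaces helps, here is a minimal Python sketch. It is not Parseur's actual implementation: the extract_between helper, the sample lines, and the "Total:" and "Due:" labels are all hypothetical, and the sketch only assumes that a field is located by the literal text found just before and just after it.

```python
import re
from typing import Optional


def extract_between(line: str, before: str, after: str) -> Optional[str]:
    """Return the text found between two literal delimiters, trimmed of spaces.

    Hypothetical helper for illustration only; returns None if the
    delimiters are not found in the line.
    """
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    match = re.search(pattern, line)
    return match.group(1).strip() if match else None


# Two converted documents with the same blocks but different spacing.
sample = "Total:" + " " * 4 + "149.00" + " " * 6 + "Due: 2024-02-01"
other = "Total:" + " " * 2 + "1249.00" + " " * 4 + "Due: 2024-02-01"

# Field drawn tightly around the value: the delimiters end up containing
# the exact number of spaces seen in the sample document.
tight_before, tight_after = "Total:" + " " * 4, " " * 6 + "Due:"

# Field drawn to include the surrounding spaces: the delimiters are
# just the neighbouring labels, with no spaces in them.
loose_before, loose_after = "Total:", "Due:"

print(extract_between(sample, tight_before, tight_after))  # 149.00
print(extract_between(other, tight_before, tight_after))   # None
print(extract_between(sample, loose_before, loose_after))  # 149.00
print(extract_between(other, loose_before, loose_after))   # 1249.00
```

With the tight field, the delimiters depend on an exact number of spaces and stop matching as soon as the spacing changes. With the wider field, the delimiters are stable text, and the extra spaces inside the captured value are simply trimmed.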

What are those <!--psr-to TT123--> symbols in my plain-text templates about?

Parseur uses markers internally in the form of <!--TT psr-123 --> to locate the fields in a template. You can safely ignore those markers while working on your template.
