Parseur supports extracting text from text-based PDF attachments and other plain-text documents.
How does PDF conversion work?
There are a couple of limitations when working with PDFs:
- PDFs need to be text-based (i.e. not scanned)
- PDFs need to be without password protection
Parseur does not support parsing scanned PDF documents (i.e. Parseur doesn't do OCR). If this feature is important to you, let us know by upvoting the feature request here.
Each PDF is converted to a plain-text document.
By default, Parseur preserves the layout of the document. You can change that setting (see below).
Tips for parsing PDFs and other plain-text documents with layout
To preserve the layout, converted plain-text documents use space characters to separate different blocks on the same line. The number of spaces can vary from one document to the next.
When creating fields in a template from PDFs and plain-text documents with layout, it is recommended to include some of the spaces surrounding the text you want to capture.
This makes Parseur more reliable when the number of spaces around blocks of text changes, because Parseur uses delimiters around fields to locate them in a document (see that article for more information about how Parseur works).
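To illustrate why the surrounding spaces matter, here is a minimal Python sketch. It is not Parseur's implementation: the sample lines, the labels, and the `extract_between` helper are all hypothetical, and trimming the captured value is an assumption made for the example. It locates a value between two delimiters and compares a tightly selected field (the spaces become part of the delimiters) with a field that includes its surrounding spaces (the delimiters are only the neighbouring labels).

```python
import re

# Two hypothetical converted lines: same blocks, different column spacing.
doc_a = "Invoice No:" + " " * 3 + "12345" + " " * 6 + "Date:" + " " * 3 + "2021-03-04"
doc_b = "Invoice No:" + " " * 7 + "12345" + " " * 2 + "Date: 2021-03-04"


def extract_between(text, left, right):
    """Return the text found between two literal delimiters, or None."""
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Tight field: the exact spacing seen in doc_a is baked into the delimiters.
tight_left = "Invoice No:" + " " * 3
tight_right = " " * 6 + "Date:"

# Field with surrounding spaces: the delimiters are only the labels.
loose_left = "Invoice No:"
loose_right = "Date:"

tight_a = extract_between(doc_a, tight_left, tight_right)
tight_b = extract_between(doc_b, tight_left, tight_right)

# Assumption for this sketch: the captured value is trimmed afterwards.
loose_a = extract_between(doc_a, loose_left, loose_right).strip()
loose_b = extract_between(doc_b, loose_left, loose_right).strip()

print(tight_a, tight_b, loose_a, loose_b)
```

In this sketch the tight delimiters only match the document whose spacing they were built from, while the label-only delimiters find the value in both documents.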
How do I switch to basic PDF-to-text conversion (no layout)?
By default, Parseur will try to preserve the layout of the document. This is the best option in most situations.
If you want Parseur to extract only the text, without the layout:
- Go to your mailbox settings
- Change the PDF conversion format to Convert to text (basic)
- Click Save to update the settings
After changing the conversion format, you need to send the PDFs again so that they are reconverted in the new format.
What are those <!--psr-to TT123--> symbols in my plain-text templates about?
Parseur uses markers internally in the form of <!--TT psr-123 --> to locate the fields in a template. You can safely ignore those markers while working on your template.
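If you copy template text into another tool and want it without the markers, a short Python sketch like the one below can strip them. This is only an illustration: the regular expression is an assumption based on the two marker forms shown in this article, and the `strip_markers` helper and the sample line are hypothetical.

```python
import re

# Assumed pattern: an HTML-style comment whose body mentions "psr".
MARKER_RE = re.compile(r"<!--[^>]*psr[^>]*-->")


def strip_markers(text):
    """Remove comment-style markers from copied template text."""
    return MARKER_RE.sub("", text)


# Hypothetical line copied from a plain-text template.
sample = "Total: <!--TT psr-123 -->42.00 EUR"
cleaned = strip_markers(sample)
print(cleaned)
```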