Parseur can extract text from most text-based documents, such as PDF, CSV and Microsoft Word (docx) files.
Note: The rest of this article assumes you are familiar with Parseur basics. Click here to get started if not.
How to extract text and parse email attachments?
When choosing the type of mailbox, make sure to choose "emails and attachments".
When you send an email with attachments, Parseur creates a separate document for each attachment. You can then create a template for those attachments, as you would for any document in Parseur.
How to extract text from PDF attachments?
Parseur supports extracting text from PDF documents. PDFs need to be text-based and free of password protection. Parseur does not support parsing scanned PDF documents at this point (i.e. Parseur doesn't do OCR).
Attachments will be converted to a text document. By default, Parseur will preserve the layout of the document.
Tips for parsing PDF documents with layout
To preserve the layout, converted PDF documents use space characters to separate different blocks on the same line. The number of spaces can vary from one document to the next.
Parseur uses delimiters around fields to locate them in a document (see that article for more information about how Parseur works).
When creating fields in PDF documents with layout, it is recommended to capture some of the spaces surrounding the text you want to extract. This makes Parseur more reliable when the number of spaces around blocks of text changes.
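To make the spacing issue concrete, here is a minimal Python sketch. It is not how Parseur works internally; it only illustrates why a fixed number of spaces is an unreliable delimiter. The sample lines and the `split_blocks` helper are hypothetical.

```python
import re


def split_blocks(line: str) -> list[str]:
    """Split a layout-preserved line into blocks separated by two or more spaces."""
    return [block for block in re.split(r" {2,}", line.strip()) if block]


# Hypothetical lines from two converted documents: same blocks, different spacing.
line_a = "Invoice No: 1042        Date: 2024-01-15"
line_b = "Invoice No: 1043   Date: 2024-02-01"

print(split_blocks(line_a))
print(split_blocks(line_b))
```

Both lines split into the same two blocks even though the gap between them differs, which is the kind of tolerance that capturing surrounding spaces in a field is meant to provide.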
Basic PDF to text conversion
By default, Parseur will try to preserve the layout of the document. This is the best option in most situations.
If you want Parseur to only extract text, without the layout, go to your mailbox settings and change the PDF conversion settings.
Select "Convert to text (basic)" to get rid of the PDF layout altogether.
How to parse and consolidate CSV and Excel attachments?
Parseur can automatically combine CSV and Excel files sent by email without even creating a template. Parseur will combine the files based on their column headers.
All you have to do is send your spreadsheets as email attachments to your Parseur mailbox.
Parseur will store the parsed result in the "Sheet" field.
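As a rough picture of what combining files by column headers can mean, the Python sketch below merges two CSV texts into one table. It is an illustration under stated assumptions, not Parseur's actual implementation: the `combine_csv` helper and the sample data are hypothetical, and filling missing columns with empty strings is an assumption, since the article does not say how mismatched headers are handled.

```python
import csv
import io


def combine_csv(texts: list[str]) -> tuple[list[str], list[dict[str, str]]]:
    """Merge several CSV texts into one table, matching values by column header."""
    headers: list[str] = []
    rows: list[dict[str, str]] = []
    for text in texts:
        reader = csv.DictReader(io.StringIO(text))
        # Keep headers in first-seen order across all files.
        for name in reader.fieldnames or []:
            if name not in headers:
                headers.append(name)
        rows.extend(dict(row) for row in reader)
    # Assumption: a column missing from a file is filled with an empty string.
    combined = [{h: (row.get(h) or "") for h in headers} for row in rows]
    return headers, combined


# Hypothetical attachments: same data, different column order and one extra column.
file_a = "sku,qty\nA1,2\nB2,5\n"
file_b = "qty,sku,price\n1,C3,9.99\n"

headers, table = combine_csv([file_a, file_b])
print(headers)
for row in table:
    print(row)
```

Matching on header names rather than column position is what allows files with reordered columns to line up in a single sheet.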
How to access the attachment in its original (binary) format?
The Attachments Extra Field lets you download attachments in their original format.
The Attachments field is not enabled by default. Enable the Attachments field in your Parseur mailbox options, under the Fields > Extra Fields section.
Note: To add the Attachments field to already processed documents, reprocess those documents after enabling the field.
From now on, all documents that you process will contain a new Attachments entry. The Attachments entry is a list of objects, and each attachment object has four properties:
- name: The name of the file
- url: A public link to download the file content (warning: anyone who has the link can access the file directly, without needing a password)
- content_type: The type of content in MIME format. Parseur supports all content types
- size: The size of the attachment, in bytes. Parseur can store files up to 35MB in size. Note that your email provider probably limits attachment size too.
From a technical standpoint, the list of attachments is represented as a JSON array. If you manage attachments with a custom webhook, the result contains that array, with one object per attachment carrying the four properties listed above.
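As a rough illustration of that structure, the Python sketch below reads a sample body and works with the four properties. The body itself is hypothetical: the file names, URLs and sizes are invented, and the exact name and position of the `Attachments` key in a real webhook payload are assumptions to check against your own results.

```python
import json

# Hypothetical webhook body; every value here is invented for illustration.
SAMPLE_BODY = """
{
  "Attachments": [
    {
      "name": "invoice.pdf",
      "url": "https://example.com/files/invoice.pdf",
      "content_type": "application/pdf",
      "size": 48213
    },
    {
      "name": "orders.csv",
      "url": "https://example.com/files/orders.csv",
      "content_type": "text/csv",
      "size": 1024
    }
  ]
}
"""


def names_by_type(attachments: list[dict], content_type: str) -> list[str]:
    """Return the names of the attachments that have the given MIME type."""
    return [a["name"] for a in attachments if a["content_type"] == content_type]


def total_size(attachments: list[dict]) -> int:
    """Return the combined size of all attachments, in bytes."""
    return sum(a["size"] for a in attachments)


payload = json.loads(SAMPLE_BODY)
attachments = payload["Attachments"]  # assumed key name

print(names_by_type(attachments, "application/pdf"))
print(total_size(attachments))
```

Because each url is a public link, a handler like this should avoid logging or forwarding the URLs more widely than necessary.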
How to upload original attachments to your cloud storage or app?
Once the Attachments extra field is enabled (see previous section), you can use the URL with any Zapier connector that supports files (such as Google Drive, Dropbox etc.).
To do so, map the attachment URL to the file field in your Zap.
Zapier will download the attachment and upload it into your favorite app.
How to keep the relationship between emails and attachments?
Sometimes you need to extract text from both the email and its attachments and you want to be able to make a link between those two sets of parsed data.
Parseur processes every email and attachment document independently, but it remembers which email each attachment belongs to. You can expose that link using the Parseur DocumentID and ParentID Extra Fields.
An attachment's ParentID is the same as the DocumentID of the email it was attached to.
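Once both fields are present in your parsed results, the relationship can be rebuilt on your side. The Python sketch below is a minimal example of doing so. The `group_by_email` helper, the sample IDs and the `Subject` and `Total` fields are hypothetical, and it assumes that a document parsed from the email itself has an empty or missing ParentID, which the article does not state.

```python
from collections import defaultdict


def group_by_email(documents: list[dict]) -> dict[str, dict]:
    """Group parsed attachment documents under the email they were attached to."""
    children: dict[str, list[dict]] = defaultdict(list)
    emails: dict[str, dict] = {}
    for doc in documents:
        parent_id = doc.get("ParentID")
        if parent_id:
            # An attachment's ParentID equals the DocumentID of its email.
            children[parent_id].append(doc)
        else:
            # Assumption: documents without a ParentID are the emails themselves.
            emails[doc["DocumentID"]] = doc
    return {
        email_id: {"email": email, "attachments": children.get(email_id, [])}
        for email_id, email in emails.items()
    }


# Hypothetical parsed results: one email with two attachments, one email without.
documents = [
    {"DocumentID": "e1", "ParentID": None, "Subject": "March invoices"},
    {"DocumentID": "a1", "ParentID": "e1", "Total": "120.00"},
    {"DocumentID": "a2", "ParentID": "e1", "Total": "75.50"},
    {"DocumentID": "e2", "ParentID": None, "Subject": "No attachments"},
]

grouped = group_by_email(documents)
for email_id, entry in grouped.items():
    print(email_id, [doc["DocumentID"] for doc in entry["attachments"]])
```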
To enable the DocumentID and ParentID extra fields:
- Open your Parseur mailbox
- Click on the Fields section
- Scroll down to the Extra Fields panel
- Check the DocumentID and ParentID fields
Check out the following article to learn more about using Extra Fields in Parseur.
How to disable attachment parsing?
By default, when you create a new mailbox, Parseur will also parse every email attachment.
If you would like to disable attachment parsing, go to your mailbox settings and uncheck the attachment parsing box.
List of all document formats supported in Parseur
Parseur can extract text from most attachments, as long as they are in a text format.
Here is the list of supported document formats that you can extract text from:
- abw: AbiWord Document
- csv: Comma-Separated Values
- djvu: DjVu Document
- doc: Microsoft Word
- docm: Microsoft Office Open XML with Macros Enabled
- docx: Microsoft Office Open XML
- html: HTML Document
- htm: HTML Document
- lwp: Lotus Word Pro
- md: Markdown Documentation File
- odt: ODF Text Document
- pages: Pages Document
- pages.zip: Zipped Pages Document
- pdf: Portable Document Format (text-based only, not scanned)
- rst: reStructuredText
- rtf: Rich Text Format
- sdw: StarWriter 5.0
- tex: LaTeX Source Document
- txt: Text Document
- wpd: WordPerfect Document
- wps: Microsoft Works Document
- xls: Microsoft Excel Document
- xlsx: Microsoft Excel Document Open XML
- xlsm: Microsoft Excel Document Open XML with Macros Enabled
- zabw: Compressed AbiWord Document