When using a mail and document parser, the raw data you get is not always in the format you need. This article describes how you can normalize parsed data in Parseur to get consistent and structured results.
Why normalize parsed data?
Often when extracting data from documents, text extraction is not enough. The raw data extracted from your documents may contain:
extra spaces,
extra new lines,
formatting code such as HTML code,
different formats of dates, numbers, names, addresses,
different encoding standards,
etc.
You need to perform some normalization in order to "clean" the data and turn it into a structured data set.
Parseur takes care of all this by default: it removes formatting code and extra spaces, harmonizes text encoding, and so on. This is what makes Parseur "just work" out of the box.
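To make this kind of cleanup concrete, here is a minimal Python sketch of what generic text normalization can look like. It is an illustration only, not Parseur's actual implementation: the function name `clean_text` and the exact steps are assumptions chosen for the example.

```python
import html
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Illustrative cleanup only (hypothetical helper, not Parseur's code)."""
    text = re.sub(r"<[^>]+>", " ", raw)         # drop HTML tags
    text = html.unescape(text)                  # decode entities such as &nbsp;
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = re.sub(r"\s+", " ", text)            # collapse spaces and new lines
    return text.strip()


print(clean_text("<p>  Total:&nbsp;<b>42</b>\n\n EUR </p>"))  # Total: 42 EUR
```

With Parseur you do not need to write anything like this yourself; the sketch only shows the type of work that happens behind the scenes.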
Sometimes, you need to go the extra mile and use data normalization based on the kind of data, for example, to properly format dates, times or addresses.
Here too, Parseur has your back. Let's see how!
Normalize data using Field formats
To control how Parseur normalizes a field's value, assign an output format to the field. The format tells Parseur which kind of data the field contains and how to present it.
To assign an output format to a field in Parseur:
Go to the Template editor
Select the piece of text you want to extract and create a field
Click on the Edit button next to the field name
Select the format from the Output Format drop-down menu
Click Update to close the field settings
Once you have made all your changes, save the template
For some output formats, you can also choose a related input format. Unlike output formats, which are global to a mailbox, input formats are specific to a template.
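The difference between an input format and an output format is easiest to see with dates. The Python sketch below is a conceptual illustration under stated assumptions, not Parseur's API: `normalize_date` is a hypothetical helper, and the format strings use Python's `strptime` syntax rather than Parseur's own settings. The input format says how to read the raw text, and the output format says how to write the result.

```python
from datetime import datetime


def normalize_date(raw: str, input_format: str, output_format: str = "%Y-%m-%d") -> str:
    """Hypothetical helper: parse `raw` with input_format, emit output_format."""
    parsed = datetime.strptime(raw.strip(), input_format)
    return parsed.strftime(output_format)


# The same raw text means different dates depending on the input format.
print(normalize_date("03/04/2024", "%d/%m/%Y"))  # 2024-04-03 (day first)
print(normalize_date("03/04/2024", "%m/%d/%Y"))  # 2024-03-04 (month first)
```

This is why it makes sense for an input format to belong to a template: two document layouts may write the same date differently, while the output stays consistent across the mailbox.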
List of available formats
Parseur offers several formats to convert your parsed data into a consistent format.
Text-based formats: read more about formatting text
Date and time formats: read more about converting and formatting dates and times
Number format: read more about formatting numbers
Full name format: read more about extracting a person's title, first, middle and last name
Linked Document format: read more about downloading and parsing a document behind a link
Table format: read more about extracting tables from emails and text documents and extracting tables from PDFs
Note: A Table field inside another Table field isn't currently supported.