When you use an email and document parser, the raw data you get is not always in the format you need. This article describes how to normalize parsed data in Parseur to get consistent, structured results.
Why normalize parsed data?
Often when extracting data from documents, text extraction is not enough. The raw data extracted from your documents may contain:
- extra spaces,
- extra new lines,
- formatting code such as HTML code,
- different formats for dates, numbers, names and addresses,
- different encoding standards.
You need to perform some normalization in order to "clean" the data and turn it into a structured data set.
By default, Parseur takes care of all this automatically: it removes formatting code and extra spaces, streamlines text encoding, and so on. All this makes Parseur "just work" out of the box.
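To make these cleanup steps concrete, here is a minimal Python sketch of comparable normalization done by hand. It is an illustration only and not Parseur's implementation: `clean_text` is a hypothetical helper, the tag-stripping regex is deliberately crude, and using Unicode NFKC as the "streamlined" encoding is an assumption.

```python
import html
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Hypothetical helper: strip markup, unify characters, tidy whitespace."""
    # Replace anything that looks like an HTML tag with a space (crude, sketch only).
    text = re.sub(r"<[^>]+>", " ", raw)
    # Decode HTML entities such as &amp; or &nbsp;.
    text = html.unescape(text)
    # Fold equivalent Unicode forms together (e.g. non-breaking space -> space).
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse blank lines and trim whitespace around each line break.
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_text("<p>Total:&nbsp; 42 </p>\n\n\n<b>Paid</b>  "))
```

Each step maps to one item in the list above: markup, encoding differences, extra spaces, and extra new lines.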
Sometimes, you need to go the extra mile and normalize data according to its type, for example to properly format dates, times or addresses.
Here too, Parseur has your back. Let's see how!
Normalize data using Field formats
Assign an output format to a field to control how its value is rendered. A format tells Parseur which kind of data a particular field contains and how to normalize it.
To assign an output format to a field in Parseur:
- Go to the Template editor
- Select the piece of text you want to extract and create a field
- Click on the Edit button next to the field name
- Select the format from the Output Format drop-down menu
- Click Update to close the field settings
- Once you have made all your changes, save the template
For some output formats, you can also choose a related input format. Unlike output formats, which are global for a mailbox, input formats are specific to a template.
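One way to picture the difference between the two is with dates: the input format describes how to read the text in a given document layout, while the output format describes how the result is written. The Python sketch below shows that idea using the standard `datetime` module. It does not use Parseur's API; `normalize_date` and the format strings are assumptions chosen for illustration.

```python
from datetime import datetime


def normalize_date(raw: str, input_format: str, output_format: str = "%Y-%m-%d") -> str:
    """Hypothetical helper: read a date with one format, write it with another."""
    # The input format says how to interpret the raw text.
    parsed = datetime.strptime(raw.strip(), input_format)
    # The output format says how every result should look.
    return parsed.strftime(output_format)


if __name__ == "__main__":
    # The same text means different days depending on the input format.
    print(normalize_date("03/04/2024", "%d/%m/%Y"))  # day first
    print(normalize_date("03/04/2024", "%m/%d/%Y"))  # month first
```

Here the output format stays the same for every call, while the input format changes with the layout of the source text, which mirrors the mailbox-wide versus per-template split described above.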
List of available formats
Parseur offers several formats to convert your parsed data into a consistent format.
- Text based formats: read more about formatting text
- Date and time formats: read more about converting and formatting dates and times
- Number format: read more about formatting numbers
- Full name format: read more about extracting a person's title, first, middle and last name
- Address format: read more about converting a location address into street, city, state, zip etc.
- Table format: read more about extracting tabular and repetitive data
- Linked Document format: read more about downloading and parsing a document behind a link