When using a mail and document parser, the raw data you get is not always in the format you need. This article describes how you can normalize parsed data in Parseur to get consistent and structured results.

Why normalize parsed data?

When extracting data from documents, plain text extraction is often not enough. The raw data extracted from your documents may contain:

  • extra spaces,
  • extra new lines,
  • formatting code such as HTML code,
  • different formats of dates, numbers, names, addresses,
  • different encoding standards,
  • etc.

You need to perform some normalization in order to "clean" the data and turn it into a structured data set.
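
As a rough illustration of what this kind of cleanup involves when done by hand, here is a minimal Python sketch. It is not Parseur's implementation; the function name `clean_text` and the sample string are made up for this example, and the tag removal is deliberately naive.

```python
import html
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Strip simple markup and tidy whitespace in a raw extracted string."""
    text = re.sub(r"<[^>]+>", " ", raw)         # drop HTML tags (naive, for simple snippets only)
    text = html.unescape(text)                  # decode entities such as &amp; or &nbsp;
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants, e.g. non-breaking spaces
    text = re.sub(r"\s+", " ", text)            # collapse runs of spaces and new lines
    return text.strip()


print(clean_text("  <b>Order&nbsp;#123</b>\n\n  shipped  "))  # Order #123 shipped
```

Even this small example has to handle markup, entities, encoding variants and whitespace separately, which is why having it done for you is convenient.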

By default, Parseur takes care of all this automatically: formatting code and extra spaces are removed, text encoding is streamlined, etc. All this makes Parseur "just work" out of the box.

Sometimes, you need to go the extra mile and use data normalization based on the kind of data, for example to properly format dates, times or addresses.

Here too, Parseur has your back. Let's see how!

Normalize data using Field formats

Assign an output format to a field to control how its value is normalized. A format tells Parseur which kind of data a particular field contains and how to normalize it.
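
Conceptually, a field format pairs a field with a rule for normalizing its value. The Python sketch below illustrates that idea only; it does not use Parseur's API, and the field names (`total`, `customer`, `note`), the helper functions and the `FIELD_FORMATS` mapping are all hypothetical.

```python
from decimal import Decimal


def to_number(value: str) -> Decimal:
    """Convert '1,234.50' to a Decimal (assumes ',' is a thousands separator)."""
    return Decimal(value.replace(",", "").strip())


def to_name(value: str) -> str:
    """Collapse extra spaces and capitalize each word."""
    return " ".join(value.split()).title()


# Hypothetical mapping from field name to the normalizer for its format.
FIELD_FORMATS = {"total": to_number, "customer": to_name}


def normalize(parsed: dict) -> dict:
    """Apply each field's normalizer; fields without a format are only trimmed."""
    return {name: FIELD_FORMATS.get(name, str.strip)(value)
            for name, value in parsed.items()}


result = normalize({"total": " 1,234.50 ", "customer": "  jane   DOE ", "note": " ok "})
print(result)
```

In this sketch, fields without an assigned format fall back to simple trimming, similar in spirit to a default text format.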

To assign an output format to a field in Parseur:

  1. Go to the Template editor
  2. Select the piece of text you want to extract and create a field
  3. Click on the Edit button next to the field name
  4. Select the format from the Output Format drop-down menu
  5. Click Update to close the field settings
  6. Once you have made all your changes, save the template

For some output formats, you can also choose a related input format. Unlike output formats, which are global for a mailbox, input formats are specific to a template.
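
To see why an input format can matter, consider a date such as 03/04/2021, which reads as 3 April or as 4 March depending on the source. The Python sketch below is an assumption-laden illustration, not Parseur's behavior: the `day_first` and `month_first` labels, the `INPUT_FORMATS` table and the choice of ISO 8601 as the output are all hypothetical.

```python
from datetime import datetime

# Hypothetical input formats: how to read an ambiguous date string.
INPUT_FORMATS = {"day_first": "%d/%m/%Y", "month_first": "%m/%d/%Y"}


def normalize_date(raw: str, input_format: str) -> str:
    """Parse `raw` using the chosen input format and return an ISO 8601 date."""
    parsed = datetime.strptime(raw.strip(), INPUT_FORMATS[input_format])
    return parsed.date().isoformat()


print(normalize_date("03/04/2021", "day_first"))    # 2021-04-03
print(normalize_date("03/04/2021", "month_first"))  # 2021-03-04
```

The same raw text yields two different dates, so the interpretation has to follow the document layout while the output stays the same everywhere.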

List of available formats

Parseur offers several formats to convert your parsed data into a consistent format.
