When using a mail and document parser, the raw data you get is not always in the format you need. This article describes how you can normalize parsed data in Parseur to get consistent and structured results.

Why normalize parsed data?

Often when extracting data from documents, text extraction is not enough. The raw data extracted from your documents may contain:

  • extra spaces,
  • extra new lines,
  • formatting code such as HTML code,
  • different formats of dates, numbers, names, addresses,
  • different encoding standards,
  • etc.

You need to perform some normalization in order to "clean" the data and turn it into a structured data set.

By default, Parseur takes care of all this automatically: formatting code and extra spaces are removed, text encoding is streamlined, etc. All this makes Parseur "just work" out of the box.

Sometimes, you need to go the extra mile and use data normalization based on the kind of data, for example to properly format dates, times or addresses.

Here too, Parseur has your back. Let's see how!

Normalize data using Field formats

Assign an output format to a field to control the shape of its result. A format tells Parseur which kind of data a particular field contains and how to normalize it.

Available output formats are:

  • Multi-line Text (this is the default)
  • Single-line Text
  • Date
  • Time
  • Date and Time
  • Number
  • Full Name (Person's name)
  • Address
  • Tables
  • Linked Document

To assign an output format to a field in Parseur:

  1. Go to the Template editor
  2. Select the piece of text you want to extract and create a field
  3. Click on the Edit button next to the field name
  4. Select the format from the Output Format drop-down menu
  5. Click Update to close the field settings
  6. Once you have made all your changes, save the template

For some output formats you can also choose a related input format. Unlike output formats that are global for a mailbox, input formats are specific to a template.

In the following sections, we detail how and when to use each output format.

Text normalization

Parseur does most of the text sanitizing automatically. However, it offers two variations for standard Text fields, depending on whether you want to keep new lines.

Text (Multi-line) format

This is the default format when creating a field.

The "Text" format will extract all visible text from your emails, including visible new lines.

When selecting "Text" output format, you can further tweak the format by selecting an input format:

  • HTML text (default): tells Parseur that the documents contain HTML. Parseur will use HTML markup to determine line breaks and then remove all HTML markup from field result
  • Plain Text: tells Parseur that the document is text-only. Parseur will keep line breaks and any HTML markup in the original value, but it will remove consecutive spaces
  • Raw Text: tells Parseur to keep the original value as is.

Text (single-line) format

The "One-line Text" format will extract all visible text from your emails, excluding visible new lines. Like the Text format, it will also strip out any formatting and HTML elements and just keep the text.

Use the One-line Text format if you require the result field to be on a single line and exclude any line breaks.

As with the multi-line format, you can further tweak the format by selecting an input format:

  • HTML text (default): tells Parseur that the documents contain HTML. Parseur will remove all HTML markup from field result
  • Plain Text: tells Parseur that the document is text-only. Parseur will remove any consecutive space but keep any HTML markup from the original value
  • Raw Text: tells Parseur to remove line breaks but otherwise keep the original value as is.
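To make the difference between the two text formats concrete, here is a minimal Python sketch of the HTML input behavior described above. It is an illustration only, not Parseur's actual implementation: the names `to_multi_line` and `to_single_line` are hypothetical, and the set of tags treated as line breaks is an assumption.

```python
import re
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects visible text and records line breaks implied by HTML markup."""

    # Assumption: closing one of these tags ends a visible line.
    BLOCK_TAGS = {"p", "div", "li", "tr"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Whitespace in the HTML source is not a visible line break.
        self.parts.append(re.sub(r"\s+", " ", data))


def to_multi_line(html):
    """Strip markup, keep visible line breaks, collapse extra spaces."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).splitlines()
    cleaned = [re.sub(r" +", " ", line).strip() for line in lines]
    return "\n".join(line for line in cleaned if line)


def to_single_line(html):
    """Same as to_multi_line, but with line breaks replaced by spaces."""
    return " ".join(to_multi_line(html).split())


sample = "<p>Hello   <b>world</b></p><p>Second<br>line</p>"
print(to_multi_line(sample))   # three lines: Hello world / Second / line
print(to_single_line(sample))  # Hello world Second line
```

The single-line variant is built on top of the multi-line one, which mirrors how the two formats differ only in their treatment of new lines.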

Dates normalization

Dates and times can take all kinds of shapes in your documents. Quite often, applications integrated with Parseur require date fields to be formatted in a specific way.

Parseur offers 3 types of date and time formats. They are rather self-explanatory:

Date format

The "Date" format will sanitize a field into a date. If the field contains a date and a time, only the date information part will be kept. Examples of dates recognized by Parseur:

  • 12 Jan 2018
  • 2018-1-2
  • Wed Jan 24th, 2018 1:58pm
  • 01/12/2018: this date can either be the 12th of January or the 1st of December, depending on the locale and conventions. See the section below to tell Parseur how to disambiguate that situation.
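The ambiguity in the last example can be shown with a short Python sketch. The helper `parse_ambiguous_date` and its `day_first` flag are hypothetical names used for illustration; they stand in for the "month first" or "day first" preference and do not reflect how Parseur parses dates internally.

```python
from datetime import datetime


def parse_ambiguous_date(text, day_first):
    """Parse a numeric dd/mm/yyyy or mm/dd/yyyy date using the given convention."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date()


# The same text yields two different dates depending on the convention.
print(parse_ambiguous_date("01/12/2018", day_first=False))  # 2018-01-12
print(parse_ambiguous_date("01/12/2018", day_first=True))   # 2018-12-01
```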

Time format

The "Time" format will sanitize a field into a time. If the field contains a date and a time, only the time information part will be kept. Examples of times recognized by Parseur:

  • 1:58pm
  • 13:58:23
  • 12h36

Date and time format

The "Date and Time" format will sanitize a field into a datetime. If the field contains no time information, 00:00:00 will be used for the time part. Examples of datetimes recognized by Parseur:

  • Wed Jan 24th, 2018 1:58pm
  • 12 Jan 2018 13:58:23
  • 2018-01-24T05:18:44.841813+00:00
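The relationship between the three formats can be sketched in Python. This is only an illustration under stated assumptions: `parse_datetime` is a hypothetical helper that handles two of the example shapes above plus ISO 8601, whereas Parseur recognizes many more. It shows that a value without time information ends up at 00:00:00, and that the Date and Time formats correspond to keeping one part of the full datetime.

```python
from datetime import datetime

# Assumption: a tiny list of accepted shapes, for illustration only.
KNOWN_FORMATS = ("%d %b %Y %H:%M:%S", "%d %b %Y")


def parse_datetime(text):
    """Return a datetime; the time part defaults to 00:00:00 when absent."""
    text = text.strip()
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        pass
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date/time: {text!r}")


full = parse_datetime("12 Jan 2018 13:58:23")
print(full.date())                    # 2018-01-12 (what a Date field keeps)
print(full.time())                    # 13:58:23   (what a Time field keeps)
print(parse_datetime("12 Jan 2018"))  # 2018-01-12 00:00:00
```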

Configuring date input format

To tell Parseur how to read the dates found in your documents, change the default date input format in your user preferences:

  1. Click on your name in the navigation bar in the top right corner
  2. Click on Settings on the left menu
  3. Click on the Default formats tab
  4. In Input, change your input preferences: under Timezone, select the time zone of your documents (most likely your own time zone). Under Date format as found in emails, tell Parseur how to treat ambiguous dates (either month first or day first)
  5. Click on Update

Configuring date output format

Now that you've told Parseur that some fields are dates, you can specify the exact format in which those dates are output. The resulting fields are formatted according to your user preferences.

To change your result output preferences:

  1. Click on your name in the navigation bar in the top right corner
  2. Click on Settings on the left menu
  3. Click on the Default formats tab
  4. In Output, change your result output preferences (use this page to get the list of all available options)
  5. Click on Update

Number normalization

Parseur lets you easily parse numbers (including spaces, commas, etc.) into real numbers using the Number format.

Number format

The "Number" format will transform any number represented in a text into a "real" number. For instance, it will strip out any space, comma and additional formatting characters from the number.

Changing the decimal separator

By default, Parseur will use the period character (".") as the decimal separator. If your documents use the comma (","), change your user preferences. To do that, go to your user preferences and update the Decimal separator setting.
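The effect of the decimal separator setting can be illustrated with a small Python sketch. The function `parse_number` and its `decimal_separator` parameter are hypothetical; this is a simplified model of the behavior described above, not Parseur's own code.

```python
import re


def parse_number(text, decimal_separator="."):
    """Turn a formatted number found in text into a float."""
    if decimal_separator not in (".", ","):
        raise ValueError("decimal_separator must be '.' or ','")
    # Drop spaces, currency symbols and any other formatting characters.
    cleaned = re.sub(r"[^\d.,\-]", "", text)
    # Whichever of '.' and ',' is not the decimal separator is a grouping mark.
    grouping = "," if decimal_separator == "." else "."
    cleaned = cleaned.replace(grouping, "")
    return float(cleaned.replace(decimal_separator, "."))


print(parse_number("1,234.56"))         # 1234.56
print(parse_number("1 234,56", ","))    # 1234.56
print(parse_number("$ 1.234,56", ","))  # 1234.56
```

Note how the same characters are read differently depending on the setting, which is why the preference needs to match your documents.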

Full name normalization

Working with people's names can be hard. On top of the usual first name and last name sequence, some people have a middle name or a title, or choose to leave only their first name on your form. That makes parsing complex.

The Full Name format in Parseur makes it easy to automatically parse a person's name.

Example: say you have the following name in your document: Mr. Enrique S. de la Vega

Capturing that name in a field named "LeadName" with a Full Name format will give you the following result:

  • LeadName.title: Mr.
  • LeadName.first: Enrique
  • LeadName.middle: S.
  • LeadName.last: de la Vega
  • LeadName.full: Mr. Enrique S. de la Vega
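Once the name is split into parts, downstream code can pick the parts it needs. The Python sketch below builds a greeting from the sub-fields listed above. It assumes the sub-fields reach your application as a dictionary keyed by `title`, `first`, `middle`, `last` and `full`, with missing parts absent or empty. That shape, the `greeting` helper and the "customer" fallback are all illustrative assumptions, not a documented Parseur payload.

```python
def greeting(name):
    """Build a salutation from whichever name parts are available."""
    title = name.get("title")
    first = name.get("first")
    last = name.get("last")
    if title and last:
        return f"Dear {title} {last},"
    if first:
        return f"Dear {first},"
    # Fall back to the full captured text, or a generic word if empty.
    return f"Dear {name.get('full') or 'customer'},"


lead_name = {
    "title": "Mr.",
    "first": "Enrique",
    "middle": "S.",
    "last": "de la Vega",
    "full": "Mr. Enrique S. de la Vega",
}
print(greeting(lead_name))  # Dear Mr. de la Vega,
```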

Note: Our name parsing algorithm currently works best with English forms of names. It may give varying results for names that follow other conventions, such as Slavic, Chinese or Latin names.

Address parsing and normalization

Parseur can automatically parse and normalize an address. It can also fill in the gaps for partial addresses, determine coordinates and provide a Google Maps link.

Important: Parsing addresses will cost one additional credit for each address parsed in a document. For example, if your document contains 2 fields with an Address format, Parseur will charge 3 credits (1 credit for parsing the base document and 1 credit for parsing each of the 2 addresses).
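The credit arithmetic above can be written as a one-line rule. The function name `credits_for_document` is hypothetical, and the sketch only covers the rule stated in this article (one credit for the base document plus one per address parsed).

```python
def credits_for_document(address_count):
    """Credits charged for one document with the given number of parsed addresses."""
    if address_count < 0:
        raise ValueError("address_count cannot be negative")
    return 1 + address_count


print(credits_for_document(0))  # 1
print(credits_for_document(2))  # 3
```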

Example: Say you have the following address in your document: 500 Chartres Street Appt 34, New Orleans

Capturing that address in a field named "Location" with an Address format will give you the following results:

  • Location.original: 500 Chartres Street Appt 34, New Orleans
  • Location.normalized: 500 Chartres St #34, New Orleans, LA 70130, USA
  • Location.number: 500
  • Location.street: Chartres St
  • Location.address1: 500 Chartres St
  • Location.address2: #34
  • Location.city: New Orleans
  • Location.zip: 70130
  • Location.county: Orleans Parish
  • Location.state: Louisiana
  • Location.state_code: LA
  • Location.country: United States
  • Location.country_code: US
  • Location.found: True
  • Location.lat: 29.9558754
  • Location.lng: -90.065056
  • Location.map: link

If Parseur is not able to determine the address, it will set the found flag to false. The result will look like this:

  • Location.original: 220 Strangely named road
  • Location.found: False
  • Location.normalized: 220 Strangely named road

Tables and repetitive structures

Parseur can transform tables and other repetitive blocks of text into properly formatted list data. Head over to the following article for more information about using Lists and Tables: How to extract tables from emails.

Sanitizing links and fetching linked document

The last format available, Linked Document, is a particular one. This format will only work with fields containing a URL or a link to a web page. It will fetch the web page at the given URL and add it to your document queue. You can then create a template for that new document and use Parseur to extract data from web pages!

Head over to the following article for more information about using Linked Document: How to parse a web page from a link in an email.
