Advanced data post processing with Python

How to add custom fields, apply business logic and change the structure of your parsed data by writing Python code


Using the Post Processing module of your Parseur mailbox, you can perform advanced manipulations on your parsed data such as field merge, split, date and time calculations, regular expression match and even add your own business logic!

Use the Post Processing module when you need data operations that the standard field formats do not offer.

Disclaimers

  • Post Processing lets you write Python code. To use it, you should be comfortable with basic programming and ideally know a bit of Python.

  • This module is only available on Pro plans and above.

Access the post processing module in Parseur

To access the post processing module:

  • open a mailbox

  • click on "Post Process" in the left menu.

The Post Processing screen is split into several sections:

  • At the top, use the Previous and Next buttons to choose which parsed document you want to test your code on

  • On the left, you have the original parsed data before the transformation

  • In the middle is where you write your post processing Python code.

  • On the right is what the data looks like after going through post processing

As you type your code, the results on the right will automatically update.

Create your first post processing code

Let's create your first code. This will add a custom field to the parsed result. This is something you can do already using Static Fields, but it will allow you to understand how Post Processing works.

  • Select the commented out examples and remove them

  • Type the following in the editor (you can copy and paste it too):

data["my_first_field"] = "Hello World!"

  • Wait for the results on the right to refresh

  • Check at the bottom of the transformed data: you now have a new my_first_field whose content is "Hello World!"

Congratulations, you have created your first field using post processing!

  • Now, save this code by clicking on the Save button (or with Ctrl+S)

  • Go to the document queue and reprocess a document: the new field is added to the result of that document

  • Open the logs: you now have 2 processed entries, the first one with the original data extracted by the template and the second one with the data after post processing.

Writing post processing code

Programming in Python

The Post Processing module lets you write Python 3.6 code. Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace.

If you're new to Python, here are some useful links to get you started:

Examples in the section below will give you some ready-made snippets to copy & paste and adjust to your use case.

Post processing in action: manipulating the parsed data variable

When you write code in the post processing module, you have access to the data variable that lists all your parsed data from the document you are post processing.

The data variable is a Python dict:

  • each parsed field is an item of the dictionary.

  • modify the data in place to transform the fields.

  • alternatively you can save your transformed data in another variable, and return that variable instead at the end of the processing code

You also have access to the extra variable.

  • the extra variable is a dict that contains all metadata fields, even those not enabled in your mailbox settings

  • use this variable for example if you want to extract data from the document content by accessing extra["HtmlDocument"] or extra["TextDocument"]
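As a rough sketch of the two approaches described above, the snippet below wraps the post processing code in plain functions so that it can run outside Parseur. The function names, the sample field names and the sample values are assumptions made for illustration. Inside the editor you would only write the function bodies, since data and extra are already provided.

```python
def post_process_in_place(data, extra):
    # Approach 1: modify the data dict in place and return it
    data["text_length"] = len(extra.get("TextDocument", ""))
    return data


def post_process_new_dict(data, extra):
    # Approach 2: build a separate dict and return that one instead
    result = {"customer": data.get("name", "").title()}
    return result


# Hypothetical parsed data and metadata, for illustration only
sample_data = {"name": "jane doe"}
sample_extra = {"TextDocument": "Hello"}

in_place = post_process_in_place(dict(sample_data), sample_extra)
new_dict = post_process_new_dict(dict(sample_data), sample_extra)
```

With the second approach, only the keys placed in the returned dict appear in the final result, which is handy when you want to drop or rename most fields.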

Available functions

Here is the list of all accessible built-in functions:

  • abs: Returns the absolute value of a number

  • all: Returns True if all elements of the given collection evaluate to True, False otherwise

  • any: Returns True if any element of the given collection evaluates to True, False if none does

  • bool: Converts the given parameter to a boolean

  • bytes: Converts the given str parameter to bytes

  • callable: Returns True if the given parameter can be called. False otherwise

  • chr: Returns the character that has the given Unicode code point

  • complex: Returns a complex number, given its real and imaginary parts

  • dict: Builds a dict from named parameters or a collection of key/value pairs

  • divmod: Returns the quotient and remainder of the given numbers' division

  • enumerate: Count from 0, for each element in a given collection

  • format_address(address_str) is a convenience method that takes an address formatted as a string and returns a geolocated address object. See our address format article for more information.

  • format_datetime(date_str, format) is a convenience method that takes a date formatted as a string and formats it according to Python datetime format codes

  • filter: Keep or remove elements from a collection, according to a function

  • float: Converts the given parameter to a floating-point number (which has limited precision; see the decimal module for exact values)

  • hasattr: Returns True if an object has a given attribute, False otherwise

  • hash: Returns the hash value of an object, as an integer (equal objects have the same hash)

  • hex: Converts an integer into hexadecimal as a string starting with 0x

  • id: Returns an integer that is the identity of an object

  • int: Converts the given parameter to an integer

  • isinstance: True if 1st parameter is of the type given as the 2nd parameter

  • issubclass: True if 1st parameter is a subclass of a class given as the 2nd parameter

  • len: Returns the length of the collection given, as an integer

  • list: Builds a list from given parameters

  • max: Returns the largest element from a given collection

  • map: Apply a given function to each element of a given collection

  • min: Returns the smallest element from a given collection

  • next: Returns the next element from a given iterator

  • oct: Converts an integer into an octal number as a string starting with 0o

  • ord: Returns an integer representing the Unicode code point of a given character

  • pow: Returns the first argument raised to the second argument's power

  • range: Build a sequence of numbers

  • repr: Returns a string that is a representation of an object

  • reversed: Returns an iterator over a given sequence, in reverse order

  • round: Given a number, returns the closest rounded integer

  • slice: Returns a slice object, that is a piece of a collection

  • sorted: Returns a sorted copy of a given collection

  • str: Converts the given object to a string

  • sum: Returns a sum of all values in a given collection

  • tuple: Builds a tuple from given parameters

  • zip: Aggregates elements from each of the collections given as parameters

Available modules

In addition to the built-in functions above, and True, False and None, the following modules and functions are available:

  • datetime module for date, time and datetime manipulations

  • dateutil module for advanced date manipulations, including parsing strings to dates, calculating deltas and working with timezones

  • decimal module for manipulating decimal numbers where exact precision is important, for example when working with prices

  • re module for working with regular expressions

  • PostProcessError custom exception to raise error messages that will appear in your logs
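To illustrate why the decimal module matters for prices, here is a minimal sketch comparing floats and decimals. The import line is only needed when running this outside Parseur, since the article states that decimal is already available in the editor and that import is not allowed there.

```python
import decimal

# Floats are binary approximations, so small errors can creep into sums
float_total = 0.1 + 0.2

# Decimal values built from strings keep the exact decimal digits
decimal_total = decimal.Decimal("0.1") + decimal.Decimal("0.2")

is_float_exact = float_total == 0.3
is_decimal_exact = decimal_total == decimal.Decimal("0.3")
```

Here is_float_exact is False while is_decimal_exact is True, which is why Decimal is usually the safer choice for monetary amounts.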

Limitations of the Post Processing module

The post processing module should give you everything you need to perform the most advanced data manipulations.

Note however the following limitations, especially if you are already a Python expert:

  • Only a subset of Python standard built-ins is included. Trying to use a non-included built-in will result in a NameError exception.

  • You cannot use the import keyword to import additional modules. Trying to perform an import will result in an ImportError exception.

  • You cannot use the format() method on strings. Use the f"..." notation instead.

  • You cannot access internal object attributes (starting with a _). Trying to access an internal attribute will raise an exception
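As an illustration of the f-string alternative to format(), the sketch below uses hypothetical field names and values. It assumes the editor accepts standard f-string format specifiers, which the article does not state explicitly.

```python
# Hypothetical parsed data, for illustration only
data = {"first_name": "Ada", "amount": 1234.5}

# Equivalent of "{} owes {:.2f}".format(...), written as an f-string
summary = f"{data['first_name']} owes {data['amount']:.2f}"

# Format specifiers go after a colon inside the braces,
# for example a thousands separator with two decimals
pretty_amount = f"{data['amount']:,.2f}"
```

Note the single quotes around field names inside the braces: they avoid clashing with the double quotes that delimit the f-string.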

Stopping the execution and preventing exports

If you want to stop processing a particular document during post processing, simply return None. This will mark the document as Skipped (post process) and won't trigger any export such as Zaps or webhooks.

Handling errors and exceptions

Any exception raised during post processing will stop the post processing and mark the document as Post process failed.

When that happens, click on the magnifying glass icon to access the logs and get more details about the error.

You can log your own error message in the log by raising the PostProcessError exception. Example: raise PostProcessError("This user is not allowed").

Any exception raised while writing your code in the Post Processing module will prevent you from saving your code (except when raising PostProcessError exceptions).
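The following sketch shows one way to raise PostProcessError when a required field is missing. The class definition is only a stand-in so that the example runs outside Parseur (the editor provides the real exception), and the post_process wrapper and the email field are hypothetical.

```python
class PostProcessError(Exception):
    """Stand-in for the exception that Parseur provides in the editor."""


def post_process(data):
    # Stop with a readable log message when a required field is missing
    if not data.get("email"):
        raise PostProcessError("Missing email: cannot identify the customer")
    data["email"] = data["email"].strip().lower()
    return data


ok_result = post_process({"email": "  Jane@Example.com "})

try:
    post_process({"name": "No email here"})
    error_message = None
except PostProcessError as error:
    error_message = str(error)
```

Raising early with a specific message tends to make the logs easier to read than letting a later KeyError or TypeError surface.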

Useful keyboard shortcuts

The following shortcuts are useful to make writing your code more efficient and pleasant.

  • Ctrl+S to save

  • Ctrl+F to search

  • Ctrl+/ to comment out the current line or a block of code

  • Ctrl+D to delete the current line or block of code

  • Ctrl+L to go to line

  • Tab to indent

  • Shift+Tab to outdent

Note: on macOS, use Cmd instead of Ctrl.

Examples of the most common post processing use cases

Merge two or more fields

  • Option 1: for simple field merging use the + notation

data["full_name"] = data["first_name"] + " " + data["last_name"]

  • Option 2: for more complex manipulations, use the f-string notation

data["description"] = f"Notes: status {data['status'].upper()}"

Split a field into sub fields

Use the split() method to split a field into sub fields.

Let's say the field vehicle contains Kia, Stinger, GT (red Leather) and you want to split this field into make, model and variant. You can use the following code:

make, model, variant = data["vehicle"].split(", ")
data["make"] = make # will store Kia
data["model"] = model # will store Stinger
data["variant"] = variant # will store GT (red Leather)

Optionally, you can limit the number of splits:

make, variant = data["vehicle"].split(", ", 1)
data["make"] = make # will store Kia
data["variant"] = variant # will store Stinger, GT (red Leather)
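The unpacking used above raises a ValueError when the field contains fewer or more parts than expected. A defensive variant could look like the sketch below, where split_vehicle is a hypothetical helper and missing parts are stored as None.

```python
def split_vehicle(vehicle):
    # Split at most twice so extra commas stay inside the variant
    parts = [part.strip() for part in vehicle.split(",", 2)]
    # Pad with None so unpacking never fails on short values
    parts += [None] * (3 - len(parts))
    make, model, variant = parts
    return {"make": make, "model": model, "variant": variant}


full = split_vehicle("Kia, Stinger, GT (red Leather)")
short = split_vehicle("Kia, Stinger")
```

Splitting on "," and stripping each part also tolerates inconsistent spacing after the commas.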

Work with optional fields

Using the data["..."] notation will raise a KeyError if the field is not present in the parsed data. This can be a problem if you have several templates, some of them with optional fields.

When this happens, you can use the data.get() method instead.

# the following will raise an error 
# if the field named "option" is not present
data["description"] = data["option"]

# Use the get() method instead
data["description"] = data.get("option")

# You can also supply a default value
data["description"] = data.get("option", "No options.")
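One subtlety worth a small sketch: the default passed to get() is only used when the key is missing, not when the field exists but is empty. The field names and values below are hypothetical.

```python
# Hypothetical parsed data: "option" is present but empty, "size" is absent
data = {"option": "", "color": None}

# get() only falls back to the default when the key is missing
missing = data.get("size", "No size.")
empty = data.get("option", "No options.")

# "or" also replaces empty strings and None with the fallback
empty_with_fallback = data.get("option") or "No options."
none_with_fallback = data.get("color") or "No color."
```

Be aware that the "or" form also replaces other falsy values such as 0, so it is best kept for text fields.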

Iterate on a table field

Use a for-loop with enumerate() to walk through a table and return the current index and value.

Example: let's say you have an items table field with quantity, description and unit_price columns, and you want to add a new price value for each item as well as a grand_total field.

data["grand_total"] = 0
for index, item in enumerate(data["items"]):
    price = item["quantity"] * item["unit_price"]
    data["items"][index]["price"] = price
    data["grand_total"] = data["grand_total"] + price

Note: you need to make sure quantity and unit_price are numbers and not strings. You can use the Number format for that, or perform the conversion directly in Python using int() or decimal.Decimal().
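If the table values arrive as strings, the conversion mentioned in the note could be done as in the sketch below. The sample items are made up, the import is only needed outside Parseur, and storing the results as strings is an assumption made here to keep the output JSON-serializable.

```python
import decimal

# Hypothetical table field where every value was captured as text
data = {
    "items": [
        {"quantity": "2", "description": "Widget", "unit_price": "9.99"},
        {"quantity": "1", "description": "Gadget", "unit_price": "24.50"},
    ]
}

grand_total = decimal.Decimal("0")
for item in data["items"]:
    # Convert the captured strings before doing any arithmetic
    price = int(item["quantity"]) * decimal.Decimal(item["unit_price"])
    # Store the result as a string so it stays JSON-serializable
    item["price"] = str(price)
    grand_total += price

data["grand_total"] = str(grand_total)
```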

Work with dates and times

For most use cases, you can use the date and time field formats. However, you may sometimes need to compute a date from different fields. In those situations, you can use the datetime and dateutil modules and the format_datetime() method to manipulate dates.

Example 1: Parse date string and convert format

Let's say you have a start_date field formatted as Sunday 1 November 2020 and a start_time field formatted as 1PM, and you want to create a new start_datetime field formatted as YYYY-MM-DD HH:MM.

Use the format_datetime(date_str, format) convenience method to parse and format a date.

datetime_str = data["start_date"] + " " + data["start_time"]

data["start_datetime"] = format_datetime(datetime_str, "%Y-%m-%d %H:%M")

# start_datetime is now "2020-11-01 13:00"

For a list of available Python date and time formats, check out the date and time field format page (at the bottom).
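To make the format codes more concrete, here is a rough standard-library approximation of the same conversion. It is not how format_datetime() is implemented. It spells out the input format explicitly and assumes English day and month names.

```python
import datetime

start_date = "Sunday 1 November 2020"
start_time = "1PM"

# %A weekday name, %d day, %B month name, %Y year, %I 12-hour clock, %p AM/PM
parsed = datetime.datetime.strptime(
    f"{start_date} {start_time}", "%A %d %B %Y %I%p"
)

# %H is the 24-hour clock and %M the minutes
start_datetime = parsed.strftime("%Y-%m-%d %H:%M")
```

The output format string is the same one passed to format_datetime() in the example above.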

Example 2: Compute dates

Let's say you want to store the date for the next day, in a new field named tomorrow.

Use dateutil.relativedelta.relativedelta to compute a one day, positive delta, then add it to the current date and time provided by datetime.datetime.now().

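# Note: the result is a datetime object. Convert it to a string
# if it must be returned as JSON (see "Invalid return" below).
data["tomorrow"] = (
    datetime.datetime.now()
    + dateutil.relativedelta.relativedelta(days=1)
)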
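As a variation, the sketch below computes the next day with datetime.timedelta from the standard library (swapped in here for relativedelta so that the example has no third-party dependency) and formats the result as a string. The next_day helper is hypothetical.

```python
import datetime


def next_day(now):
    # Add one day, then format as a string so the value can be serialized
    tomorrow = now + datetime.timedelta(days=1)
    return tomorrow.strftime("%Y-%m-%d")


# A fixed date keeps the example reproducible;
# pass datetime.datetime.now() to use the current date instead
tomorrow_str = next_day(datetime.datetime(2020, 12, 31, 18, 30))
```

For whole-day offsets timedelta is enough. relativedelta becomes useful for calendar-aware offsets such as "one month later".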

Work with regular expressions

Python regular expressions are (complex but) powerful methods to search for patterns in fields and perform splits and replacements.

To learn more about regexps, check out this introductory article about regular expressions in Python.

Example 1: price to string (simple version)

This is a simple version that removes all letters, spaces and $ symbols from a price field.

data["price"] = re.compile(r'[a-zA-Z $]+').sub('', data["price"])

Example 2: price to decimal (advanced version)

This is a full working example of how to use regular expressions to convert a price string, which may include a currency symbol, text or a thousands separator, into a decimal, with error handling. This code creates a reusable function that you can call to convert fields at different places in your post processing code.

REMOVE_FROM_DECIMAL = re.compile(r"[a-zA-Z\ \$£€\!\?]+")

def str_to_decimal(value):
    clean_value = REMOVE_FROM_DECIMAL.sub("", value)
    if clean_value == "":
        return None
    if "," in clean_value:
        if re.search(r",\d{3}", clean_value):
            # Comma as thousand separator
            clean_value = clean_value.replace(",", "")
        else:
            # Comma as decimal separator
            clean_value = clean_value.replace(",", ".")
    try:
        return decimal.Decimal(clean_value)
    except decimal.InvalidOperation:
        return None
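The function above can then be applied to several fields. The sketch below repeats it so that the example runs on its own outside Parseur (hence the imports). The field names and sample values are hypothetical.

```python
import decimal
import re

REMOVE_FROM_DECIMAL = re.compile(r"[a-zA-Z\ \$£€\!\?]+")


def str_to_decimal(value):
    # Same logic as the function above, repeated to keep this example runnable
    clean_value = REMOVE_FROM_DECIMAL.sub("", value)
    if clean_value == "":
        return None
    if "," in clean_value:
        if re.search(r",\d{3}", clean_value):
            clean_value = clean_value.replace(",", "")
        else:
            clean_value = clean_value.replace(",", ".")
    try:
        return decimal.Decimal(clean_value)
    except decimal.InvalidOperation:
        return None


# Hypothetical parsed fields with different price notations
data = {"subtotal": "$1,234.50", "tax": "12,5 €", "note": "free!"}
for field in ("subtotal", "tax", "note"):
    data[field] = str_to_decimal(data[field])
```

After the loop, subtotal holds Decimal("1234.50"), tax holds Decimal("12.5") and note holds None, because nothing numeric is left once the text is removed.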

Example 3: extract a value following a pattern

Use re.search() along with the group() method of the returned match object and a regular expression whose groups are identified with parentheses to extract the pattern.

# let's assume data["room_details"] = "4 rooms, 3 bedrooms, 84 sqm"
# and we want to extract the area

area_match = re.search(r"(\d+) sqm", data.get("room_details", ""))
if area_match:
    data["area"] = area_match.group(1)

# => data["area"] is "84" (a string)
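When several values need to be extracted at once, named groups can make the code easier to read. The sketch below assumes the room details always follow the same "rooms, bedrooms, sqm" layout. The group names are chosen for illustration, and the import is only needed outside Parseur.

```python
import re

room_details = "4 rooms, 3 bedrooms, 84 sqm"

# (?P<name>...) gives each captured group a name
pattern = re.compile(
    r"(?P<rooms>\d+) rooms, (?P<bedrooms>\d+) bedrooms, (?P<area>\d+) sqm"
)
match = pattern.search(room_details)

# groupdict() returns the named groups as a dict of strings,
# converted to integers here; an empty dict means no match
extracted = (
    {key: int(value) for key, value in match.groupdict().items()}
    if match
    else {}
)
```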

Add business logic

Write conditional logic statements to trigger different behaviors depending on incoming parsed data.

Example: let's say we are a food delivery company using Parseur to manage incoming orders. Restaurants forward all orders to us but we are only interested in parsing delivery orders, not pickup orders. Also, we want to make sure to log an error if we can't determine the order type.

The type of order is extracted into the order_type field by our templates.

if "order_type" not in data:
    # something is wrong, the intern must have forgotten again
    # to add the order_type field when creating this template.
    # Let's stop here and log an error.
    raise PostProcessError(f"Error: order_type not found in parsed data. Please check template {extra['Template']}.")

if data["order_type"].lower() == "pickup":
    # all order types including pickup, Pickup or PICKUP are skipped
    return None

# From here, we know the order is valid and is a delivery.
# Write the rest of the post processing code here

Most common error messages and solutions

Python error messages are usually quite expressive for the seasoned programmer. When you get an error message, you will get the line number where the error was triggered, making it easier to spot the problem.

SyntaxError: invalid syntax at statement: [...]

This means the code you wrote doesn't follow Python syntax. It can happen for various reasons:

  • missing parenthesis, bracket or quote

  • assignment issues

  • misspelling keywords

If you can't find the reason for the syntax error, check out the following article on fixing the most common Python syntax errors.

IndentationError: unexpected indent at statement: [...]

An indent is a specific number of spaces or tabs denoting that a line of code is part of a particular code block.

This error means one of your blocks of code is wrongly indented. Python is unusual in that whitespace is significant. Every statement in your main code block must start without any leading space, and every statement in a sub block (for example in an if block) must be indented with the same number of spaces.

To avoid this error, use the Tab key to indent your code consistently and the Shift+Tab key to outdent it.

KeyError: 'key'

This can happen when:

  • You try to access a key that doesn't exist in a dict (for example, a field name that doesn't exist in data - remember that Python is case-sensitive)

  • You try to access an element in an array at an index that doesn't exist (for example you try to access items[10] but the items array only has 4 elements). Note that on a standard Python list, this raises an IndexError rather than a KeyError.

There are several ways to fix this error:

  • if you are dealing with an optional field, use the data.get("field_name") method instead of data["field_name"]

  • If you want to test if a field is present before working on it, you can use:

if "field_name" in data:
    # ... do something with data["field_name"]

  • If you are working with array indexes, you can check the index is valid with:

if index < len(my_array):
    # ... do something with my_array[index]

NameError: name '...' is not defined

This means that the built-in function, method or module you are trying to use is not available. Check the lists of available functions and modules earlier on this page.

Invalid return, Error: <error_message>

This means the post process data you are returning is invalid. For a return to be valid, it must be serializable in JSON.

For example, you cannot return a Python datetime object, because JSON doesn't have a datetime format. You need to convert it to a string first, for example using the format_datetime() method.
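The sketch below reproduces the problem with the standard json module and shows one possible fix using isoformat(). The created field is hypothetical, and the exact serialization that Parseur performs may differ from json.dumps.

```python
import datetime
import json

# Hypothetical result containing a datetime object
result = {"created": datetime.datetime(2020, 11, 1, 13, 0)}

try:
    json.dumps(result)
    serializable = True
except TypeError:
    # json.dumps raises TypeError for types JSON cannot represent
    serializable = False

# Converting the datetime to an ISO 8601 string makes the dict valid JSON
result["created"] = result["created"].isoformat()
as_json = json.dumps(result)
```

The same reasoning applies to other non-JSON types such as decimal.Decimal, which can be converted with str() or float() before being returned.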
