Using the Post Processing module of your Parseur mailbox, you can perform advanced manipulations on your parsed data such as field merge, split, date and time calculations, regular expression match and even add your own business logic!
Use the Post Processing module if you need to perform data operations not available from using standard Field formats.
Disclaimers
Post Processing lets you write Python code. To use it, you should be comfortable with basic programming and ideally know a little about the Python programming language.
This module is only available on Pro plans and above.
Access the post processing module in Parseur
To access the post processing module:
open a mailbox
click on "Post Process" in the left menu.
The post processing screen looks like this:
The Post Processing screen is split into several sections:
At the top, use the Previous and Next buttons to choose the parsed data you want to test your code on.
On the left, you have the original parsed data before the transformation
In the middle is where you write your post processing Python code.
On the right is what the data looks like after going through post processing
As you type your code, the results on the right will automatically update.
Create your first post processing code
Let's create your first code. This will add a custom field to the parsed result. This is something you can do already using Static Fields, but it will allow you to understand how Post Processing works.
Select the commented out examples and remove them
Type the following in the editor (you can copy and paste it too):
data["my_first_field"] = "Hello World!"
Wait for the results on the right to refresh
Check the bottom of the transformed data: you now have a new my_first_field field whose content is "Hello World!"
Congratulations, you have created your first field using post processing!
Now, save this code by clicking on the Save button (or with Ctrl+S).
Go to the document queue and reprocess a document: the new field is added to the result of that document.
Open the logs: you now have 2 processed entries, the first one with the original data extracted by the template and the second one with the data after post processing.
Check Your Code Against a Specific Document
To check your code against a specific document in Parseur, you have two options:
Use the Previous and Next buttons: Navigate between documents using these buttons to find the document you want to check.
Directly Access a Specific Document: If you want to check your code against a specific document, follow these steps:
Open the document in the Parseur app.
Look at the page URL. It will have a format like this: https://app.parseur.com/p/88888/d/123456789.
To directly access the post processing page for this document, replace the last /d/ in the URL with /p/. For example, change the URL to: https://app.parseur.com/p/88888/p/123456789.
This last method allows you to directly work with the specific document you need.
Writing post processing code
Programming in Python
The Post Processing module lets you write Python 3.6 code. Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
If you're new to Python, here are some useful links to get you started:
Examples in the section below will give you some ready-made snippets to copy & paste and adjust to your use case.
Post processing in action: manipulating the parsed data variable
When you write code in the post processing module, you have access to the data variable, which holds all the parsed data from the document you are post processing.
The data variable is a Python dict:
each parsed field is an item of the dictionary.
modify data in place to transform the fields.
alternatively, you can save your transformed data in another variable and return that variable at the end of the processing code.
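As a minimal sketch of the second approach, the snippet below builds a separate result dict and returns it. The sample values, the customer field name and the post_process wrapper function are assumptions made so the snippet runs as a plain Python script; in the Parseur editor you would write only the body, with return at the top level.

```python
# Hypothetical sample standing in for the `data` dict that Parseur provides.
data = {"first_name": "Ada", "last_name": "Lovelace", "internal_note": "n/a"}


def post_process(data):
    # Hypothetical wrapper: it only exists so that `return` is valid
    # when this sketch is run outside Parseur.
    result = {
        # Keep only the fields wanted in the final output
        "customer": data["first_name"] + " " + data["last_name"],
    }
    return result


output = post_process(data)
print(output)
```

Returning a new dict leaves the original data untouched, which can be handy when the output should contain only a few selected fields.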
You also have access to the extra variable.
the extra variable is a dict that contains all metadata fields, even those not enabled in your mailbox settings.
use this variable, for example, if you want to extract data from the document content by accessing extra["Content"], extra["HtmlDocument"] or extra["TextDocument"].
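The following sketch shows one way to pull a value out of the document text through extra. The sample document text, the "Order ref" label and the order_ref field name are made up for illustration, and the import line is only needed when running outside Parseur (re is already available in the editor, where import is not allowed).

```python
import re  # only needed outside Parseur

# Hypothetical samples standing in for the variables Parseur provides.
extra = {"TextDocument": "Hello,\nOrder ref: AB-1234\nThank you!"}
data = {}

# Use .get() with a default so a missing metadata field does not raise KeyError
text = extra.get("TextDocument", "")
match = re.search(r"Order ref: (\S+)", text)
if match:
    data["order_ref"] = match.group(1)

print(data)
```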
Available functions
Here is the list of all accessible built-in functions:
abs: Returns the absolute value of a number
all: Returns True if all elements of a given collection evaluate to True, False otherwise
any: Returns True if any element of a given collection evaluates to True, False if none does
bool: Converts the given parameter to a boolean
bytes: Converts the given str parameter to bytes
callable: Returns True if the given parameter can be called, False otherwise
chr: Returns the character that has the given Unicode code point
complex: Returns a complex number, given its real and imaginary parts
dict: Builds a dict from named parameters or a collection of key/value pairs
divmod: Returns the quotient and remainder of the given numbers' division
enumerate: Counts from 0, pairing each index with each element of a given collection
format_address(address_str): a convenience method that takes an address formatted as a string and returns a geolocated address object. See our address format article for more information.
format_datetime(date_str, format): a convenience method that takes a date formatted as a string and formats it according to Python datetime formats
filter: Keeps or removes elements from a collection, according to a function
float: Converts the given parameter to a floating-point number (approximate; use the decimal module when precision matters)
hasattr: Returns True if an object has a given attribute, False otherwise
hash: Returns an integer hash value for an object (equal objects have the same hash)
hex: Converts an integer into hexadecimal as a string starting with 0x
id: Returns an integer that is the identity of an object
int: Converts the given parameter to an integer
isinstance: Returns True if the 1st parameter is of the type given as the 2nd parameter
issubclass: Returns True if the 1st parameter is a subclass of the class given as the 2nd parameter
len: Returns the length of the collection given, as an integer
list: Builds a list from a given collection
max: Returns the largest element from a given collection
map: Applies a given function to each element of a given collection
min: Returns the smallest element from a given collection
next: Returns the next element from a given iterator
oct: Converts an integer into an octal number as a string starting with 0o
ord: Returns an integer representing the Unicode code point of a given character
pow: Returns the first argument raised to the second argument's power
range: Builds a sequence of numbers
repr: Returns a string that is a representation of an object
reversed: Returns the elements of a given sequence in reverse order
round: Given a number, returns the closest integer (or the number rounded to a given number of decimals)
slice: Returns a slice object, that is a piece of a collection
sorted: Returns a sorted copy of a given collection
str: Converts the given object to a string
sum: Returns the sum of all values in a given collection
tuple: Builds a tuple from a given collection
zip: Aggregates elements from each of the collections given as parameters
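To give a feel for how a few of these built-ins combine, here is a small sketch that summarizes a table field. The items table, its column names and the output field names are hypothetical, and the code is written as plain Python; it has not been checked against the Parseur editor itself.

```python
# Hypothetical sample standing in for the `data` dict that Parseur provides.
data = {
    "items": [
        {"name": "B", "qty": 2},
        {"name": "A", "qty": 5},
        {"name": "C", "qty": 1},
    ]
}

quantities = [item["qty"] for item in data["items"]]

data["item_count"] = len(data["items"])       # number of rows
data["total_qty"] = sum(quantities)           # sum of one column
data["max_qty"] = max(quantities)             # largest value
data["avg_qty"] = round(sum(quantities) / len(quantities), 2)
data["names_sorted"] = sorted(item["name"] for item in data["items"])

print(data["item_count"], data["total_qty"], data["max_qty"], data["avg_qty"])
```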
Available modules
In addition to the built-in functions above, and True, False and None, the following modules and functions are available:
datetime module for date, time and datetime manipulations
dateutil module for advanced date manipulations, including parsing strings to dates, calculating deltas and working with timezones
decimal module for manipulating floating point numbers where precision is important, for example when working with prices
re module for working with regular expressions
PostProcessError custom exception to raise error messages that will appear in your logs
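As an illustration of why decimal is useful for prices, the sketch below computes a tax amount without binary floating-point rounding surprises. The net and tax_rate field names and values are hypothetical, and the import line is only needed outside Parseur, where the module is not preloaded.

```python
import decimal  # only needed outside Parseur

# Hypothetical sample standing in for the `data` dict that Parseur provides.
data = {"net": "19.99", "tax_rate": "0.2"}

net = decimal.Decimal(data["net"])
rate = decimal.Decimal(data["tax_rate"])

# quantize() rounds to two decimal places (cents)
tax = (net * rate).quantize(decimal.Decimal("0.01"))

# Store the results as strings so they stay exact and easy to export
data["tax"] = str(tax)
data["gross"] = str(net + tax)

print(data["tax"], data["gross"])
```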
Limitations of the Post Processing module
The post processing module should give you everything you need to perform the most advanced data manipulations.
Note however the following limitations, especially if you are already a Python expert:
Only a subset of Python standard built-ins is included. Trying to use a non-included built-in will result in a NameError exception.
You cannot use the import keyword to import additional modules. Trying to perform an import will result in an ImportError exception.
You cannot use the format() method on strings. Use the f"..." notation instead.
You cannot access internal object attributes (starting with a _). Trying to access an internal attribute will raise an exception.
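For instance, a line that would normally be written with the format() method on a string can be expressed as an f-string instead. This is a sketch in plain Python; the field names and values are hypothetical.

```python
# Hypothetical sample standing in for the `data` dict that Parseur provides.
data = {"first_name": "Ada", "total": 12.5}

# The f-string notation replaces str.format(), including format
# specifiers such as :.2f for two decimal places.
data["summary"] = f"{data['first_name']} owes {data['total']:.2f} EUR"

print(data["summary"])
```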
Stopping the execution and preventing exports
If you want to stop the execution of a particular document during post processing, simply return None. This will mark the document as Skipped (post process) and won't trigger any export like Zaps or webhooks.
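A minimal sketch of this pattern could look as follows. The invoice_number field and the post_process wrapper are assumptions: the wrapper only exists so that return is valid in a plain Python script, whereas in the Parseur editor the return statement is written at the top level.

```python
def post_process(data):
    # Hypothetical rule: documents without an invoice number are skipped
    if not data.get("invoice_number"):
        return None
    # Otherwise continue with the normal transformations
    data["invoice_number"] = data["invoice_number"].strip()
    return data


skipped = post_process({"subject": "Newsletter"})
kept = post_process({"invoice_number": " INV-42 "})
print(skipped, kept)
```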
Handling errors and exceptions
Any exception raised during post processing will stop the post processing and mark the document as Post process failed.
When that happens, click on the magnifying glass icon to access the logs and get more details about the error.
You can log your own error message in the log by raising the PostProcessError exception. Example: raise PostProcessError("This user is not allowed").
Any exception raised while writing your code in the Post Processing module will prevent you from saving your code (except when raising PostProcessError exceptions).
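The sketch below shows one way to raise descriptive errors from a validation step. PostProcessError is provided by Parseur; the class defined here is only a stand-in so the snippet runs elsewhere, and the customer_email field and validate helper are hypothetical.

```python
class PostProcessError(Exception):
    """Stand-in for the exception Parseur provides (not needed in the editor)."""


def validate(data):
    # Hypothetical checks on a hypothetical field
    if "customer_email" not in data:
        raise PostProcessError("customer_email is missing")
    if "@" not in data["customer_email"]:
        raise PostProcessError(f"Invalid email: {data['customer_email']}")
    return data


try:
    validate({"customer_email": "not-an-email"})
    message = None
except PostProcessError as error:
    # In Parseur, this message would appear in the document logs
    message = str(error)

print(message)
```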
Useful keyboard shortcuts
The following shortcuts are useful to make writing your code more efficient and pleasant.
Ctrl+S to save
Ctrl+F to search
Ctrl+/ to comment out the current line or a block of code
Ctrl+D to delete the current line or block of code
Ctrl+L to go to line
Tab to indent
Shift+Tab to outdent
Note: replace Ctrl with Cmd if using macOS.
Examples of the most common post processing use cases
Merge two or more fields
Option 1: for simple field merging, use the + notation
data["full_name"] = data["first_name"] + ' ' + data["last_name"]
Option 2: for more complex manipulations, use the f-string notation
data["description"] = f"Notes: status {data['status'].upper()}"
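When some of the fields to merge may be empty or missing, joining only the non-empty parts avoids stray spaces and KeyError exceptions. This is a sketch with hypothetical field names and values.

```python
# Hypothetical sample standing in for the `data` dict that Parseur provides.
data = {"first_name": "Ada", "middle_name": "", "last_name": "Lovelace"}

parts = [
    data.get("first_name"),
    data.get("middle_name"),
    data.get("last_name"),
]

# Keep only the parts that are present and not empty, then join with spaces
data["full_name"] = " ".join(part for part in parts if part)

print(data["full_name"])
```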
Split a field into sub fields
Use the split() method to split a field into sub fields.
Let's say the field vehicle contains Kia, Stinger, GT (red Leather) and you want to split this field into make, model and variant. You can use the following code:
make, model, variant = data["vehicle"].split(", ")
data["make"] = make # will store Kia
data["model"] = model # will store Stinger
data["variant"] = variant # will store GT (red Leather)
Optionally, you can limit the number of splits:
make, variant = data["vehicle"].split(", ", 1)
data["make"] = make # will store Kia
data["variant"] = variant # will store Stinger, GT (red Leather)
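Unpacking a split() result raises a ValueError when the field contains fewer parts than expected. One possible way to guard against that is to pad the result, as sketched below; the split_vehicle helper is hypothetical.

```python
def split_vehicle(value):
    # Split on at most 2 commas and trim the spaces around each part
    parts = [part.strip() for part in value.split(",", 2)]
    # Pad with empty strings so there are always exactly 3 parts
    return parts + [""] * (3 - len(parts))


# Hypothetical sample standing in for the `data` dict that Parseur provides.
data = {"vehicle": "Kia, Stinger"}
data["make"], data["model"], data["variant"] = split_vehicle(data["vehicle"])

print(data["make"], data["model"], repr(data["variant"]))
```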
Work with optional fields
Using the data["..."] notation will raise a KeyError error if the field is not present in the parsed data. This can be a problem if you have several templates, some of them with optional fields.
When this happens, you can use the data.get() method instead.
# the following will raise an error
# if the field named "option" is not present
data["description"] = data["option"]
# Use the get() method instead
data["description"] = data.get("option")
# You can also supply a default value
data["description"] = data.get("option", "No options.")
Iterate on a table field
Use a for-loop with enumerate() to walk through a table and get the current index and value of each row.
Example: let's say you have an items table field with quantity, description and unit_price columns, and you want to add a new price value for each item as well as a grand_total field.
data["grand_total"] = 0
for index, item in enumerate(data["items"]):
    price = item["quantity"] * item["unit_price"]
    data["items"][index]["price"] = price
    data["grand_total"] = data["grand_total"] + price
Note: you need to make sure quantity and unit_price are numbers and not strings. You can use the Number format for that or perform the conversion in Python directly using int() or decimal.Decimal().
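If the columns arrive as strings, the conversion can be done inside the loop. The sketch below assumes hypothetical sample values and stores the results as strings; the import line is only needed outside Parseur.

```python
import decimal  # only needed outside Parseur

# Hypothetical sample standing in for the `data` dict that Parseur provides.
data = {
    "items": [
        {"quantity": "2", "description": "Widget", "unit_price": "9.90"},
        {"quantity": "1", "description": "Gadget", "unit_price": "4.50"},
    ]
}

grand_total = decimal.Decimal("0")
for item in data["items"]:
    # Convert the string columns to numbers before multiplying
    price = int(item["quantity"]) * decimal.Decimal(item["unit_price"])
    item["price"] = str(price)
    grand_total += price

data["grand_total"] = str(grand_total)
print(data["grand_total"])
```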
Work with dates and times
For most use cases, you can use the date and time field formats. However, sometimes, you may need to compute a date from different fields. In those situations, you can use the datetime and dateutil modules and the format_datetime() method to manipulate dates.
Example 1: Parse date string and convert format
Let's say you have a start_date field formatted as Sunday 1 November 2020 and another start_time field formatted as 1PM, and you want to create a new start_datetime field formatted as YYYY-MM-DD HH:MM.
Use the format_datetime(date_str, format) convenience method to parse and format a date.
datetime_str = data["start_date"] + " " + data["start_time"]
data["start_datetime"] = format_datetime(datetime_str, "%Y-%m-%d %H:%M")
# start_datetime is now "2020-11-01 13:00"
For a list of available python date and time formats, check out the date and time field format page (at the bottom).
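format_datetime() only exists inside Parseur. To try the same conversion in a plain Python script, a rough stand-in can be built from the standard datetime module, as sketched below. This is not a reimplementation of format_datetime(): it assumes the input always matches the exact layout of the example, with English day and month names.

```python
import datetime

# Hypothetical sample standing in for the `data` dict that Parseur provides.
data = {"start_date": "Sunday 1 November 2020", "start_time": "1PM"}

datetime_str = data["start_date"] + " " + data["start_time"]

# %A = weekday name, %d = day, %B = month name, %Y = year,
# %I = hour on a 12-hour clock, %p = AM/PM
parsed = datetime.datetime.strptime(datetime_str, "%A %d %B %Y %I%p")
data["start_datetime"] = parsed.strftime("%Y-%m-%d %H:%M")

print(data["start_datetime"])
```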
Example 2: Compute dates
Let's say you want to store the date for the next day, in a new field named tomorrow.
Use dateutil.relativedelta.relativedelta to compute a one day, positive delta, then add it to the current date and time provided by datetime.datetime.now().
# str() converts the datetime to text so the result can be serialized to JSON
data["tomorrow"] = str(
    datetime.datetime.now()
    + dateutil.relativedelta.relativedelta(days=1)
)
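The same idea applies to dates coming from the document rather than the current time. The sketch below derives a due date 30 days after a parsed invoice date using only the standard datetime module; the field names, the input format and the 30-day term are assumptions, and the import line is only needed outside Parseur.

```python
import datetime  # only needed outside Parseur

# Hypothetical sample standing in for the `data` dict that Parseur provides.
data = {"invoice_date": "2020-11-01"}

invoice_date = datetime.datetime.strptime(data["invoice_date"], "%Y-%m-%d")
due_date = invoice_date + datetime.timedelta(days=30)

# Store the result as a string so it can be exported
data["due_date"] = due_date.strftime("%Y-%m-%d")

print(data["due_date"])
```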
Work with regular expressions
Python regular expressions are (complex but) powerful methods to search for patterns in fields and perform splits and replacements.
To learn more about regexps, check out this introductory article about regular expressions in python.
Example 1: price to string (simple version)
This is a simple version that removes all text, spaces and $ symbol from a price field.
data["price"] = re.compile(r'[a-zA-Z $]+').sub('', data["price"])
Example 2: price to decimal (advanced version)
This is a full working example of how to use regular expressions to convert a price string including currency or text and potentially thousands separator into a decimal, including error management. This code creates a reusable function that can be called to convert fields at different locations in your post processing code.
REMOVE_FROM_DECIMAL = re.compile(r"[a-zA-Z\ \$£€\!\?]+")
def str_to_decimal(value):
    clean_value = REMOVE_FROM_DECIMAL.sub("", value)
    if clean_value == "":
        return None
    if "," in clean_value:
        if re.search(r",\d{3}", clean_value):
            # Comma as thousand separator
            clean_value = clean_value.replace(",", "")
        else:
            # Comma as decimal separator
            clean_value = clean_value.replace(",", ".")
    try:
        return decimal.Decimal(clean_value)
    except decimal.InvalidOperation:
        return None
Example 3: extract a value following a pattern
Use re.search() with a regular expression whose groups are identified by parentheses, then call group() on the returned match object to extract the pattern.
# let's assume data["room_details"] = "4 rooms, 3 bedrooms, 84 sqm"
# and we want to extract the area
area_match = re.search(r"(\d+) sqm", data.get("room_details", ""))
if area_match:
    data["area"] = area_match.group(1)
# => "area" is "84"
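A single regular expression can also capture several values at once, with one group per value. The sketch below reuses the same sample text and converts the captured strings to integers; the rooms and bedrooms field names are hypothetical, and the import line is only needed outside Parseur.

```python
import re  # only needed outside Parseur

# Hypothetical sample standing in for the `data` dict that Parseur provides.
data = {"room_details": "4 rooms, 3 bedrooms, 84 sqm"}

match = re.search(r"(\d+) rooms, (\d+) bedrooms", data.get("room_details", ""))
if match:
    # group(1) is the first parenthesized group, group(2) the second
    data["rooms"] = int(match.group(1))
    data["bedrooms"] = int(match.group(2))

print(data["rooms"], data["bedrooms"])
```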
Add business logic
Write conditional logic statements to trigger different behaviors depending on incoming parsed data.
Example: let's say we are a food delivery company using Parseur to manage incoming orders. Restaurants forward all orders to us but we are only interested in parsing delivery orders, not pickup orders. Also, we want to make sure to log an error if we can't determine the order type.
The type of order is extracted into the order_type field by our templates.
if "order_type" not in data:
    # something is wrong, the intern must have again forgotten
    # to add the order_type field when creating this template.
    # Let's stop here and log an error.
    raise PostProcessError(f"Error: order_type not found in parsed data. Please check template {extra['Template']}.")
if data["order_type"].lower() == "pickup":
    # all order types including pickup, Pickup or PICKUP are skipped
    return None
# From here, we know the order is valid and is a delivery.
# Write the rest of the post processing code here
Most common error messages and solutions
Python error messages are usually quite expressive for the seasoned programmer. When you get an error message, you will get the line number where the error was triggered, making it easier to spot the problem.
SyntaxError: invalid syntax at statement: [...]
This means the code you wrote doesn't follow the Python syntax. It can happen for various reasons:
missing parenthesis, bracket or quote
assignment issues
misspelling keywords
If you can't find the reason for the syntax error, check out the following article on fixing most common python syntax errors.
IndentationError: unexpected indent at statement: [...]
An indent is a specific number of spaces or tabs denoting that a line of code is part of a particular code block.
This error means one of your blocks of code is wrongly indented. Python is a particular language in the sense that white spaces are significant. It is important that every statement from your main code block doesn't start with a space and that every statement in a sub block (for example in an if block) is indented with the same number of spaces.
To avoid this error, use the Tab key to indent your code consistently and the Shift+Tab key to outdent it.
KeyError: 'key'
This can happen when:
You try to access a key that doesn't exist in a dict (for example, a field name that doesn't exist in data; remember that Python is case-sensitive).
You try to access an element in an array at an index that doesn't exist (for example, you try to access items[10] but the items array only has 4 elements). Note that standard Python reports this case as an IndexError rather than a KeyError.
There are several ways to fix this error:
If you are dealing with an optional field, use the data.get("field_name") method instead of data["field_name"].
If you want to test if a field is present before working on it, you can use:
if "field_name" in data:
    # ... do something with data["field_name"]
If you are working with array indexes, you can check the index is valid with:
if index < len(my_array):
    # ... do something with my_array[index]
NameError: name '...' is not defined
This means that the built-in, method or module you are trying to use is not available. Check out the lists of available functions and modules earlier on this page.
Invalid return, Error: <error_message>
This means the post process data you are returning is invalid. For a return to be valid, it must be serializable in JSON.
For example, you cannot return a Python datetime object, because JSON doesn't have a datetime format. You need to convert it to a string first, for example using the format_datetime() method.
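The sketch below reproduces the problem outside Parseur using the standard json module, then fixes it by converting the value to a string. The json module is used here only to demonstrate the serialization failure and is not listed among the modules available in the editor; the field name and value are hypothetical.

```python
import datetime
import json

# Hypothetical sample standing in for the `data` dict that Parseur provides.
data = {"delivery": datetime.datetime(2020, 11, 1, 13, 0)}

# A datetime object cannot be serialized to JSON as is
try:
    json.dumps(data)
    failed = False
except TypeError:
    failed = True

# Converting it to a string first makes the data serializable
data["delivery"] = data["delivery"].strftime("%Y-%m-%d %H:%M")
serialized = json.dumps(data)

print(failed, serialized)
```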