This article describes how to use Parseur to parse a web page from a link found in an email or document.
Step 1: Create a Parseur mailbox
If you haven't already done so, create a mailbox and send your first document. Check out this page if you're unsure about how to get started.
Note: Parseur works best with machine-generated emails.
Step 2: Create a template to capture the email link
Once your email is in Parseur, create your template like any other template, which is as easy as Point & Click. If necessary, refer to this page for more information about creating templates.
Here, the only information we are interested in is the link, so let's select it and create a new field.
Note: if the URL you want to capture is inside an HTML link, switch to the Source (advanced) view to create your field, and select only the URL inside the href attribute of the link, not the surrounding tag or the link text.
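To make clear which part of the HTML the note refers to, here is a minimal Python sketch that pulls the href value out of an HTML link. It runs outside Parseur and is not how Parseur works internally. The sample snippet and the example.com URL are placeholders, and HrefCollector and extract_hrefs are hypothetical helper names.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the link we are interested in.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return the list of URLs found in href attributes, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical email snippet; the URL is a placeholder.
sample = '<p>Your report is ready: <a href="https://example.com/report/123">View report</a></p>'
print(extract_hrefs(sample))
```

In the Source view, the text to select corresponds to the value collected here: the URL between the quotes of href, without the quotes themselves.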
Now click on the edit button to the right of the field name, and change the format to "Linked Document".
Click Update then Create to save the template.
Two things are going to happen:
- Your email is going to be parsed and the link will be extracted.
- After a few seconds, you'll see the newly downloaded web page appear in the document queue.
Step 3: Create a template for the fetched web page
Now create a template for the web page by clicking on the + (plus) button.
Click Create and...
Step 4: Watch Parseur parse a web page and profit!
Creating the template will parse your document and extract the relevant data from it.
Now, every time you send a similar email with a link, the web page will be fetched and, if it matches one of your existing templates, its data will be parsed and extracted automatically.
Closing remarks and limitations
- Parseur is not limited to extracting links from emails. Any field in a template with the format "Linked Document" will be used to download documents and extract data. That means that you can fetch web pages from email attachments as well as from other web pages!
- Parseur charges you for the number of successfully processed documents. This means that fetching a web page from an email link and parsing that web page will count as 2 credits: one for the email and one for the web page.
- The web page behind the URL needs to be publicly accessible, without a login and password being required to view it.
- The web page behind the URL cannot be a "Single Page Application" (i.e. a page whose content is downloaded dynamically after the page first loads).
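The last two limitations can be checked before you build a template. The sketch below is an assumption-laden pre-check written in Python with the standard library only: it fetches the page as a plain HTTP client would (no login, no JavaScript) and looks for a piece of text you expect to extract. It does not reproduce how Parseur fetches pages. The function names, the User-Agent string, and the idea of searching for a known text are all illustrative choices.

```python
import urllib.error
import urllib.request


def fetch_static_html(url, timeout=10):
    """Fetch a page with a plain HTTP GET: no cookies, no login, no JavaScript."""
    request = urllib.request.Request(url, headers={"User-Agent": "link-precheck/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def appears_in_static_html(html, expected_text):
    """True if the text is already present in the HTML as first delivered."""
    return expected_text.lower() in html.lower()


def precheck(url, expected_text):
    """Return a short, human-readable verdict for the two page limitations."""
    try:
        html = fetch_static_html(url)
    except urllib.error.HTTPError as error:
        # Typical for pages behind a login (401/403) or missing pages (404).
        return f"not publicly accessible (HTTP {error.code})"
    except OSError as error:
        # Covers DNS failures, refused connections, and timeouts.
        return f"unreachable ({error})"
    if appears_in_static_html(html, expected_text):
        return "ok: expected text is in the initial HTML"
    return "expected text missing: the page may load content dynamically or require a login"
```

For example, calling precheck with the link from your email and a value you can see on the page in your browser gives a quick hint. If the value is visible in the browser but missing from the fetched HTML, the page is probably rendered by JavaScript and falls under the Single Page Application limitation.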
Infinite loop warning
Since the address of a fetched document appears in its subject, avoid creating a template that extracts the subject as a "Linked Document" field. Otherwise, Parseur will create a new document from the same link again and again.
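The following toy simulation in Python illustrates why this loops. It is only a model of the behavior described above, not Parseur's implementation: the function name, the parameters, and the max_documents cap are hypothetical, and the cap exists solely so the simulation terminates.

```python
def simulate(email_link, page_template_fetches_subject, max_documents=10):
    """Count the documents created from one email, under a simplified model.

    Model: the email template extracts email_link as a linked document.
    Each fetched page has its own address as its subject. If the page
    template also treats the subject as a linked document, the same
    address is fetched again.
    """
    documents = 1  # the original email
    pending = email_link  # link extracted by the email template
    while pending is not None and documents < max_documents:
        page_subject = pending  # the fetched page's subject holds its address
        documents += 1
        # A safe page template extracts no link from the subject.
        pending = page_subject if page_template_fetches_subject else None
    return documents


url = "https://example.com/page"  # placeholder address
print(simulate(url, page_template_fetches_subject=False))  # 2
print(simulate(url, page_template_fetches_subject=True))   # 10 (stopped by the cap)
```

With a safe page template, one email yields two documents, the email and the web page. With a page template that fetches the subject, the count only stops at the artificial cap.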