Download and parse a webpage from a link in a document

How to automatically download a webpage and parse its content

In this article, we are going to describe how to use Parseur to parse a webpage from a link in an email or document.

Step 1: Create a Parseur mailbox

If you haven't already done so, create a mailbox and send your first document. Check out this page if you're unsure about how to get started.

Note: Parseur works best with machine-generated emails.

Step 2: Create a template to capture the email link

Once your email is in Parseur, create your template like any other template, which is as easy as Point & Click. If necessary, refer to this page for more information about creating templates.

Here, the only information we are interested in is the link, so let's select it and create a new field.

Note: if the URL you want to capture is inside an HTML link, switch to the Source (advanced) view to create your field, and select only the URL inside the href attribute of the link.
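
To illustrate why the href attribute matters, here is a minimal Python sketch, independent of Parseur, that pulls the URL out of an HTML link. The visible link text ("View report") and the actual address differ, and only the href value is a usable URL. The sample HTML, the example.com address, and the names HrefCollector and extract_hrefs are assumptions made up for this illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return the list of href URLs found in the given HTML string."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical email body: the link text is not the URL itself.
sample = (
    '<p>Your report is ready: '
    '<a href="https://example.com/reports/123">View report</a></p>'
)
print(extract_hrefs(sample))  # prints ['https://example.com/reports/123']
```

In the Source (advanced) view, the part to select corresponds to the string this sketch returns: the value between the quotes of href, without the surrounding tag.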

Now click the edit button to the right of the field name, and change the format to "Linked Document".

Click Update, then Create, to save the template.

Two things are going to happen:

  1. Your email is going to be parsed and the link will be extracted.

  2. After a few seconds, you'll see the newly downloaded web page appear in the document queue.

Step 3: Create a template for the fetched web page

Now create a template for the web page by clicking on the + plus button.

Click Create and...

Step 4: Watch Parseur parse a web page and profit!

Creating the template will parse your document and extract the relevant data.

Now, every time you send a similar email with a link, the web page will be fetched. If it matches one of your existing templates, data from the web page will be parsed and extracted automatically.

Closing remarks and limitations

  • Parseur is not limited to extracting links from emails. Any field in a template with the format "Linked Document" will be used to download documents and extract data. That means that you can fetch web pages from email attachments as well as from other web pages!

  • Parseur charges you for the number of successfully processed documents, which means that fetching a web page from an email link and parsing that web page will count as 2 credits. If all you need from the original email is the link and nothing else, you can set that template as a Skip template: Skip templates don't consume credits and don't trigger an export, but they still download the document behind the link.
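
The credit arithmetic above can be sketched as a small Python function. This is only an illustration of the rule as described here, assuming one credit per successfully processed document; the function name credits_used and its parameters are hypothetical and not part of Parseur.

```python
def credits_used(email_template_is_skip, web_page_parsed=True):
    """Estimate credits for one email that links to one web page.

    Assumption: each successfully processed document costs one credit,
    and a document matched by a Skip template costs nothing.
    """
    email_credits = 0 if email_template_is_skip else 1
    page_credits = 1 if web_page_parsed else 0
    return email_credits + page_credits


print(credits_used(email_template_is_skip=False))  # prints 2
print(credits_used(email_template_is_skip=True))   # prints 1
```

Marking the email template as a Skip template therefore halves the cost of this workflow when the email itself carries no data you need.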

Known limitations of the Linked Document feature:

  • The document behind the URL needs to be publicly accessible, without requiring a login and password to view it.

  • The webpage behind the URL cannot be a "Single Page Application" (i.e. where content is dynamically downloaded using JavaScript after the page first loads).

  • If you don't see the downloaded document in your document list, check the logs of the original document to get more information about why it couldn't be downloaded. Contact us if you have any questions.
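
Before relying on a link, it can help to check how it behaves for an anonymous visitor. The sketch below is a rough pre-check written with Python's standard library, not a description of how Parseur fetches documents. The names classify_status and check_url, the User-Agent string, and the three verdict labels are assumptions chosen for this example.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status code to a rough accessibility verdict."""
    if 200 <= status < 300:
        return "public"
    if status in (401, 403):
        return "login required"
    return "unavailable"


def check_url(url, timeout=10):
    """Fetch url without cookies or credentials and classify the response."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "link-precheck/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return classify_status(error.code)
    except urllib.error.URLError:
        # DNS failure, refused connection, timeout, and similar errors.
        return "unavailable"


# Example usage (performs a real network request):
# print(check_url("https://example.com/"))
```

This check has blind spots: a site that redirects anonymous visitors to a login page usually answers with status 200, and a Single Page Application also returns 200 while its HTML contains little of the content you see in a browser. In both cases, compare the downloaded HTML with what the browser displays.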

Infinite loop warning

Because the address of a downloaded document appears in its subject, avoid creating a template that extracts the subject as a link and fetches it. Otherwise, Parseur will create a new document from the same link again and again.
