Download and parse a webpage from a link in a document

How to automatically download a webpage and parse its content

Updated this week

This article describes how to use Parseur to download and parse a webpage from a link found in an email or document.

Step 1: Create a Parseur mailbox

If you haven't already done so, create a mailbox and upload your first document.

Check out this page if you're unsure about how to get started.

Step 2: Create a template to capture the email link

Once your email is in Parseur, create your template like any other template, which is as easy as Point & Click. If necessary, refer to this page for more information about creating templates.

Here, the only information we are interested in is the link, so let's select it and create a new field.

Note: if the URL you want to capture is inside an HTML link, switch to the Source (advanced) view to create your field, and select only the URL inside the href attribute of the HTML link.
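
To make "the URL inside the href attribute" concrete, here is a minimal Python sketch that pulls href values out of an HTML fragment. It only illustrates which part of the markup the field should cover; it is not how Parseur extracts the link internally. The names `HrefCollector` and `extract_hrefs` and the sample email body are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return the list of link targets found in the given HTML string."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical email body; the URL is a placeholder.
sample = '<p>Your report is ready: <a href="https://example.com/report/123">View report</a></p>'
print(extract_hrefs(sample))
```

In the Source (advanced) view, the part to select corresponds to the string between the quotes of href, not the visible link text ("View report" in this sample).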

Now click the edit button to the right of the field name, and change the format to "Linked Document".

Click Update, then Create, to save the template.

Two things are going to happen:

  1. Your email is parsed and the link is extracted.

  2. After a few seconds, the newly downloaded web page appears in the document queue.

Step 3: Create a template for the fetched web page

Now create a template for the web page by clicking the + (plus) button.

Click Create and...

Step 4: Watch Parseur parse a web page and profit!

Creating the template will parse your document and extract the relevant data.

Now, every time you send a similar email with a link, the web page will be fetched. If it matches one of your existing templates, data from the web page will be parsed and extracted automatically.

Closing remarks

  • Parseur is not limited to extracting links from emails. Any field in a template with the format "Linked Document" will be used to download documents and extract data. That means that you can fetch web pages from email attachments as well as from other web pages!

  • Parseur charges for the number of successfully processed documents, so fetching a web page from an email link and then parsing that web page counts as 2 credits. If all you need from the original email is the link, you can set that template as a Skip template. Skip templates don't consume credits and don't trigger an export, but they still download the document behind the link.
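
The credit arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this article (one credit per successfully processed document, none for a document matched by a Skip template); the function name `credits_used` and the sample chains are hypothetical, and your actual usage is what your Parseur account reports.

```python
def credits_used(documents):
    """Count credits for a chain of processed documents.

    Each item is a (description, is_skip_template) pair. Following the
    article, a processed document costs one credit unless its template
    is a Skip template.
    """
    return sum(1 for _description, is_skip in documents if not is_skip)


# Email parsed with a regular template, then the fetched web page.
regular_chain = [("email with link", False), ("fetched web page", False)]

# Same chain, but the email template is set as a Skip template.
skip_chain = [("email with link", True), ("fetched web page", False)]

print(credits_used(regular_chain))  # 2
print(credits_used(skip_chain))     # 1
```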

Known limitations of the Linked Document feature:

  • Parseur cannot extract links from PDF documents. Link extraction works only with HTML and plain text documents.

  • The document behind the URL needs to be publicly accessible, without requiring a login and password to view it.

  • The webpage behind the URL cannot be a "Single Page Application" (i.e. a page where content is downloaded dynamically using JavaScript after the page first loads).

  • If you don't see the downloaded document in your document list, check the logs of the original document to get more information about why it couldn't be downloaded. Contact us if you have any questions.
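
If you suspect the Single Page Application limitation applies, one rough check is to look at the raw HTML of the page (for example, saved with your browser's "View page source") and see whether the value you want to extract is present before any JavaScript runs. The Python sketch below does that on an HTML string. It is a heuristic under the assumption that the downloaded document contains only the static markup; `TextCollector`, `appears_in_static_html`, and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gather visible text, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def appears_in_static_html(html, expected_text):
    """Return True if expected_text is in the page text without running scripts."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return expected_text in " ".join(collector.chunks)


# Hypothetical pages: one server-rendered, one filled in by JavaScript.
static_page = "<html><body><h1>Order 42</h1></body></html>"
spa_page = '<html><body><div id="root"></div><script>render("Order 42")</script></body></html>'

print(appears_in_static_html(static_page, "Order 42"))  # True
print(appears_in_static_html(spa_page, "Order 42"))     # False
```

A False result for a value you can see in the browser suggests the content is rendered client-side, in which case the fetched document is unlikely to contain it.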

I found an error in the document logs saying 'Recursive download counter exceeded a safe threshold'. What does this mean?

This means Parseur reached a depth of 3 downloads from the original document and stopped there to prevent an infinite loop of downloads. If this happens, modify your field so that it captures a URL that does not link to itself.
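
The following Python sketch shows why a depth limit is needed when a captured URL points back to the page it came from. It is a simplified model, not Parseur's implementation: the limit of 3 comes from this article, while the function `follow_links`, the way depth is counted, and the sample URLs are assumptions made for illustration. No network access is involved.

```python
MAX_DEPTH = 3  # depth limit stated in the article


def follow_links(start, linked_from, max_depth=MAX_DEPTH):
    """Follow one link per document, stopping after max_depth downloads.

    `linked_from` maps a document to the URL that its "Linked Document"
    field captures (hypothetical data).
    Returns (downloaded_urls, threshold_exceeded).
    """
    downloaded = []
    current = start
    while current in linked_from:
        if len(downloaded) >= max_depth:
            # One more download would exceed the limit: stop here.
            return downloaded, True
        current = linked_from[current]
        downloaded.append(current)
    return downloaded, False


# A page whose captured URL points to itself would loop forever without a limit.
self_linking = {
    "email": "https://example.com/page",
    "https://example.com/page": "https://example.com/page",
}
print(follow_links("email", self_linking))

# A chain that ends after two downloads stays under the limit.
finite = {
    "email": "https://example.com/a",
    "https://example.com/a": "https://example.com/b",
}
print(follow_links("email", finite))
```

In the first case the sketch stops after three downloads and reports that the threshold was exceeded, which mirrors the log message. In the second case the chain ends on its own, so no error is raised.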
