
Have you always wanted to automatically extract information from websites with the help of AI, without building complex scrapers? Perfect, because today we’ll show you how to easily generate descriptions of websites from their URLs with the help of OpenAI’s AI models.
Since we use n8n in its cloud version as an automation toolkit, you can simply copy the workflow via this link and adapt it to your wishes in the course of the tutorial. This makes it even easier for you to follow.
Using n8n For Workflow Automation
To build our automation, we use the software n8n. n8n is an open-source workflow automation tool that is also available as a cloud version with a free trial period.
It makes it super easy for anyone, with or without special technical knowledge, to build automations.
If you have never had any contact with n8n, you can get an overview of how to get started with n8n here.
n8n Web Scraping – Template Overview
If you want to follow along with this tutorial, the easiest way is to copy the finished workflow into your n8n cloud account as an n8n template using this link. All you need to do is sign up for a free trial period.
Update: The template has been removed from the n8n Template Directory. So that you can still use it, I have made it available here via Google Drive. You simply have to copy the content of the JSON and paste it into your workspace.
This saves you a lot of time setting up the workflow and you only have to concentrate on the authorizations to external services and the customizations that are important to you.

Defining The Workflow Trigger
Every n8n automation workflow begins with a trigger. This defines the logic according to which your workflow is started. Here you have the choice between:
- On App Event
- On A Schedule
- On Webhook Call
- On Form Submission
- Manually
- When Called By Another Workflow
- And Many More…

For our prototype, it is sufficient to use the manual trigger.
If you want the workflow to run permanently, a different type of trigger is recommended, for example an On A Schedule trigger or a trigger based on changes to the data source (here, Google Sheets).
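The On A Schedule trigger can run on a fixed interval or on a custom cron expression. As a hypothetical example, the following cron expression would start the workflow every day at 9:00 in the morning:
0 9 * * *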
Get Input Data
In order to extract information about the website URLs, we first need to retrieve them from a source. This source can be Postgres, Airtable, Notion, or any other integration of your choice. In this template, however, we have used Google Sheets as our source. If you prefer a different integration, you can replace the Google Sheets node with your desired integration.
Using an app like Google Sheets allows us to store the URLs in a centralized location, making it easier for us to process them in our workflow.
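For illustration, a minimal sheet only needs a single column holding the URLs, named “url” here to match the field name used later in the workflow (the entries below are placeholders):
url
https://example.com
https://example.org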
If you have copied the workflow from the template, you only need to authenticate yourself with your own Google account and select the correct document and the sheet within it, and you can continue.

If you have made all the adjustments correctly, the workflow should pull the URLs you have entered in your table from your Google Sheets document up to this point and make them available for the next steps.
Batch Splitting
To effectively utilize the OpenAI integration in the upcoming steps, we need to split the data into batches. This is necessary to avoid any potential errors that may occur with the OpenAI integration.
Fortunately, n8n provides a handy helper node for this purpose. In the template, the data is divided into batches of 10, a size that is small enough to avoid overloading the OpenAI integration, but large enough to keep the number of loop iterations manageable.
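Conceptually, the Split In Batches node chunks the incoming items like this (a plain JavaScript sketch for illustration, not something you need to add to the workflow):
// Split the list of items into chunks of at most 10
const batchSize = 10;
const batches = [];
for (let i = 0; i < items.length; i += batchSize) {
  batches.push(items.slice(i, i + batchSize)); // one chunk per loop iteration
}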

Pull & Format Website Data
HTTP Request To The URL
To extract the information from the website using AI with OpenAI, we need to give the model an input to work with.
Since the AI cannot easily access the website itself, we relieve it of this step by querying the HTML code of the website (which contains all the information) via HTTP request.
This is very easy with the corresponding node in n8n, and with the template you need few, if any, adjustments.

You may only need to adjust the name under which n8n retrieved your data from the Google Sheet in the “url” field. So if your column is called Website or Domain instead of “url”, simply replace this part within the curly brackets with the correct name.
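For example, if your column were called Domain (a hypothetical name), the expression in the URL field would look like this:
{{ $json["Domain"] }}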
The output of this node should be the full HTML code of the URL the node received as input.
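In plain JavaScript terms, the node does roughly the following (an illustrative sketch only; the node handles this for you):
// Request the page and keep its raw HTML for the next steps
const response = await fetch(item.json.url);
const html = await response.text();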
To enhance the workflow’s effectiveness and assist the OpenAI GPT model in filtering out unnecessary noise from the website’s HTML code, we will implement two simple data transformation steps. These steps will help streamline the data for better processing.
HTML Extract
The first step is to separate the body of the HTML from the rest, as only the body contains the information we need; the rest is unimportant to us.

For this, n8n provides the HTML Extract node, in which we simply enter “body” as the key, “html” as the CSS selector, and “text” as the return value.
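Outside of n8n, the same extraction could be sketched with the cheerio library (purely illustrative; in the workflow, the HTML Extract node does this for you):
const cheerio = require('cheerio');
// Load the raw HTML retrieved by the HTTP Request node
const $ = cheerio.load(html);
// Keep only the text content, stored under the key "body"
const body = $('html').text();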
Clean Content
To make the code even easier to understand, we should now remove code-specific special characters. We do this with a code block.
The code is already included in the template and should normally work without any adjustments. Here again is the code for copy & paste:
if ($input.item.json.body) {
  // Trim leading/trailing whitespace, remove line breaks, and collapse repeated spaces
  $input.item.json.content = $input.item.json.body
    .replace(/^\s+|\s+$/g, '')
    .replace(/(\r\n|\n|\r)/gm, '')
    .replace(/\s+/g, ' ')
  // A shortened copy keeps the input within OpenAI's length limits
  $input.item.json.contentShort = $input.item.json.content.slice(0, 10000)
}
return $input.item
This code section removes line breaks such as “\n” and collapses redundant whitespace, effectively turning the HTML into one continuous line of text. In addition, it creates a “short” version of the content, because input that is too long could result in errors when passing it to OpenAI.
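To illustrate, a raw extract like “  Our\n  services:\r\n  consulting  ” would come out as “Our services: consulting”.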
We have now prepared the content of the website in such a way that the AI can extract the information optimally and all we have to do in the next step is create the query for the AI.
AI Data Extraction
To extract the data now, we use the n8n OpenAI integration. This is a very simple way to use OpenAI models such as GPT-4. All you need is an OpenAI Platform account including an API key.

You need this API key to authenticate yourself in n8n with your OpenAI account. To do this, simply create new credentials in the OpenAI node, save them, and you can immediately start sending requests to the AI.

For our purpose, we use one of OpenAI’s chat models with the Operation “Complete”. Which one you choose is up to you, but you have the choice between GPT-3.5, GPT-4 and many more.
In addition to the model we want to use, we must also define the prompts with which we instruct the model what to do. We will use two different types of prompts: system prompts and user prompts.
System prompts give the model basic instructions on how to function. User prompts are specific user input. Based on the system prompt, the model then responds to these user prompts.
System Prompt:
Your Input is the HTML content of a website of a company.
Your output should be a description of the services or products the company is offering.
Make the Description maximum 2 sentences. Focus on the core description of what the company does.
You can change the system prompt to suit your requirements. For example, if you want to know other things from the website data, you simply have to change the content in the System Prompt.
User Prompt:
Website Content: {{ $json["contentShort"] }}
The user prompt sets the cleaned HTML code as a variable.
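Under the hood, the n8n OpenAI node sends these two prompts to OpenAI’s Chat Completions API. A roughly equivalent direct call in JavaScript could look like this (a hypothetical sketch; the node handles all of this for you):
// Hypothetical direct call to OpenAI's Chat Completions API
const response = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: 'Bearer ' + process.env.OPENAI_API_KEY,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-3.5-turbo',
    messages: [
      { role: 'system', content: systemPrompt }, // the system prompt from above
      { role: 'user', content: 'Website Content: ' + contentShort }, // the cleaned HTML
    ],
  }),
});
const data = await response.json();
const description = data.choices[0].message.content;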

Split Up The Data
Because the OpenAI integration also outputs other metadata in addition to the answer itself, we have to split out the data relevant to us in a short step.
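If you rebuild this step yourself, a small Code node can achieve the same thing (a sketch only; the exact field path depends on the OpenAI node’s output, so inspect its output first):
// Keep only the generated text from the OpenAI node's output
// (the "message.content" path is an assumption; adjust it to the actual output)
$input.item.json.description = $input.item.json.message.content
return $input.item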

Merge the Data
To reassign the descriptions to the correct URLs in our Google Sheets document, we now need to merge the two currently separate data sets.

We take the data from the “Split in Batches” node as input 1, and the split out descriptions as input 2. Then we merge the whole thing with the n8n merge node based on its position in the data set.
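Merging by position simply pairs the first item of input 1 with the first item of input 2, the second with the second, and so on. Since the descriptions were generated from the batch items in order, each URL ends up next to its own description.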
Updating the Google Data
The last step is to assign and add the descriptions generated by the AI to the respective domain in our Google Sheet in order to have all the data stored together in one place.
To do this, we use the Google Sheets integration again, this time with the “Update Row” operation. Now we select the same document from which we retrieved the data and add our description to a new column.

To do this, we match the rows based on the URL, as shown here in the screenshot.
Continue the Loop
Now that we have completed the practical part of our workflow, we just need to connect the output of this node back to our “Split in Batches” node so that, if we have more than 10 URLs in our Google Sheet, the workflow continues to loop until we have descriptions for all of them.
I hope you enjoyed the tutorial and find a use case where this automation can be applied.




