Fetching webpage data
If you need to process a part of a webpage, this is the optimal solution.
On the dataset details page, open the "Import Data Source" section and select "From webpage". All imports are handled as tasks: a task can group multiple similar data entries for processing, but for web data scraping each task accepts only one webpage link.
Click the "New task" button to create a new task for importing data from a webpage.
In the popup window, first input a name for your task (up to 20 characters). This name will help you quickly identify and manage tasks in the task list.
In the "Web link" box, enter the URL of the webpage you want to scrape. The link must start with http or https.
To scrape multiple webpages, you can enter the address of a list page that uses pagination.
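The http/https requirement can be checked before submitting a link. A minimal sketch in Python (the helper name is our own, not part of the product):

```python
from urllib.parse import urlparse

def is_valid_task_url(url: str) -> bool:
    """True only if the URL uses http or https and names a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_valid_task_url("https://example.com/articles?page=1"))  # True
print(is_valid_task_url("ftp://example.com/data"))               # False
```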
Select the extraction type:
When choosing "List Page," the system will scrape all links and content listed on the page.
When selecting "Details Page," the system focuses on extracting content from a specific page, such as news articles or product details.
Set up pagination scraping:
If your target webpage has pagination (like multiple product pages or article lists), you can configure scraping rules in the "Pagination Settings."
Most multi-page targets expose a pagination control, so a pagination rule is usually needed. Once configured, the system will automatically scrape data from every paginated page.
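Conceptually, a pagination rule boils down to substituting a page number into a URL template and visiting each page in turn. The sketch below is illustrative only, not the product's actual configuration format:

```python
def paginated_urls(template: str, first: int, last: int):
    """Yield one URL per page from a template like '...?page={n}'."""
    for n in range(first, last + 1):
        yield template.format(n=n)

# A hypothetical three-page news list:
urls = list(paginated_urls("https://example.com/news?page={n}", 1, 3))
print(urls[0])  # https://example.com/news?page=1
```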
Set scraping depth:
By default, the system scrapes only the inputted URL. If you want to crawl more levels of pages (e.g., by following links to subsequent pages), adjust the scraping depth.
The default depth is 1, meaning only one level (the URL you entered) is scraped.
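Scraping depth can be pictured as a breadth-first crawl cut off after a fixed number of levels: depth 1 fetches only the entered URL, depth 2 also fetches the pages it links to, and so on. A conceptual sketch, where the `get_links` helper and the toy link graph are our own stand-ins for real fetching:

```python
from collections import deque

def crawl(start_url, get_links, depth=1):
    """Visit pages breadth-first, stopping after `depth` levels."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 1)])
    while queue:
        url, level = queue.popleft()
        order.append(url)  # a real crawler would fetch and parse `url` here
        if level < depth:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
    return order

# Toy link graph standing in for real pages:
links = {"home": ["a", "b"], "a": ["c"]}
get_links = lambda url: links.get(url, [])

print(crawl("home", get_links, depth=1))  # ['home']
print(crawl("home", get_links, depth=2))  # ['home', 'a', 'b']
```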
Set scraping frequency and time:
If you wish to periodically scrape webpage content, you can set the task's scraping frequency (e.g., hourly, daily, etc.).
After enabling scheduled scraping, stop it by manually deleting the task: click the delete button in the task list on the import page.
Scheduled scraping is useful for pages such as news lists, where the link stays the same but the content updates regularly.
When creating a task, it is essential to configure fetch parameters to help the system understand the webpage content to be scraped.
Choose Webpage Type:
Default Types:
List Page: Selecting this type will scrape all listed items on the page, such as article directories, product lists, etc.
Detail Page: Choosing this type will extract detailed content from a single page, like a single article or product details.
If scraping depth is enabled, a list-page task will also display the parameters for fetching detail-page data.
Custom Fields:
If you need to categorize specific extracted data into designated fields, click on "+ Add Field" and add field names and descriptions.
For example, if a nickname needs to be scraped from a webpage, the field name key could be: nickname; field description: user nickname.
Please use English when adding custom fields; the more detailed the description, the more accurate the extraction.
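To make the nickname example concrete, the sketch below writes the same field definition down as data, next to the record such a field might produce. The list-of-dicts shape is our own assumption, not the platform's internal schema:

```python
import json

# Field definitions as the user would enter them (English, detailed descriptions):
custom_fields = [
    {"key": "nickname", "description": "user nickname"},
    {"key": "title", "description": "title of the article or product"},
]

# A record the extractor might return for those fields (values are made up):
record = {"nickname": "alice", "title": "Example Article"}
print(json.dumps(record, ensure_ascii=False))
```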
After configuring advanced settings and fetch parameters, you will also need to set up output settings to determine how the scraped data should be saved and exported.
Set Output Format:
You can save the extracted data in JSON or Markdown format. JSON is better suited to downstream processing by programs and API calls, while Markdown is convenient for knowledge base data processing.
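To see the practical difference, here is a sketch that renders one hypothetical scraped record in both formats:

```python
import json

# Hypothetical scraped record; the real field set depends on your task.
record = {"title": "Example Article", "content": "Hello world."}

# JSON: structured and easy for downstream programs or API consumers to parse.
json_output = json.dumps(record, ensure_ascii=False, indent=2)

# Markdown: readable text, convenient for knowledge-base ingestion.
markdown_output = "# {title}\n\n{content}\n".format(**record)

print(json_output)
print(markdown_output)
```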
Select Output Data Content:
List Data Output:
If you are scraping list page content, you may choose to export only the list data.
Detail Data Output:
If you are scraping detail page content, you can opt to export the detail page.
When the task is a list page and scraping depth is enabled, you can also choose to output only the detail pages and exclude the list-page data.
Save and manually execute the task later:
If you want to configure the task now without starting the scrape immediately, click the "Save and manually execute the task later" button. The task is saved to the task list and can be started manually at any time.
Execute the task immediately:
If you are ready to scrape the webpage data right away, click the "Execute the task immediately" button. The system begins scraping and imports the data into the specified dataset.
On the import page, you can monitor the progress of your tasks in real-time. Click on the "Importing" tab to view the progress bar and detailed information of the task.
In case of task failure, the system will generate an error report to help you understand the cause of the failure and make necessary adjustments.