Web Crawler

The Web Crawler Source Connector allows you to integrate Vectorize's web crawler as a data source for your pipelines, ingesting unstructured textual content from specified web pages. This guide explains the configuration options available when setting up a Web Crawler source connector.

Configuration Fields

Field Summary

| Field | Description | Required |
|---|---|---|
| Name | A descriptive name to identify the connector within Vectorize | Yes |
| Seed URLs | A list of root URLs from which the crawler starts crawling | Yes |
| Allowed URLs | URL prefixes that the crawler is allowed to crawl, including their subpages | Yes |

Configuring the Integration

  1. From the main menu, click on Source Connectors

  2. Click New Source Connector

  3. Select the Web Crawler card

  4. Fill in the required fields:

    • Enter a descriptive name for the connector

    • Add one or more Seed URLs (click + Add for multiple starting points)

    • Specify Allowed URLs (click + Add to include additional URLs)

  5. Click Create Web Crawler Integration to test and save your configuration

Usage Guidelines

  • Seed URLs: These are the starting points for the crawler. Enter the root URLs of the websites you want to crawl.

  • Allowed URLs: Use this to restrict the crawler to specific URL prefixes and their subpages. This keeps the crawl focused and avoids ingesting unnecessary content.
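To make the relationship between Seed URLs and Allowed URLs concrete, here is a minimal Python sketch of prefix-based filtering. It is an illustration only: the function name `is_allowed`, the example.com URLs, and the exact matching rule (a URL is allowed if it equals an allowed prefix or sits beneath it on a path boundary) are assumptions, and Vectorize's actual matching logic may differ.

```python
def is_allowed(url: str, allowed_prefixes: list[str]) -> bool:
    """Return True if url equals an allowed prefix or is a subpage of one.

    Hypothetical helper: this models "URL prefixes and their subpages"
    as a simple string check on a path boundary.
    """
    for prefix in allowed_prefixes:
        base = prefix.rstrip("/")
        if url.rstrip("/") == base or url.startswith(base + "/"):
            return True
    return False


# Example values (placeholders, not real Vectorize settings).
seed_urls = ["https://docs.example.com/"]
allowed_urls = ["https://docs.example.com/guides"]

# Links a crawler might discover while following pages from the seed.
discovered = [
    "https://docs.example.com/guides/intro",
    "https://docs.example.com/guides-old/intro",
    "https://blog.example.com/post",
]

to_crawl = [url for url in discovered if is_allowed(url, allowed_urls)]
print(to_crawl)
```

Under these assumptions only the first discovered link is kept: the second shares a string prefix but is not a subpage, and the third is on a different host.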

Usage in Pipelines

When configuring a pipeline, you can select this Web Crawler integration as the data source. Vectorize will then crawl and ingest content from the specified URLs according to the configuration.

Best Practices

  1. Start with a limited set of Seed URLs to test the crawler's behavior.

  2. Use Allowed URLs to create a focused crawl, especially for large websites.

  3. Respect robots.txt files and website crawling policies.

  4. Monitor the crawl process to ensure you're getting the desired content.
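Before adding a site, you may want to check which of your intended URLs its robots.txt permits. The sketch below uses Python's standard `urllib.robotparser` for that. The robots.txt content, the example.com URLs, and the helper names are made up for illustration, and this is a local pre-check, not a description of how Vectorize's crawler handles robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (an assumption for this example). For a live
# site you could instead call parser.set_url(".../robots.txt") and
# parser.read() to fetch the real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def crawlable(parser: RobotFileParser, urls: list[str], user_agent: str = "*") -> list[str]:
    """Return the URLs that the parsed rules allow for the given user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/docs/",
    "https://example.com/private/report",
]
print(crawlable(parser, candidates))
```

URLs that the rules disallow are good candidates to leave out of your Seed URLs and Allowed URLs.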

Troubleshooting

If you encounter issues while creating or using the integration:

  • Verify that your Seed URLs are accessible and correct

  • Ensure that Allowed URLs are properly formatted and include all necessary subdomains or paths

  • Check if the target websites have any crawling restrictions
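Some of these checks can be run locally before you contact support. The following Python sketch flags URLs that lack an http or https scheme or a host, and lists seed hosts that no Allowed URL mentions. The helper names and example URLs are hypothetical. The idea that a seed host missing from Allowed URLs deserves a second look is an assumption, not documented Vectorize behavior.

```python
from urllib.parse import urlsplit


def format_problems(urls: list[str]) -> list[str]:
    """Describe URLs that lack an http(s) scheme or a host."""
    problems = []
    for url in urls:
        parts = urlsplit(url.strip())
        if parts.scheme not in ("http", "https"):
            problems.append(f"{url!r}: missing or unsupported scheme")
        elif not parts.hostname:
            problems.append(f"{url!r}: missing host")
    return problems


def hosts(urls: list[str]) -> set[str]:
    """Collect the host names that can be parsed from the given URLs."""
    parsed = (urlsplit(url.strip()).hostname for url in urls)
    return {host for host in parsed if host}


def uncovered_hosts(seed_urls: list[str], allowed_urls: list[str]) -> list[str]:
    """List seed hosts that do not appear in any Allowed URL."""
    return sorted(hosts(seed_urls) - hosts(allowed_urls))


# Example values (placeholders).
seeds = ["https://docs.example.com/", "https://blog.example.com/"]
allowed = ["https://docs.example.com/guides"]

print(format_problems(seeds + allowed))
print(uncovered_hosts(seeds, allowed))
```

This sketch does not test whether a URL is reachable. Confirming accessibility still means opening the Seed URLs in a browser or with an HTTP client.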

For further assistance, please contact Vectorize support.
