Web Crawler
The Web Crawler Source Connector allows you to integrate Vectorize's web crawler as a data source for your pipelines, ingesting unstructured textual content from specified web pages. This guide explains the configuration options available when setting up a Web Crawler source connector.
Configuration Fields
Field Summary
Field | Description | Required |
---|---|---|
Name | A descriptive name to identify the connector within Vectorize | Yes |
Seed URLs | One or more root URLs from which the crawler starts crawling | Yes |
Allowed URLs | URL prefixes that the crawler may visit; only these URLs and their subpages are crawled | Yes |
Configuring the Integration
From the main menu, click on Source Connectors
Click New Source Connector
Select the Web Crawler card
Fill in the required fields:
Enter a descriptive name for the connector
Add one or more Seed URLs (click + Add for multiple starting points)
Specify Allowed URLs (click + Add to include additional URLs)
Click Create Web Crawler Integration to test and save your configuration
Usage Guidelines
Seed URLs: These are the starting points for the crawler. Enter the root URLs of the websites you want to crawl.
Allowed URLs: Use this to restrict the crawler to specific URL prefixes and their subpages. This keeps the crawl focused and avoids ingesting unnecessary content.
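To make the "URL and its subpages" idea concrete, here is a minimal Python sketch of prefix-based filtering. It is an illustration only: the function name `is_allowed` and the exact matching rule are assumptions, and this guide does not specify how Vectorize matches Allowed URLs internally.

```python
def is_allowed(url: str, allowed_prefixes: list[str]) -> bool:
    """Return True if url equals an allowed prefix or sits beneath it.

    Hypothetical helper: assumes simple path-prefix matching, which may
    differ from what the Vectorize crawler actually does.
    """
    for prefix in allowed_prefixes:
        base = prefix.rstrip("/")
        # Match the prefix itself, or anything under it as a subpage.
        if url == base or url.startswith(base + "/"):
            return True
    return False


allowed = ["https://example.com/docs"]
print(is_allowed("https://example.com/docs/intro", allowed))  # True
print(is_allowed("https://example.com/blog/post", allowed))   # False
```

Under this assumed rule, an Allowed URL of `https://example.com/docs` admits `https://example.com/docs/intro` but not `https://example.com/blog/post`, which is the kind of narrowing that is useful to plan before crawling a large site.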
Usage in Pipelines
When configuring a pipeline, you can select this Web Crawler integration as the data source. Vectorize will then crawl and ingest content from the specified URLs according to the configuration.
Best Practices
Start with a limited set of Seed URLs to test the crawler's behavior.
Use Allowed URLs to create a focused crawl, especially for large websites.
Respect robots.txt files and website crawling policies.
Monitor the crawl process to ensure you're getting the desired content.
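Before adding a site as a Seed URL, you can check its robots.txt rules yourself. The sketch below uses Python's standard `urllib.robotparser` module on robots.txt text you have already downloaded. The helper name `can_crawl` and the user agent `ExampleBot` are placeholders, and this guide does not state which user agent the Vectorize crawler sends or whether it enforces these rules on its own.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Check a URL against robots.txt content supplied as a string.

    Hypothetical helper; "ExampleBot" is a placeholder user agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
print(can_crawl(robots, "https://example.com/docs/"))         # True
print(can_crawl(robots, "https://example.com/private/page"))  # False
```

Parsing a string rather than fetching the file keeps the check offline and repeatable. To test a live site, download its `/robots.txt` first and pass the text in.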
Troubleshooting
If you encounter issues while creating or using the integration:
Verify that your Seed URLs are accessible and correct
Ensure that Allowed URLs are properly formatted and include all necessary subdomains or paths
Check if the target websites have any crawling restrictions
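A quick local check can catch malformed entries before you paste them into the connector form. The following Python sketch flags URLs that lack an `http` or `https` scheme or a host name. The function name `find_malformed_urls` is hypothetical, and the validation Vectorize itself applies may be stricter or different.

```python
from urllib.parse import urlsplit


def find_malformed_urls(urls: list[str]) -> list[str]:
    """Return the entries that are not absolute http(s) URLs.

    Hypothetical pre-flight check, not part of Vectorize.
    """
    bad = []
    for url in urls:
        try:
            parts = urlsplit(url.strip())
        except ValueError:
            # urlsplit raises ValueError for some invalid hosts.
            bad.append(url)
            continue
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(url)
    return bad


seeds = ["https://example.com/docs", "example.com/docs"]
print(find_malformed_urls(seeds))  # ['example.com/docs']
```

A common cause of a failed connector test is a missing scheme, as in the second entry above. This check only covers formatting, so it does not confirm that a page is reachable.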
For further assistance, please contact Vectorize support.