Web Crawler
The Web Crawler Source Connector allows you to integrate Vectorize's web crawler as a data source for your pipelines, ingesting unstructured textual content from specified web pages. This guide explains the configuration options available when setting up a Web Crawler source connector.
Configuration Fields
Field Summary
Configuring the Integration
From the main menu, click on Source Connectors
Click New Source Connector
Select the Web Crawler card
Fill in the required fields:
Enter a descriptive name for the connector
Add one or more Seed URLs (click + Add for multiple starting points)
Specify Allowed URLs (click + Add to include additional URLs)
Click Create Web Crawler Integration to test and save your configuration
Usage Guidelines
Seed URLs: These are the starting points for the crawler. Enter the root URLs of the websites you want to crawl.
Allowed URLs: Use this to restrict the crawler to specific URLs and their subpages. This helps in focusing the crawl and avoiding unnecessary content. You need to include the seed URL in the allowed URLs to ensure it's included in the crawl.
For example, your seed URL might be https://docs.vectorize.io, and the allowed URLs might be https://docs.vectorize.io and https://docs.vectorize.io/integrations.
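The relationship between seed URLs and allowed URLs can be sketched in a few lines of Python. This is only an illustration, not Vectorize's implementation: it assumes that "a URL and its subpages" means the same scheme and host with a matching path prefix, and the helper names is_allowed and uncovered_seeds are hypothetical.

```python
from urllib.parse import urlsplit


def is_allowed(url: str, allowed_urls: list[str]) -> bool:
    """Return True if url equals an allowed URL or is a subpage of one.

    Assumption: matching is by scheme + host + path prefix.
    """
    target = urlsplit(url)
    target_path = target.path.rstrip("/")
    for allowed in allowed_urls:
        base = urlsplit(allowed)
        if (target.scheme, target.netloc) != (base.scheme, base.netloc):
            continue
        base_path = base.path.rstrip("/")
        # Exact match, or the target sits below the allowed path.
        if target_path == base_path or target_path.startswith(base_path + "/"):
            return True
    return False


def uncovered_seeds(seed_urls: list[str], allowed_urls: list[str]) -> list[str]:
    """List the seed URLs that no allowed URL covers."""
    return [seed for seed in seed_urls if not is_allowed(seed, allowed_urls)]


seeds = ["https://docs.vectorize.io"]
allowed = ["https://docs.vectorize.io", "https://docs.vectorize.io/integrations"]

print(uncovered_seeds(seeds, allowed))  # an empty list means every seed is covered
print(is_allowed("https://docs.vectorize.io/integrations/foo", allowed))
print(is_allowed("https://vectorize.io/blog", allowed))
```

Running a check like uncovered_seeds before creating the connector is one way to catch a seed URL that was left out of the allowed list.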
Configuring Web Crawler for RAG Pipeline
Understanding Pipeline-Specific Configuration Properties for the Web Crawler
Reindex Interval (s)
Description: The time interval (in seconds) at which the crawler will recrawl the pages. This setting is useful for keeping the data fresh and up-to-date.
Behavior:
If you set the interval to a low value (e.g., 120 seconds), the crawler will recrawl the pages every 2 minutes and will update the data in the vector database.
If you want the crawler to recrawl the pages only rarely, set the interval to a high value (e.g., 86400 seconds for once a day).
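To compare candidate intervals, it can help to convert seconds into recrawls per day. The helper below is a hypothetical convenience function for that arithmetic and is not part of any Vectorize API.

```python
SECONDS_PER_DAY = 86400


def recrawls_per_day(interval_s: int) -> float:
    """Convert a reindex interval in seconds into recrawls per day."""
    if interval_s <= 0:
        raise ValueError("interval must be a positive number of seconds")
    return SECONDS_PER_DAY / interval_s


print(recrawls_per_day(120))    # 720.0 (every 2 minutes)
print(recrawls_per_day(86400))  # 1.0 (once a day)
```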
Best Practices
Start with a limited set of Seed URLs to test the crawler's behavior.
Use Allowed URLs to create a focused crawl, especially for large websites.
Respect robots.txt files and website crawling policies.
Monitor the crawl process to ensure you're getting the desired content.
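As a rough way to check a site's robots.txt rules before adding a seed URL, the sketch below uses Python's standard urllib.robotparser module. It parses robots.txt content that you supply as text, so it makes no network requests. The sample rules and the "*" user agent are assumptions for illustration; the user agent that the Vectorize crawler sends is not stated in this guide.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check whether robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for illustration only.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(can_crawl(sample_rules, "https://example.com/docs"))
print(can_crawl(sample_rules, "https://example.com/private/report"))
```

In this sample, the first URL is permitted and the second is blocked by the Disallow rule.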
Troubleshooting
If you encounter issues while creating or using the integration:
Verify that your Seed URLs are accessible and correct
Ensure that Allowed URLs are properly formatted and include all necessary subdomains or paths
Check if the target websites have any crawling restrictions
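A quick format check can rule out common URL mistakes before you contact support. The sketch below is an assumption about what "properly formatted" means here (an http or https scheme, a host, and no stray whitespace); the function name url_problems is hypothetical.

```python
from urllib.parse import urlsplit


def url_problems(url: str) -> list[str]:
    """Return a list of formatting problems found in url (empty if none)."""
    problems = []
    if url != url.strip():
        problems.append("leading or trailing whitespace")
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https"):
        problems.append("scheme must be http or https")
    if not parts.netloc:
        problems.append("missing host")
    return problems


for candidate in ["https://docs.vectorize.io", "docs.vectorize.io", " https://docs.vectorize.io"]:
    print(repr(candidate), url_problems(candidate))
```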
For further assistance, please contact Vectorize support.