Web Crawler

The Web Crawler Source Connector allows you to integrate Vectorize's web crawler as a data source for your pipelines, ingesting unstructured textual content from specified web pages. This guide explains the configuration options available when setting up a Web Crawler source connector.

Configuration Fields

Field Summary

Configuring the Integration

  1. From the main menu, click on Source Connectors

  2. Click New Source Connector

  3. Select the Web Crawler card

  4. Fill in the required fields:

    • Enter a descriptive name for the connector

    • Add one or more Seed URLs (click + Add for multiple starting points)

    • Specify Allowed URLs (click + Add to include additional URLs)

  5. Click Create Web Crawler Integration to test and save your configuration
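Conceptually, the settings gathered by these steps amount to a small configuration object. The sketch below is purely illustrative (the field names are assumptions, not Vectorize's actual API schema), but it shows how the pieces fit together:

```python
# Hypothetical shape of a Web Crawler source connector configuration;
# field names here are illustrative only, not Vectorize's real schema.
connector_config = {
    "name": "docs-crawler",            # descriptive connector name
    "seed_urls": [                     # starting points for the crawl
        "https://docs.vectorize.io",
    ],
    "allowed_urls": [                  # crawl restricted to these URLs and their subpages
        "https://docs.vectorize.io",
        "https://docs.vectorize.io/integrations",
    ],
}

# Sanity check mirroring the usage guideline below: every seed URL
# should also appear in the allowed URLs.
assert all(url in connector_config["allowed_urls"]
           for url in connector_config["seed_urls"])
```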

Usage Guidelines

  • Seed URLs: These are the starting points for the crawler. Enter the root URLs of the websites you want to crawl.

  • Allowed URLs: Use these to restrict the crawler to specific URLs and their subpages. This keeps the crawl focused and avoids unnecessary content. Note that each Seed URL must also be listed in the Allowed URLs to ensure it is included in the crawl.

For example, your seed URL might be https://docs.vectorize.io, and the allowed URLs might be https://docs.vectorize.io and https://docs.vectorize.io/integrations.
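One plausible way a crawler restricts itself to allowed URLs and their subpages is a simple prefix match. The sketch below assumes prefix matching for illustration; the actual matching rules Vectorize applies may differ:

```python
def is_allowed(url: str, allowed_prefixes: list[str]) -> bool:
    """Return True if `url` falls under one of the allowed URL prefixes.

    Illustrative prefix check only; Vectorize's real matching logic
    may be more sophisticated.
    """
    return any(url.startswith(prefix) for prefix in allowed_prefixes)

allowed = ["https://docs.vectorize.io"]

print(is_allowed("https://docs.vectorize.io/integrations/web-crawler", allowed))  # True
print(is_allowed("https://example.com/some-other-site", allowed))                 # False
```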

Configuring Web Crawler for RAG Pipeline

Understanding Pipeline-Specific Configuration Properties for the Web Crawler

Reindex Interval (s)

  • Description: The time interval (in seconds) at which the crawler will recrawl the pages. This setting is useful for keeping the data fresh and up-to-date.

  • Behavior:

    • If you set the interval to a low value (e.g., 120 seconds), the crawler will recrawl the pages every 2 minutes and update the data in the vector database.

    • If you only need infrequent updates, set the interval to a high value (e.g., 86400 seconds to recrawl once a day).
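The effect of the reindex interval can be sketched by computing the recrawl schedule it implies. This is only an illustration of the setting's arithmetic; Vectorize runs the scheduling for you:

```python
from datetime import datetime, timedelta

def recrawl_schedule(start: datetime, interval_seconds: int, cycles: int) -> list[datetime]:
    """Return the times at which recrawls would occur, given a reindex
    interval in seconds (illustrative only)."""
    step = timedelta(seconds=interval_seconds)
    return [start + i * step for i in range(1, cycles + 1)]

start = datetime(2024, 1, 1, 0, 0, 0)

# A 120 s interval recrawls every 2 minutes...
print(recrawl_schedule(start, 120, 3))
# ...while 86400 s yields one recrawl per day.
print(recrawl_schedule(start, 86400, 2))
```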

Best Practices

  1. Start with a limited set of Seed URLs to test the crawler's behavior.

  2. Use Allowed URLs to create a focused crawl, especially for large websites.

  3. Respect robots.txt files and website crawling policies.

  4. Monitor the crawl process to ensure you're getting the desired content.

Troubleshooting

If you encounter issues while creating or using the integration:

  • Verify that your Seed URLs are accessible and correct

  • Ensure that Allowed URLs are properly formatted and include all necessary subdomains or paths

  • Check if the target websites have any crawling restrictions
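One way to check a site's crawling restrictions yourself is to inspect its robots.txt. The offline sketch below uses Python's standard urllib.robotparser; in practice you would fetch robots.txt from the target site before crawling:

```python
from urllib.robotparser import RobotFileParser

def blocked_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `url` is disallowed by the given robots.txt content.

    Offline sketch: in practice, fetch the site's /robots.txt first.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)

# Sample robots.txt blocking one path for all user agents.
sample = """\
User-agent: *
Disallow: /private/
"""

print(blocked_by_robots(sample, "https://example.com/private/page"))  # True
print(blocked_by_robots(sample, "https://example.com/docs/page"))     # False
```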

For further assistance, please contact Vectorize support.
