Firecrawl
The Firecrawl Source Connector allows you to integrate Firecrawl as a data source for your pipelines, ingesting unstructured textual content from specified web pages. This guide explains the configuration options available when setting up a Firecrawl source connector.
Before starting, you'll need:
The API key for your Firecrawl account.
To configure a connector to your Firecrawl account:
Click Source Connectors from the main menu.
Click New Source Connector from the Source Connectors page.
Select the Firecrawl card.
Enter the integration name and your Firecrawl API key, then click Create Firecrawl Integration.
You can think of the Firecrawl connector as having two parts. The first is authorization with your API key. This part is reusable across pipelines and allows you to connect to the same account in different pipelines without providing the credentials every time.
The second part is the configuration that's specific to your RAG Pipeline, which allows you to specify which website(s) to crawl, and what data to retrieve.
Vectorize supports both Firecrawl's /crawl and /scrape endpoints.
The /crawl endpoint navigates through a website, starting from a base URL, to extract content from multiple linked pages.
The /scrape endpoint extracts content from a single specified URL.
Set the endpoint to Crawl, enter the JSON configuration, then click Save Configuration.
Here are some common configuration examples for different use cases. You can learn more about parameter options in Firecrawl's /crawl endpoint documentation.
Basic Documentation Site Crawl
This basic example crawls all Vectorize documentation, but limits the page depth to 25, and the total number of pages crawled to 100.
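A configuration for this case might look like the sketch below. The base URL is illustrative, and the maxDepth and limit parameter names follow Firecrawl's /crawl API; confirm them against the current Firecrawl documentation.

```json
{
  "url": "https://docs.vectorize.io",
  "maxDepth": 25,
  "limit": 100
}
```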
Path restrictions
This example restricts the crawl to those Vectorize documentation pages matching the "integrations/" path.
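A possible configuration, assuming Firecrawl's includePaths parameter and an illustrative path pattern:

```json
{
  "url": "https://docs.vectorize.io",
  "includePaths": ["integrations/.*"]
}
```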
This example restricts the crawl to those Vectorize documentation pages that do not match the "integrations/" path.
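The inverse case can be expressed with Firecrawl's excludePaths parameter (again, the pattern shown is illustrative):

```json
{
  "url": "https://docs.vectorize.io",
  "excludePaths": ["integrations/.*"]
}
```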
Set the endpoint to Scrape, enter the JSON configuration, then click Save Configuration.
Here are some common configuration examples for different use cases. You can learn more about parameter options in Firecrawl's /scrape endpoint documentation.
Basic Page Scrape
This basic example scrapes a single page, with no restrictions.
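A minimal /scrape configuration can be as simple as a single url field; the page URL shown here is illustrative:

```json
{
  "url": "https://docs.vectorize.io/integrations/firecrawl"
}
```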
Limited Page Scrape
This basic example scrapes a single page, restricting the formats to Markdown and HTML.
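A sketch of the same scrape restricted to Markdown and HTML output, assuming Firecrawl's formats parameter and an illustrative URL:

```json
{
  "url": "https://docs.vectorize.io/integrations/firecrawl",
  "formats": ["markdown", "html"]
}
```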
If you haven't yet built a connector to your vector database, go to Configuring Vector Database Connectors and select the platform you prefer to use for storing output vectors.
OR
If you're ready to start producing vector embeddings from your input data, head to Pipeline Basics. Select your new connector as the data source to use it in your pipeline.