
Firecrawl Source Connector

The Firecrawl Source Connector allows you to integrate Firecrawl as a data source for your pipelines, ingesting unstructured textual content from specified web pages. This guide explains the configuration options available when setting up a Firecrawl source connector.

Before you begin

Before starting, you'll need:

  • The API key for your Firecrawl account.

Configure the Connector

To configure a connector to your Firecrawl account:

  1. Click Source Connectors from the main menu.

  2. Click New Source Connector from the Source Connectors page.

  3. Select the Firecrawl card.

    Firecrawl Card

  4. Enter the integration name and your Firecrawl API key, then click Create Firecrawl Integration.

    Create Firecrawl Integration

Configuring the Firecrawl Connector in a RAG Pipeline

You can think of the Firecrawl connector as having two parts. The first is authorization with your API key. This part is reusable across pipelines, so you can connect to the same account in different pipelines without re-entering credentials.

The second part is the configuration that's specific to your RAG Pipeline, which allows you to specify which website(s) to crawl, and what data to retrieve.

Vectorize supports both Firecrawl's /crawl and /scrape endpoints.

The /crawl endpoint navigates through a website, starting from a base URL, to extract content from multiple linked pages.

The /scrape endpoint extracts content from a single specified URL.
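To make the difference concrete, here is a minimal Python sketch of the request bodies the two endpoints expect. The api.firecrawl.dev base URL and payload field names are taken from Firecrawl's public v1 REST API and are assumptions here; in practice the Vectorize connector makes these calls for you.

```python
# Illustrative only: the request bodies that map to the two endpoints.
# The api.firecrawl.dev URL and field names mirror Firecrawl's public
# v1 REST API; they are assumptions here, not Vectorize internals.

FIRECRAWL_API = "https://api.firecrawl.dev/v1"

def crawl_request(url, max_depth=25, limit=100):
    """Build (endpoint, payload) for a multi-page crawl starting at url."""
    return f"{FIRECRAWL_API}/crawl", {"url": url, "maxDepth": max_depth, "limit": limit}

def scrape_request(url, formats=None):
    """Build (endpoint, payload) for a single-page scrape of url."""
    payload = {"url": url}
    if formats:
        payload["formats"] = formats
    return f"{FIRECRAWL_API}/scrape", payload

endpoint, body = crawl_request("https://docs.vectorize.io/")
print(endpoint)      # https://api.firecrawl.dev/v1/crawl
print(sorted(body))  # ['limit', 'maxDepth', 'url']
```

The crawl payload carries traversal bounds (maxDepth, limit), while the scrape payload only names a single page and, optionally, the output formats.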

Configuring the /crawl Endpoint

Set the endpoint to Crawl, enter the JSON configuration, then click Save Configuration.

Configuring Firecrawl for RAG Pipeline

Example Configurations

Here are some common configuration examples for different use cases. You can learn more about parameter options in Firecrawl's /crawl endpoint documentation.

Basic Documentation Site Crawl

This basic example crawls the Vectorize documentation, limiting the crawl depth to 25 and the total number of pages crawled to 100.

```json
{
  "url": "https://docs.vectorize.io/",
  "maxDepth": 25,
  "limit": 100
}
```
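A rough sketch of how these two limits interact, using a toy in-memory link graph. This illustrates the semantics of a depth- and count-bounded crawl, not Firecrawl's actual implementation:

```python
from collections import deque

def bounded_crawl(start, links, max_depth, limit):
    """Breadth-first walk of `links` (page -> outgoing pages), stopping at
    `max_depth` hops from `start` or after `limit` pages, whichever first."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue and len(order) < limit:
        page, depth = queue.popleft()
        order.append(page)
        if depth < max_depth:
            for nxt in links.get(page, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return order

# Toy site: "/" links to two sections, each linking deeper.
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/b/1"], "/a/1": ["/a/2"]}
print(bounded_crawl("/", site, max_depth=1, limit=100))  # ['/', '/a', '/b']
print(bounded_crawl("/", site, max_depth=10, limit=3))   # ['/', '/a', '/b']
```

Whichever bound is hit first ends the crawl: maxDepth stops the walk from following links beyond a given distance from the base URL, while limit caps the total page count regardless of depth.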
Path restrictions

This example restricts the crawl to Vectorize documentation pages whose paths match the "integrations/*" pattern.

```json
{
  "url": "https://docs.vectorize.io/",
  "includePaths": ["integrations/*"]
}
```

This example restricts the crawl to Vectorize documentation pages whose paths do not match the "integrations/*" pattern.

```json
{
  "url": "https://docs.vectorize.io/",
  "excludePaths": ["integrations/*"]
}
```
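Conceptually, includePaths and excludePaths act as glob filters on the URL path. Below is a hedged sketch of that behavior using Python's fnmatch; matching patterns against the path with its leading slash stripped is an assumption, and Firecrawl's exact matching rules may differ:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def allowed(url, include_paths=None, exclude_paths=None):
    """Approximate include/exclude path filtering. Patterns are glob-matched
    against the URL path with the leading slash stripped (an assumption)."""
    path = urlparse(url).path.lstrip("/")
    if include_paths and not any(fnmatch(path, p) for p in include_paths):
        return False  # an include list is present and nothing matched
    if exclude_paths and any(fnmatch(path, p) for p in exclude_paths):
        return False  # an exclude pattern matched
    return True

print(allowed("https://docs.vectorize.io/integrations/firecrawl",
              include_paths=["integrations/*"]))  # True
print(allowed("https://docs.vectorize.io/tutorials/intro",
              include_paths=["integrations/*"]))  # False
```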

Configuring the /scrape Endpoint

Set the endpoint to Scrape, enter the JSON configuration, then click Save Configuration.

Configuring Firecrawl for RAG Pipeline

Example Configurations

Here are some common configuration examples for different use cases. You can learn more about parameter options in Firecrawl's /scrape endpoint documentation.

Basic Page Scrape

This basic example scrapes a single page with no additional options.

```json
{
  "url": "https://docs.vectorize.io/"
}
```
Limited Page Scrape

This example scrapes a single page, restricting the returned formats to Markdown and HTML.

```json
{
  "url": "https://docs.vectorize.io/",
  "formats": ["markdown", "html"]
}
```
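Each requested format then appears as a separate field on the scrape result. The sketch below assumes the {"data": {"markdown": ..., "html": ...}} shape from Firecrawl's documented v1 scrape response; treat the exact shape as an assumption:

```python
def extract_formats(response):
    """Pull the requested formats out of a scrape response. The nested
    {"data": {"markdown": ..., "html": ...}} shape mirrors Firecrawl's
    documented v1 response, but is an assumption here."""
    data = response.get("data", {})
    return {k: data[k] for k in ("markdown", "html") if k in data}

sample = {"success": True, "data": {"markdown": "# Docs", "html": "<h1>Docs</h1>"}}
print(extract_formats(sample))  # {'markdown': '# Docs', 'html': '<h1>Docs</h1>'}
```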

What's next?

  • If you haven't yet built a connector to your vector database, go to Configuring Vector Database Connectors and select the platform you prefer to use for storing output vectors.

    OR

  • If you're ready to start producing vector embeddings from your input data, head to Pipeline Basics. Select your new connector as the data source to use it in your pipeline.
