Create a RAG Pipeline with SingleStore and a Web Crawler
Approximate time to complete: 5-10 minutes, excluding prerequisites
This quickstart will walk you through creating and scheduling a pipeline that uses a web crawler to ingest data from the Vectorize documentation, creates vector embeddings using an OpenAI embedding model, and writes the vectors to a SingleStore vector database.
Before starting, ensure you have access to the credentials, connection parameters, and API keys as appropriate for the following:
A Vectorize account (Create one free here ↗ )
An OpenAI API Key (How to article)
A SingleStore account (Create one on SingleStore ↗ )
SingleStore offers two types of workspaces. Starter Workspaces are best for small-scale or experimental projects, while Standard Workspaces are designed for applications that need higher resources, scalability, and support for production environments.
When you create your SingleStore account, a Starter Workspace and a database are created and deployed for you. You can use these for this quickstart, or create a Standard Workspace to work with instead.
If you're using a Starter Workspace:
The Starter Workspace and database are automatically created when you create your SingleStore account.
Select the workspace, then click Access.
Save the username, then click the 3 dots and select Reset Password. Copy and securely save the password.
Go to the Overview tab, then click Connect and select Your App.
Save the host and the port from the connection string. You'll use these when you create your RAG pipeline in Vectorize.
If you're using a Standard Workspace:
Navigate to the SingleStore Cloud Portal and click + New Deployment.
Name your Workspace Group, select your cloud provider and region, and click Next.
Name the Workspace, optionally adjust the size and settings, then click Create Workspace.
If you're on SingleStore's free trial, a database containing MarTech data will be automatically added to the workspace you just created. You can ignore this for the purpose of the Vectorize quickstart. If you'd like to remove it, click on the 3 dots, then click Drop Database.
Click + Create Database.
Name your database, make sure it's attaching to the correct workspace, then click + Create Database.
Go to your workspace, then click Access.
Copy and save the username. Click Reset Password to set the password, then securely save the password.
Go to Workspaces, click the 3 dots, then click Connect Directly.
Save the host and the port from the connection string. You'll use these when you create your RAG pipeline in Vectorize.
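If you'd like to verify the saved credentials before continuing, SingleStore is MySQL wire-compatible and its official `singlestoredb` Python client accepts a standard MySQL-style DSN. A minimal sketch, assuming placeholder credentials (substitute the host, port, username, password, and database you saved above):

```python
# Assemble a DSN from the connection parameters saved in the previous steps.
# All values below are placeholders -- yours will differ.

def build_dsn(user: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble a MySQL-style DSN accepted by the singlestoredb client."""
    return f"{user}:{password}@{host}:{port}/{database}"

dsn = build_dsn("admin", "my-secret-password",
                "svc-example.aws-virginia-1.svc.singlestore.com", 3306, "quickstart_db")
print(dsn)

# To actually test the connection (requires `pip install singlestoredb`):
# import singlestoredb as s2
# with s2.connect(dsn) as conn:
#     with conn.cursor() as cur:
#         cur.execute("SELECT 1")
```

If the commented connection test succeeds, the same parameters will work in the Vectorize integration form below.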
Open the Vectorize Application Console ↗
From the dashboard, click on + New RAG Pipeline under the "RAG Pipelines" section.
Enter a name for your pipeline. For example, you can name it quickstart-pipeline.
Click on + New Vector DB to create a new vector database.
Select SingleStore from the list of vector databases.
In the SingleStore configuration screen, enter the parameters in the form using the SingleStore Parameters table below as a guide, then click Create SingleStore Integration.
| Field | Description | Required |
|---|---|---|
| Name | A descriptive name to identify the integration within Vectorize. | Yes |
| Host | The host URL from your workspace's connection string. | Yes |
| Port | The port from your workspace's connection string. | Yes |
| Database | The name of your database. | Yes |
| Username | The username to access your database. | Yes |
| Password | The password to access your database. | Yes |
Click on + New AI Platform.
Select OpenAI from the AI platform options.
In the OpenAI configuration screen:
Enter a descriptive name for your OpenAI integration.
Enter your OpenAI API Key.
Leave the default values for embedding model, chunk size, and chunk overlap for the quickstart.
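Chunk size and overlap control how each crawled page is split before embedding: overlapping windows preserve context that would otherwise be cut at chunk boundaries. A rough sketch of the idea, using character counts (the sizes here are illustrative, not Vectorize's exact defaults):

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each chunk's start advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)
            if text[i:i + chunk_size]]

# A 1000-character document with chunk_size=500 and overlap=50 yields
# chunks starting at offsets 0, 450, and 900.
chunks = chunk_text("a" * 1000, chunk_size=500, overlap=50)
print(len(chunks))  # → 3
```

Larger overlaps improve recall across boundaries at the cost of more (and more redundant) vectors stored per document.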
Click on Add Source Connector.
Choose the type of source connector you'd like to use. In this example, select Web Crawler.
Name your web crawler source connector, e.g., vectorize-docs.
Set Seed URL(s) to https://docs.vectorize.io.
Click Create Web Crawler Integration to proceed.
Accept all the default values for the web crawler pipeline configuration:
Throttle Wait Between Requests: 500 ms
Maximum Error Count: 5
Maximum URLs: 1000
Maximum Depth: 50
Reindex Interval: 3600 seconds
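These defaults map onto standard polite-crawler behavior: a delay between requests, plus caps on errors, total pages, and link depth. A simplified sketch of how the URL and depth limits bound a breadth-first crawl (the in-memory link graph below is a stand-in for real HTTP fetches):

```python
from collections import deque

# Toy link graph standing in for real pages: keys are URLs, values are out-links.
LINKS = {
    "https://docs.vectorize.io":   ["https://docs.vectorize.io/a",
                                    "https://docs.vectorize.io/b"],
    "https://docs.vectorize.io/a": ["https://docs.vectorize.io/c"],
    "https://docs.vectorize.io/b": [],
    "https://docs.vectorize.io/c": [],
}

def crawl(seed: str, max_urls: int = 1000, max_depth: int = 50) -> list[str]:
    """Breadth-first crawl bounded by total URL count and link depth."""
    seen, order = {seed}, []
    queue = deque([(seed, 0)])
    while queue and len(order) < max_urls:
        url, depth = queue.popleft()
        order.append(url)  # a real crawler would fetch and index the page here,
                           # sleeping ~500 ms between requests (the throttle setting)
        if depth < max_depth:
            for link in LINKS.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return order

print(crawl("https://docs.vectorize.io"))
```

With the defaults above, the crawl stops after 1000 pages or 50 link-hops from the seed, whichever comes first.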
Click Save Configuration.
Verify that your web crawler connector is visible under Source Connectors.
Click Next: Schedule RAG Pipeline to continue.
Accept the default schedule configuration.
Click Create RAG Pipeline.
The system will now create, deploy, and backfill the pipeline.
You can monitor the status changes from Creating Pipeline to Deploying Pipeline and Starting Backfilling Process.
Once the initial backfill is complete, the RAG pipeline will begin crawling the Vectorize docs and writing vectors to your SingleStore database.
Once the website crawling is complete, your RAG pipeline will switch to the Listening state, where it will stay until more updates are available.
After your pipeline is running, click on RAG Pipelines to open the RAG Sandbox for the pipeline.
In the RAG Sandbox, you can ask questions about the data ingested by the web crawler.
Type a question into the input field (e.g., "What are the key features of Vectorize?"), and click Submit.
The system will return the most relevant chunks of information from your indexed data, along with an LLM response.
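Under the hood, the sandbox embeds your question, finds the nearest stored chunks by vector similarity, and passes them to the LLM as context. The retrieval step can be sketched with cosine similarity (the three-dimensional vectors below are tiny stand-ins for real embedding vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["text"] for item in scored[:k]]

# Toy "index": each entry pairs a chunk of text with its embedding.
index = [
    {"text": "Vectorize builds RAG pipelines.", "vec": [0.9, 0.1, 0.0]},
    {"text": "SingleStore stores the vectors.", "vec": [0.1, 0.9, 0.0]},
    {"text": "Unrelated chunk.",                "vec": [0.0, 0.0, 1.0]},
]

print(top_k([0.8, 0.2, 0.0], index, k=2))
```

In the real pipeline, this similarity search runs inside SingleStore over the vectors the pipeline wrote, and the returned chunks are what you see cited alongside the LLM's answer.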
This completes the RAG pipeline quickstart. Your RAG pipeline is now set up and ready for use with SingleStore and Vectorize.