Create a RAG Pipeline with Zilliz Cloud and a Web Crawler

This quickstart will walk you through creating and scheduling a pipeline that uses a web crawler to ingest data from the Vectorize documentation, creates vector embeddings using an OpenAI embedding model, and writes the vectors to a Milvus vector database.

Milvus is the underlying vector database; Zilliz Cloud is the fully managed service of Milvus.

Before you begin

Before starting, ensure you have access to the credentials, connection parameters, and API keys for the services this quickstart uses: Vectorize, Zilliz Cloud, and OpenAI.

Step 1: Create a Zilliz Cloud Database

These instructions show how to create a cluster and database on the Zilliz Cloud free plan. A cluster is a managed instance of Milvus.

  1. Log in to Zilliz, and select Clusters in the menu.

  2. Select Create Free Cluster.

    Create Free Cluster
  3. Choose "Free," name your cluster, select your cloud region, then click Create.

    Configure and Create Cluster
  4. Save and securely store your username and password.

    Save Username and Password
  5. Your cluster will be created.

    Cluster Creation
  6. Once your cluster has been created, it'll show up as Running.

    Running Cluster

Step 2: Create a RAG Pipeline on Vectorize

Create a New RAG Pipeline

To configure a vector database integration to connect to your Zilliz Cloud instance:

  1. Click Vector Databases from the main menu.

  2. Click New Vector Database Integration from the Vector Databases page.

  3. Select the Milvus card.

    Milvus Card
  4. Enter the parameters in the form using the Milvus Parameters table below as a guide, then click Create Milvus Integration.

    Create Milvus Integration

Milvus Parameters

Field           | Description                                                      | Required
----------------|------------------------------------------------------------------|--------------------------------------------
Name            | A descriptive name to identify the integration within Vectorize. | Yes
Public Endpoint | The public endpoint for your cluster.                            | Yes
Token           | The cluster's token.                                             | Yes, unless you provide a username/password
Username        | The cluster's username.                                          | Yes, unless you provide a token
Password        | The cluster's password.                                          | Yes, unless you provide a token

When you specify your Milvus integration in your pipeline configuration, Vectorize writes vector data to your Milvus instance.
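The Required column of the Milvus Parameters table comes down to one rule: supply either a token or a username/password pair, along with the endpoint. A minimal Python sketch of that check follows. The function name `build_milvus_auth` is hypothetical and not part of Vectorize. The `uri` and `token` keys mirror the keyword arguments of pymilvus's `MilvusClient`, and joining the credentials as `username:password` is an assumption based on how Zilliz Cloud clusters accept tokens, so verify both against your client version.

```python
from typing import Optional


def build_milvus_auth(
    endpoint: str,
    token: Optional[str] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> dict:
    """Return connection settings, enforcing 'token OR username/password'."""
    if not endpoint:
        raise ValueError("Public Endpoint is required")
    if token:
        return {"uri": endpoint, "token": token}
    if username and password:
        # Assumption: the cluster accepts "username:password" as its token.
        return {"uri": endpoint, "token": f"{username}:{password}"}
    raise ValueError("Provide a token, or both a username and a password")
```

If you want to check the credentials outside Vectorize, the returned dictionary could be passed to a client as `MilvusClient(**settings)`.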

Configuring the Milvus integration in a RAG Pipeline

You can think of the Milvus integration as having two parts. The first is authorization with your Milvus cluster. This part is reusable across pipelines, so you can connect to the same cluster from different pipelines without providing the credentials every time.

The second part is the configuration that's specific to your RAG Pipeline. This is where you specify the name of the collection in your Milvus cluster. If the collection does not already exist, Vectorize will create it for you.

Create Milvus Integration
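The "create the collection if it does not already exist" behavior can be pictured as a check-then-create step. The sketch below is illustrative only and does not show how Vectorize implements it. It assumes a client object exposing `has_collection` and `create_collection`, as pymilvus's `MilvusClient` does; `ensure_collection` and `InMemoryClient` are hypothetical names, and the dimension of 1536 in the usage lines is an assumption that matches OpenAI models such as text-embedding-3-small rather than a value stated in this quickstart.

```python
class InMemoryClient:
    """Hypothetical stand-in for a Milvus client, used for dry runs only."""

    def __init__(self):
        self.collections = {}

    def has_collection(self, collection_name):
        return collection_name in self.collections

    def create_collection(self, collection_name, dimension):
        self.collections[collection_name] = dimension


def ensure_collection(client, name: str, dimension: int) -> bool:
    """Create `name` if it is missing. Return True only when it was created."""
    if client.has_collection(name):
        return False
    client.create_collection(collection_name=name, dimension=dimension)
    return True


client = InMemoryClient()
created_first = ensure_collection(client, "vectorize_docs", 1536)   # creates it
created_again = ensure_collection(client, "vectorize_docs", 1536)   # no-op
```

The second call leaves the existing collection untouched, which is why reusing a collection name across pipeline runs is safe in this sketch.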

Configure AI Platform

  1. Click on + New AI Platform.

    New AI Platform
  2. Select OpenAI from the AI platform options.

    Select OpenAI
  3. In the OpenAI configuration screen:

    • Enter a descriptive name for your OpenAI integration.

    • Enter your OpenAI API Key.

    Configure OpenAI
  4. Leave the default values for embedding model, chunk size, and chunk overlap for the quickstart.

    Set Embedding Model
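To see what chunk size and chunk overlap control, here is a minimal sketch of overlapping chunking. It splits on characters for simplicity; Vectorize may count tokens or split on other boundaries, so treat the function `chunk_text` and its behavior as assumptions for illustration, not the platform's implementation.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

A larger overlap repeats more context between neighboring chunks, which can help retrieval at the cost of storing more vectors.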

Add Source Connectors

  1. Click on Add Source Connector.

Web Crawler Source
  2. Choose the type of source connector you'd like to use. In this example, select Web Crawler.

Choose Web Crawler

Configure Web Crawler Integration

  1. Name your web crawler source connector, e.g., vectorize-docs.

  2. Set Seed URL(s) to https://docs.vectorize.io.

Configure Web Crawler
  3. Click Create Web Crawler Integration to proceed.

Configure Web Crawler Pipeline

  1. Accept all the default values for the web crawler pipeline configuration:

    • Throttle Wait Between Requests: 500 ms

    • Maximum Error Count: 5

    • Maximum URLs: 1000

    • Maximum Depth: 50

    • Reindex Interval: 3600 seconds

Web Crawler Pipeline Configuration
  2. Click Save Configuration.
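The crawler defaults listed above can be read as limits on a breadth-first crawl. The sketch below shows one plausible interpretation, with assumptions stated plainly: seed URLs count as depth 0, the crawl stops once the error count reaches the maximum, and the reindex interval is carried in the settings but not acted on. `CrawlLimits`, `crawl`, and the `fetch_links` callback are hypothetical names; the real crawler's semantics may differ.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class CrawlLimits:
    throttle_ms: int = 500          # Throttle Wait Between Requests
    max_errors: int = 5             # Maximum Error Count
    max_urls: int = 1000            # Maximum URLs
    max_depth: int = 50             # Maximum Depth
    reindex_interval_s: int = 3600  # Reindex Interval (not used by crawl)


def crawl(
    seed_urls: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    limits: Optional[CrawlLimits] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Breadth-first crawl that returns the URLs fetched successfully."""
    limits = limits or CrawlLimits()
    queue = deque()
    seen = set()
    for url in seed_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))  # assumption: seeds are depth 0
    visited: List[str] = []
    errors = 0
    while queue and len(visited) < limits.max_urls:
        url, depth = queue.popleft()
        try:
            links = list(fetch_links(url))
        except Exception:
            errors += 1
            if errors >= limits.max_errors:
                break  # assumption: too many errors ends the crawl
            continue
        visited.append(url)
        if depth < limits.max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        sleep(limits.throttle_ms / 1000)  # pause between requests
    return visited
```

Passing `fetch_links` and `sleep` as parameters keeps the sketch free of network calls, so the limits can be exercised against an in-memory link graph.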

Verify Source Connector and Schedule Pipeline

  1. Verify that your web crawler connector is visible under Source Connectors.

  2. Click Next: Schedule RAG Pipeline to continue.

Verify Source Connector

Schedule RAG Pipeline

  1. Accept the default schedule configuration.

  2. Click Create RAG Pipeline.

Schedule RAG Pipeline

Step 3: Monitor and Test Your Pipeline

Monitor Pipeline Creation and Backfilling

  1. The system will now create, deploy, and backfill the pipeline.

  2. Monitor the status as it changes from Creating Pipeline to Deploying Pipeline and then to Starting Backfilling Process.

Pipeline Creation
  3. Once the initial population is complete, the RAG pipeline will begin crawling the Vectorize docs and writing vectors to your Milvus collection.

Pipeline Backfilling

View RAG Pipeline Status

  1. Once the website crawling is complete, your RAG pipeline will switch to the Listening state, where it will stay until more updates are available.

Pipeline Listening State

Test Your Pipeline in the RAG Sandbox

  1. After your pipeline is running, click on RAG Pipelines to open the RAG Sandbox for the pipeline.

RAG Pipelines Page
  2. In the RAG Sandbox, you can ask questions about the data ingested by the web crawler.

  3. Type a question into the input field (e.g., "What are the key features of Vectorize?"), and click Submit.

Ask Questions in Sandbox
  4. The system will return the most relevant chunks of information from your indexed data, along with an LLM response.
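Returning "the most relevant chunks" is, at its core, a nearest-neighbor search over embeddings. The sketch below ranks toy two-dimensional vectors by cosine similarity to show the idea. The metric used by your pipeline is not stated in this quickstart, so cosine similarity is an assumption, and `cosine_similarity` and `top_k` are hypothetical helper names; in the real pipeline the vectors come from the OpenAI embedding model and the search runs inside Milvus.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Return the text of the k chunks whose vectors are closest to the query."""
    ranked = sorted(
        chunks,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy data: the vectors are made up for illustration, not real embeddings.
toy_chunks = [
    ("pipelines", [1.0, 0.0]),
    ("billing", [0.0, 1.0]),
    ("connectors", [0.9, 0.1]),
]
best = top_k([1.0, 0.0], toy_chunks, k=2)
```

The retrieved chunk texts are what an LLM would then receive as context when composing its answer.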

This completes the RAG pipeline quickstart. Your RAG pipeline is now set up and ready for use with Zilliz Cloud and Vectorize.
