Create a RAG Pipeline with Weaviate and a Web Crawler
Approximate time to complete: 5-10 minutes, excluding prerequisites
This quickstart will walk you through creating and scheduling a pipeline that uses a web crawler to ingest data from the Vectorize documentation, creates vector embeddings using an OpenAI embedding model, and writes the vectors to a Weaviate vector database.
Before starting, ensure you have access to the credentials, connection parameters, and API keys as appropriate for the following:
A Vectorize account (Create one free here ↗)
An OpenAI API Key (How to article)
A Weaviate account (Create one on Weaviate ↗)
Log in to Weaviate, navigate to Clusters, and click Create cluster.
Select a cluster type. For this quickstart, we'll use "Free."
Enter a cluster name, select the cloud region, then click Create.
Save and securely store your cluster's API key.
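Before moving on, it can help to confirm that the endpoint and API key you saved actually reach the cluster. The following is a minimal sketch, not part of the Vectorize workflow: it assumes the cluster exposes Weaviate's REST readiness probe at `/v1/.well-known/ready` and accepts the API key as a bearer token. The environment variable names `WEAVIATE_ENDPOINT` and `WEAVIATE_API_KEY` are placeholders chosen for this example.

```python
import os
import urllib.error
import urllib.request


def build_ready_request(endpoint: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the cluster's readiness probe."""
    base = endpoint.strip().rstrip("/")
    # Assumption: a bare host name should be reached over HTTPS.
    if not base.startswith(("http://", "https://")):
        base = "https://" + base
    return urllib.request.Request(
        base + "/v1/.well-known/ready",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def cluster_is_ready(endpoint: str, api_key: str, timeout: float = 10.0) -> bool:
    """Return True if the readiness probe answers with HTTP 200."""
    request = build_ready_request(endpoint, api_key)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        # Covers HTTP errors, DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Placeholder variable names; set them to your own cluster's values.
    endpoint = os.environ.get("WEAVIATE_ENDPOINT")
    api_key = os.environ.get("WEAVIATE_API_KEY")
    if endpoint and api_key:
        print("ready" if cluster_is_ready(endpoint, api_key) else "not ready")
```

If the check reports that the cluster is not ready, recheck the endpoint and key in the Weaviate console before entering them in Vectorize.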
Open the Vectorize Application Console ↗
From the dashboard, click on + New RAG Pipeline under the "RAG Pipelines" section.
Enter a name for your pipeline. For example, you can name it quickstart-pipeline.
Click on New Vector DB to create a new vector database integration.
Select Weaviate from the list of vector databases.
Enter the parameters in the form using the Weaviate Parameters table below as a guide, then click Create Weaviate Integration.
Weaviate Parameters

| Field | Description | Required |
| --- | --- | --- |
| Name | A descriptive name to identify the integration within Vectorize. | Yes |
| Endpoint | The cluster's endpoint. | Yes |
| API key | The cluster's admin API key. | Yes |
The Weaviate integration has two parts. The first is authorization with your Weaviate cluster. This part is reusable across pipelines, so you can connect to the same cluster from different pipelines without providing the credentials every time.
The second part is the configuration that's specific to your RAG pipeline. This is where you specify the name of the collection in your Weaviate cluster. If the collection does not already exist, Vectorize will create it for you.
Enter your collection name to complete configuration of your Weaviate integration for your RAG pipeline.
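The split between reusable credentials and per-pipeline configuration can be pictured as two small records. This is only an illustrative sketch: the class names, field names, and example values below are hypothetical and do not reflect how Vectorize stores integrations internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeaviateCredentials:
    """Part one: authorization details, shared across pipelines (hypothetical model)."""
    name: str
    endpoint: str
    api_key: str


@dataclass(frozen=True)
class WeaviateDestination:
    """Part two: pipeline-specific settings that point at one collection (hypothetical model)."""
    credentials: WeaviateCredentials
    collection: str


# Placeholder values for illustration only.
credentials = WeaviateCredentials(
    name="quickstart-weaviate",
    endpoint="https://example-cluster.invalid",
    api_key="REPLACE_ME",
)

# Two pipelines reuse the same credentials but write to different collections.
docs_destination = WeaviateDestination(credentials, collection="QuickstartDocs")
blog_destination = WeaviateDestination(credentials, collection="BlogPosts")
```

The point of the sketch is that changing the collection never requires re-entering the endpoint or API key.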
Click on New AI Platform.
Select OpenAI from the AI platform options.
In the OpenAI configuration screen:
Enter a descriptive name for your OpenAI integration.
Enter your OpenAI API Key.
Leave the default values for embedding model, chunk size, and chunk overlap for the quickstart.
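To see what chunk size and chunk overlap control, here is a minimal sketch of sliding-window chunking. It is an assumption-laden illustration rather than Vectorize's actual chunker: it counts characters, whereas a production chunker may count tokens or split on sentence boundaries, and the function name `chunk_text` is made up for this example.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(chunk)
```

A larger overlap repeats more context between neighboring chunks, which can help retrieval at chunk boundaries at the cost of storing more vectors.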
Click on Add Source Connector.
Choose the type of source connector you'd like to use. In this example, select Web Crawler.
Name your web crawler source connector, e.g., vectorize-docs.
Set Seed URL(s) to https://docs.vectorize.io.
Click Create Web Crawler Integration to proceed.
Accept all the default values for the web crawler pipeline configuration:
Throttle Wait Between Requests: 500 ms
Maximum Error Count: 5
Maximum URLs: 1000
Maximum Depth: 50
Reindex Interval: 3600 seconds
Click Save Configuration.
Verify that your web crawler connector is visible under Source Connectors.
Click Next: Schedule RAG Pipeline to continue.
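The crawler defaults accepted above (throttle, maximum error count, maximum URLs, maximum depth) can be read as limits on a breadth-first crawl. The sketch below shows one plausible interpretation and is not the Vectorize crawler: it assumes seed URLs are at depth 0, that the crawl stops once the error count reaches the maximum, and that page fetching is supplied by the caller through a hypothetical `fetch_links` function. The reindex interval is not modeled.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CrawlLimits:
    """Defaults mirror the quickstart values; the field names are illustrative."""
    throttle_ms: int = 500
    max_errors: int = 5
    max_urls: int = 1000
    max_depth: int = 50


def crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    limits: CrawlLimits = CrawlLimits(),
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Breadth-first crawl that returns successfully fetched URLs in visit order."""
    queue = deque()
    seen = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))  # assumption: seeds sit at depth 0

    visited = []
    errors = 0
    while queue and len(visited) < limits.max_urls and errors < limits.max_errors:
        url, depth = queue.popleft()
        if visited or errors:
            sleep(limits.throttle_ms / 1000)  # wait between consecutive requests
        try:
            links = list(fetch_links(url))
        except Exception:
            errors += 1  # count the failure and move on to the next URL
            continue
        visited.append(url)
        if depth >= limits.max_depth:
            continue  # do not follow links beyond the depth limit
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    # A tiny in-memory "site" stands in for real HTTP fetching.
    site = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": []}
    print(crawl(["a"], site.__getitem__, sleep=lambda seconds: None))
```

Passing `fetch_links` and `sleep` as parameters keeps the sketch testable without network access; a real crawler would fetch each page over HTTP and extract its links.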
Accept the default schedule configuration.
Click Create RAG Pipeline.
The system will now create, deploy, and backfill the pipeline.
You can monitor the status changes from Creating Pipeline to Deploying Pipeline and Starting Backfilling Process.
Once the initial population is complete, the RAG pipeline will begin crawling the Vectorize docs and writing vectors to your Weaviate collection.
Once the website crawling is complete, your RAG pipeline will switch to the Listening state, where it will stay until more updates are available.
After your pipeline is running, open the RAG Sandbox for the pipeline by clicking the RAG Sandbox link on the Pipeline Details page, or the magnifying glass icon on the RAG Pipelines page.
In the RAG Sandbox, you can ask questions about the data ingested by the web crawler.
Type a question into the input field (e.g., "What are the key features of Vectorize?"), and click Submit.
The system will return the most relevant chunks of information from your indexed data, along with an LLM response.
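Conceptually, "most relevant chunks" means the stored vectors that lie closest to the embedding of your question. The sketch below illustrates that idea with cosine similarity over toy vectors. The quickstart does not state which similarity metric the pipeline uses, so cosine similarity is an assumption here, and the function names and sample data are invented for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vector: list[float],
    chunks: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[tuple[float, str]]:
    """Return the k (score, text) pairs with the highest similarity to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real embedding vectors have many more dimensions.
    chunks = [
        ("chunk about pipelines", [1.0, 0.0]),
        ("chunk about billing", [0.0, 1.0]),
        ("chunk about both", [1.0, 1.0]),
    ]
    for score, text in top_k([1.0, 0.0], chunks, k=2):
        print(f"{score:.3f}  {text}")
```

In the sandbox, the retrieved chunks are then passed to an LLM, which produces the response shown alongside them.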
This completes the RAG pipeline quickstart. Your RAG pipeline is now set up and ready for use with Weaviate and Vectorize.