Create a RAG Pipeline with Elastic and a Web Crawler

RAG Pipeline Quickstart with Elastic

Approximate time to complete: 5-10 minutes, excluding prerequisites

This quickstart will walk you through creating and scheduling a pipeline that uses a web crawler to ingest data from the Vectorize documentation, creates vector embeddings using an OpenAI embedding model, and writes the vectors to an Elasticsearch vector database.

Before you begin

Before starting, ensure you have access to the credentials, connection parameters, and API keys for the following:

  • An Elastic Cloud account (used to create the Elasticsearch serverless project in Step 1)

  • An OpenAI API key (used to create the vector embeddings)

  • A Vectorize account (used to build and run the RAG pipeline)

Step 1: Create an Elasticsearch Deployment

Create and Configure Project

  1. Navigate to the Elastic Cloud console and click Create project under the Serverless projects section.

  2. Select Elasticsearch for building custom applications with your data, and click Next.

  3. Name your project (e.g., vectorize-quickstart).

  4. Under Configuration, choose Optimized for Vectors.

  5. Click Create project to initialize.

  6. Once initialization completes, click Continue.

Generate API Key and Save Connection Details

  1. Scroll down to the API Key section and click New to create a key.

  2. Enter a name for your key (e.g., vectorize-quickstart) and optionally set an expiration date.

  3. Click Create API key.

  4. Copy the generated API key and save it securely—you won't be able to retrieve it later.

  5. Copy your Elasticsearch endpoint URL as well. You'll need this to connect to your deployment.
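
If you want to confirm the endpoint and API key work before moving on, a quick optional check with the official Elasticsearch Python client looks roughly like this (the environment variable names are placeholders for the values you just saved):

```python
# pip install elasticsearch
import os

from elasticsearch import Elasticsearch

# The endpoint URL and API key are the values saved from the Elastic Cloud console.
es = Elasticsearch(
    os.environ["ELASTIC_ENDPOINT"],
    api_key=os.environ["ELASTIC_API_KEY"],
)

# A successful response confirms the endpoint and key are valid.
print(es.info())
```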

Step 2: Create a RAG Pipeline on Vectorize

Create a New RAG Pipeline

  1. From the dashboard, click on + New RAG Pipeline under the "RAG Pipelines" section.

  2. Enter a name for your pipeline. For example, you can name it quickstart-pipeline.

  3. Click on + New Vector DB to create a new vector database.

  4. Select Elastic Cloud from the list of vector databases.

  5. In the Elastic Cloud configuration screen:

    • Enter a descriptive name for your Elastic Cloud integration.

    • Enter the Host, Port, and your Elastic API Key.

  6. Provide the index name you want to use in Elasticsearch.

    • The Index Name can be the same as your pipeline name.
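
Vectorize creates and manages this index for you, so there is nothing to configure in Elasticsearch itself. If you are curious what the index looks like after the pipeline has run, you can inspect its mapping with the client from the earlier check (quickstart-pipeline is just the example name used here):

```python
# Assumes the `es` client from the connection check above.
# After the pipeline has run, the vector field appears with the "dense_vector" type.
mapping = es.indices.get_mapping(index="quickstart-pipeline")
print(mapping)
```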

Configure AI Platform

  1. Click on + New AI Platform.

  2. Select OpenAI from the AI platform options.

  3. In the OpenAI configuration screen:

    • Enter a descriptive name for your OpenAI integration.

    • Enter your OpenAI API Key.

  4. Leave the default values for embedding model, chunk size, and chunk overlap for the quickstart.
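
This step mirrors what the pipeline does to each crawled page: split the text into overlapping chunks and embed each chunk with the selected OpenAI model. The sketch below uses the OpenAI Python client; the model name, chunking logic, chunk size, and overlap are illustrative placeholders, not necessarily the defaults Vectorize shows in the UI:

```python
# pip install openai
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks (illustrative, not Vectorize's exact logic)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

page_text = "..."  # text extracted from a crawled page
chunks = chunk_text(page_text)

# One embedding vector per chunk; these vectors are what get written to Elasticsearch.
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
vectors = [item.embedding for item in response.data]
```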

Add Source Connectors

  1. Click on Add Source Connector.

  2. Choose the type of source connector you'd like to use. In this example, select Web Crawler.

Configure Web Crawler Integration

  1. Name your web crawler source connector, e.g., vectorize-docs.

  2. Set both Seed URL(s) and Allowed URL(s) or prefix(es) to https://docs.vectorize.io.

  3. Click Create Web Crawler Integration to proceed.
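
The Seed URL is where the crawl starts, and the Allowed URL prefix controls which discovered links are followed and indexed. The effect of the prefix is essentially this check (a simplified illustration, not Vectorize's crawler code):

```python
ALLOWED_PREFIXES = ["https://docs.vectorize.io"]

def is_allowed(url: str) -> bool:
    """Only URLs under an allowed prefix are crawled and indexed."""
    return any(url.startswith(prefix) for prefix in ALLOWED_PREFIXES)

print(is_allowed("https://docs.vectorize.io/quickstarts/"))  # True
print(is_allowed("https://vectorize.io/pricing"))            # False
```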

Configure Web Crawler Pipeline

  1. Accept all the default values for the web crawler pipeline configuration (the sketch after this list illustrates what these settings control):

    • Throttle Wait Between Requests: 500 ms

    • Maximum Error Count: 5

    • Maximum URLs: 1000

    • Maximum Depth: 50

    • Reindex Interval: 3600 seconds

  2. Click Save Configuration.
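
To make these settings concrete, here is a minimal sketch of how a crawler typically applies them: wait between requests, stop after too many errors or too many URLs, and stop following links past the maximum depth. The Reindex Interval simply controls how often the whole crawl repeats. This is an illustration of what the parameters mean, not Vectorize's implementation:

```python
# pip install requests
import re
import time
from collections import deque

import requests

def crawl(seed_url, allowed_prefix, throttle_ms=500, max_errors=5, max_urls=1000, max_depth=50):
    """Breadth-first crawl illustrating the quickstart's default crawler settings."""
    queue = deque([(seed_url, 0)])
    seen, pages, errors = {seed_url}, {}, 0
    while queue and len(pages) < max_urls and errors < max_errors:
        url, depth = queue.popleft()
        time.sleep(throttle_ms / 1000)           # Throttle Wait Between Requests
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            errors += 1                           # counts toward Maximum Error Count
            continue
        pages[url] = html
        if depth >= max_depth:                    # Maximum Depth reached; don't follow links
            continue
        for link in re.findall(r'href="([^"]+)"', html):
            if link.startswith(allowed_prefix) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages

# Example (not run here): crawl("https://docs.vectorize.io", "https://docs.vectorize.io")
```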

Verify Source Connector and Schedule Pipeline

  1. Verify that your web crawler connector is visible under Source Connectors.

  2. Click Next: Schedule RAG Pipeline to continue.

Schedule RAG Pipeline

  1. Accept the default schedule configuration.

  2. Click Create RAG Pipeline.

Step 3: Monitor and Test Your Pipeline

Monitor Pipeline Creation and Backfilling

  1. The system will now create, deploy, and backfill the pipeline.

  2. You can monitor the status as it changes from Creating Pipeline to Deploying Pipeline and then Starting Backfilling Process.

  3. During the backfilling process, the RAG pipeline crawls the Vectorize docs and writes the resulting vectors to your Elasticsearch index.

View RAG Pipeline Status

  1. Once the website crawling is complete, your RAG pipeline will switch to the Listening state, where it will stay until more updates are available.

  2. Your vector index is now populated. You can try it out in the RAG Sandbox: click RAG Pipelines in the left-hand menu.

Test Your Pipeline in the RAG Sandbox

  1. After your pipeline is running, open the RAG Sandbox for the pipeline by clicking the magnifying glass icon on the RAG Pipelines page.

  2. In the RAG Sandbox, you can ask questions about the data ingested by the web crawler.

  3. Type a question into the input field (e.g., "What are the key features of Vectorize?"), and click Submit.

  4. The system will return the most relevant chunks of information from your indexed data, along with an LLM response.
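
Under the hood, this is essentially a vector similarity search: the sandbox embeds your question, retrieves the nearest chunks from the Elasticsearch index, and passes them to an LLM. Here is a rough sketch of the retrieval step, reusing the es and OpenAI clients from the earlier snippets; the index name and the embedding field name are assumptions, so check your index mapping for the actual field names Vectorize writes:

```python
question = "What are the key features of Vectorize?"

# Embed the question with the same model used for indexing.
query_vector = client.embeddings.create(
    model="text-embedding-3-small", input=[question]
).data[0].embedding

# kNN search over the vector field; "embedding" is an assumed field name.
results = es.search(
    index="quickstart-pipeline",
    knn={"field": "embedding", "query_vector": query_vector, "k": 5, "num_candidates": 50},
)

for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```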

This completes the RAG pipeline quickstart. Your RAG pipeline is now set up and ready for use with Elastic and Vectorize.
