Creating a RAG Pipeline

This page relates to page-based plans.


Creating a RAG pipeline in Vectorize involves three main steps:

1. Configure the vector database and vectorization strategy.
2. Configure one or more source connectors to ingest data from.
3. Configure when the pipeline should update the vector indexes.

The sections below give detailed instructions for configuring each part of a RAG pipeline.

Create a New RAG Pipeline

From the Vectorize dashboard, select New RAG Pipeline from the RAG Pipelines section on the left sidebar.
This takes you to the pipeline configuration screen, where you name your pipeline and configure its components.

Configure Your Pipeline

You may configure the pipeline in any order, but the following steps are recommended:

First, select a source connector to ingest data from. You can either select an existing source connector or create a new one. Click Select Source to see the list of available source connectors.


Once you select the source connector you want to use, you will be presented with options to configure the connection to the source system. Each connector's documentation describes how to configure it.

Next, choose the extractor and chunker for your pipeline. The extractor is responsible for extracting text from the source data, while the chunker is responsible for breaking the text into smaller chunks. There are three options for the extraction strategy:
- Fast: This is our fastest and most lightweight extractor.
- Vectorize Iris: This extractor is excellent for PDFs and other complex documents.
- Mixed: This extractor uses both Fast and Vectorize Iris depending on the file type.

You may also choose the chunk size. Smaller chunks give more granular search results, while larger chunks preserve more context. Lastly, you can configure the chunk overlap, which is the number of words shared between consecutive chunks. Overlap helps preserve context across chunk boundaries and can improve search results.
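
To make the interaction between chunk size and chunk overlap concrete, here is a minimal Python sketch of word-based chunking. It is an illustration only and not Vectorize's actual chunker: the function name `chunk_words` is hypothetical, and measuring chunk size in words is an assumption made to keep the example simple.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words. Hypothetical helper for
    illustration; chunk size is assumed to be measured in words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and less than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = "one two three four five six seven"
for chunk in chunk_words(sample, chunk_size=4, overlap=2):
    print(chunk)
# one two three four
# three four five six
# five six seven
```

With an overlap of 2, the last two words of each chunk reappear at the start of the next one, so a sentence that straddles a chunk boundary is still partly visible in both chunks.
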

Next is the embedder configuration. The embedder generates the vector embeddings for the text data. You can choose between the built-in embedder or a custom embedder. The built-in embedder is free to use and is a great starting point for your pipeline. If you have a custom embedder, you can select it here.

Finally, select a vector database where the vector embeddings will be stored. You can choose between the built-in vector database or a custom vector database if you wish to view the embeddings in a different system.

Selecting Bring your own database will bring up a list of available vector databases you can integrate with. Refer to the documentation for each connector for details on configuring the integration.

Once you configure a vector database integration for a RAG Pipeline, it becomes available for reuse in future pipelines. If you choose to bring your own database, you will need to configure the collection/index name. You can use an existing collection/index, or Vectorize can automatically create a new one for you.
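
As a rough mental model of what the embedder and vector database steps do together, the following Python sketch embeds text chunks and stores them in a searchable index. Everything here is a toy stand-in: `toy_embed` is a hashed bag-of-words function rather than a real embedding model, and `InMemoryIndex` is a hypothetical class, not a Vectorize or vector database API.

```python
import hashlib
import math

DIM = 64  # arbitrary toy dimension; real embedding models use far more


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Map text to a unit-length vector by hashing each word into a bucket."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryIndex:
    """Hypothetical stand-in for a vector database collection/index."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float]]] = []

    def upsert(self, chunk: str, vector: list[float]) -> None:
        self._rows.append((chunk, vector))

    def search(self, query_vector: list[float], top_k: int = 1) -> list[str]:
        # Vectors are unit length, so the dot product is cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(query_vector, vec)), chunk)
            for chunk, vec in self._rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:top_k]]


index = InMemoryIndex()
for chunk in ["invoices are due in thirty days", "the server restarts every night"]:
    index.upsert(chunk, toy_embed(chunk))

print(index.search(toy_embed("when are invoices due"), top_k=1))
```

In a real pipeline the embedder is a trained model and the index lives in the vector database you selected, but the flow is the same: each chunk becomes a vector, and queries are matched against stored vectors by similarity.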

Configuring Metadata Extraction (Optional)

Metadata extraction allows you to automatically extract structured information from your documents based on defined schemas. This extracted metadata can enhance your retrieval capabilities by providing additional context and filtering options.

To configure metadata extraction for your pipeline:

1. In the pipeline editor, locate and enable the metadata extraction node.
2. Select the metadata schemas you want to apply to your documents. You can choose from:
   - System schemas: Pre-defined schemas for common document types
   - Custom schemas: Schemas you've created using the metadata schema editor
3. Configure metadata settings:
   - Add document metadata to chunks: When enabled, document-level metadata (like title, author, document type) will be added to the text of each chunk. This improves retrieval for documents that span many chunks.
   - Add section metadata to chunks: When enabled, section-level metadata (like part numbers, prices, technical details) will be added to the text of the chunk. This improves retrieval for documents with very specific value strings.

Adding metadata to chunks can significantly improve retrieval quality by making important information available during semantic search.
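
The idea behind the two "add metadata to chunks" settings can be sketched in a few lines of Python. This is an illustration under assumptions: the function name `add_metadata_to_chunk` and the `key: value` header layout are hypothetical, and the exact format Vectorize writes into chunk text is not specified on this page.

```python
def add_metadata_to_chunk(chunk_text: str, metadata: dict[str, str]) -> str:
    """Prepend metadata fields to a chunk's text as "key: value" lines.

    Hypothetical format for illustration only.
    """
    if not metadata:
        return chunk_text
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{header}\n\n{chunk_text}"


enriched = add_metadata_to_chunk(
    "Torque to 12 Nm.",
    {"title": "Service Manual", "part_number": "AB-123"},
)
print(enriched)
# title: Service Manual
# part_number: AB-123
#
# Torque to 12 Nm.
```

Because the metadata becomes part of the text that is embedded, a query mentioning the document title or a part number can match this chunk even though the original chunk text never contained those strings.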

For more information on metadata extraction, see Automatic Metadata Extraction.

Configuring S3 Archive (Optional)

Vectorize allows you to archive the results of your pipeline to an S3 bucket in addition to storing them in a vector database. This provides a backup of your pipeline results and enables additional use cases for the processed data.

To configure S3 archive for your pipeline:

1. During pipeline creation, after configuring your vector database, you'll have the option to configure an archive destination.
2. Select an existing S3 source connector that has been enabled for archive use. Only S3 connectors with the "Allow as archive destination" option enabled will be available for selection.
3. Optionally, specify an output path within the S3 bucket where the archived data should be stored.
4. Complete the pipeline configuration and deploy the pipeline.

When the pipeline runs, the results will be written to both the configured vector database and the S3 bucket.

Note: The S3 connector used for archiving must have write permissions to the bucket. See the Amazon S3 connector documentation for more details on the required permissions.

Save or Deploy the Pipeline

After configuring the pipeline, you can save it as a draft or deploy it immediately. If you save it as a draft, you can come back and deploy it later. If you deploy it immediately, the pipeline will start ingesting data and generating embeddings.

Backfilling

After deploying, your pipeline will always run immediately to backfill the vector index.


Auto Layout

You can toggle the "Auto Layout" option to automatically arrange the components of your pipeline for better visualization.

Done

You have successfully created a RAG pipeline in Vectorize. You can now start ingesting data and generating embeddings. If you have any questions or need help, please reach out to our support team.
