Understanding RAG Pipelines

RAG Pipelines in Vectorize

Vectorize RAG Pipelines let you quickly build vector search indexes from unstructured data sources such as documents, PDFs, and knowledge bases. Pipelines write directly to your own vector database, giving you complete ownership and control over your data. By converting unstructured data into vector embeddings and storing them in that database, a RAG Pipeline enables fast, real-time retrieval of relevant information.

RAG Pipeline Setup

Fast and Easy Setup

With the RAG Pipeline feature, your vector indexes can be fully populated within minutes, allowing you to quickly transform your unstructured data into a searchable format. The setup process is streamlined and efficient, so you can begin querying your data almost immediately, without the need for complex configurations or long delays.

High Observability and Visibility

The RAG Pipeline feature provides real-time visibility into how your data is processed and indexed. As your unstructured data sources change, the updates are propagated to the vector indexes so that the two stay synchronized. Because the indexes track the latest state of your data, you can be confident in the accuracy and relevance of the information retrieved.
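How Vectorize detects changes internally is not described here. One common way to keep an index synchronized with a source is to compare content fingerprints, and the sketch below illustrates that idea only. The names `fingerprint` and `plan_sync` are hypothetical, and the use of SHA-256 hashes is an assumption made for this example.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a stable hash of a document's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def plan_sync(
    indexed: dict[str, str], source: dict[str, bytes]
) -> tuple[list[str], list[str]]:
    """Compare indexed fingerprints with the current source contents.

    indexed maps document path -> fingerprint recorded at last indexing.
    source maps document path -> current raw bytes.
    Returns (paths to re-embed and upsert, paths to delete from the index).
    """
    current = {path: fingerprint(data) for path, data in source.items()}
    # New or modified documents: fingerprint is missing or differs.
    to_upsert = sorted(p for p, h in current.items() if indexed.get(p) != h)
    # Documents that disappeared from the source.
    to_delete = sorted(p for p in indexed if p not in current)
    return to_upsert, to_delete
```

Only documents whose fingerprint changed need to be re-chunked and re-embedded, which keeps incremental updates cheap compared with rebuilding the whole index.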

RAG Pipeline Details

Key Components of the RAG Pipeline Feature

Data Ingestion and Extraction: The RAG Pipeline can ingest unstructured data from a variety of sources, such as Amazon S3, Google Cloud, or local file systems. The data is then processed to extract meaningful text.
Chunking and Embedding: Once the text is extracted, it is split into chunks, and a vector embedding is generated for each chunk using a model such as OpenAI's text-embedding-3-small or text-embedding-3-large. These embeddings are stored in your vector database, so you retain full control over your data.
Vector Indexing: The generated embeddings are indexed in your own vector database (e.g., Pinecone, DataStax Astra), enabling fast, efficient real-time retrieval when querying for relevant information.
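The three stages above can be sketched end to end in a few lines. This is a minimal illustration, not Vectorize's implementation: `chunk_text`, `toy_embed`, `InMemoryIndex`, and `run_pipeline` are hypothetical names, the chunk size and overlap are arbitrary, and the hashed bag-of-words embedding stands in for a real model such as text-embedding-3-small so that the example runs without network access or API keys.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass
class Record:
    id: str
    text: str
    vector: list[float]


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Stand-in embedding: hash each token into a bucket, then L2-normalize."""
    vec = [0.0] * dims
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryIndex:
    """Stand-in for a vector database such as Pinecone or DataStax Astra."""

    def __init__(self) -> None:
        self.records: dict[str, Record] = {}

    def upsert(self, records: list[Record]) -> None:
        for record in records:
            self.records[record.id] = record

    def query(self, text: str, top_k: int = 3) -> list[Record]:
        q = toy_embed(text)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(q, r.vector)), r)
            for r in self.records.values()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [r for _, r in scored[:top_k]]


def run_pipeline(doc_id: str, text: str, index: InMemoryIndex) -> int:
    """Chunk, embed, and index one extracted document; return the chunk count."""
    records = [
        Record(id=f"{doc_id}#{n}", text=chunk, vector=toy_embed(chunk))
        for n, chunk in enumerate(chunk_text(text))
    ]
    index.upsert(records)
    return len(records)


if __name__ == "__main__":
    index = InMemoryIndex()
    run_pipeline("guide", "Vector search retrieves the chunks closest to a query.", index)
    for record in index.query("vector search", top_k=1):
        print(record.id, record.text)
```

In a real deployment the embedding call and the index would be replaced by the embedding model and vector database you configure; the overall shape (extract, chunk, embed, upsert, query) stays the same.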

S3 Archive

In addition to storing pipeline results in a vector database, Vectorize allows you to archive the results to an S3 bucket. This provides several benefits:

Data Backup: Maintain a copy of your processed data outside the vector database
Data Portability: Use the archived data with other tools and systems
Compliance: Meet data retention and backup requirements

The S3 archive feature writes the processed data to an S3 bucket that you configure during pipeline creation. You can reuse an existing S3 source connector for archiving, but it must be specifically authorized for archive use with the appropriate write permissions.

For more information on configuring S3 archive, see the Creating a RAG Pipeline and Amazon S3 connector documentation.

Benefits

Full Data Ownership: The RAG Pipeline uses your own vector database, so you retain complete control and ownership over your data.
Quick Setup: Populate your vector indexes within minutes, transforming your unstructured data into a searchable format almost immediately.
Real-time Updates: Ensure all changes to your unstructured data sources are reflected in the vector indexes, providing up-to-date information with full visibility.
