RAG Pipeline Quickstart with DataStax Astra
Approximate time to complete: 5-10 minutes, excluding prerequisites
This quickstart will walk you through creating and scheduling a pipeline that collects data from an Amazon S3 bucket, creates vector embeddings using an OpenAI embedding model, and writes the vectors to your DataStax Astra vector database.
Before you begin
Before starting, ensure you have access to the credentials, connection parameters, and API keys as appropriate for the following:
A Vectorize account (Create one free here)
An Amazon S3 bucket and IAM access keys (Walkthrough documentation)
An OpenAI API key (How-to article)
A DataStax Astra account (Create one on DataStax)
Step 1: Create a Serverless Astra Vector Database
Create and Configure Database
Log in to the DataStax Astra Portal.
From the sidebar, click on the Databases option.
On the top-right side of the "Serverless" section, click on the Create Serverless Database button.
In the pop-up dialog, select Serverless (Vector) as the deployment type.
Under "Configuration":
Database Name: Enter your desired name (e.g., quickstart-db).
Provider: Select Amazon Web Services from the dropdown.
Region: Select us-east-2.
Click Create Database.
After you click Create Database, your database will begin initializing.
Initialization can take several minutes. Wait until the database is active before continuing.
Generate Token and Copy API Endpoint
Once the database is active, click Generate Token from the right-hand side under "Application Tokens".
Copy the generated token and save it securely. You will not be able to retrieve this token again after closing the dialog.
On the database overview page, copy the API Endpoint.
You will use this endpoint when connecting to your database via API.
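Before pasting the endpoint and token into another tool, it can help to sanity-check what you copied. The sketch below is illustrative only and rests on assumptions that this guide does not state: that the API Endpoint has the form https://<database-id>-<region>.apps.astra.datastax.com, that application tokens start with AstraCS:, and that the Astra Data API accepts a findCollections command at /api/json/v1/<keyspace> with the token in a Token header. The function names and the default_keyspace value are hypothetical choices for this example. The request is only built, never sent.

```python
import json
import re
import urllib.request

# Assumed shape of an Astra API endpoint (not taken from this guide):
# https://<36-character database id>-<region>.apps.astra.datastax.com
ENDPOINT_PATTERN = re.compile(
    r"https://[0-9a-f-]{36}-[a-z0-9-]+\.apps\.astra\.datastax\.com"
)


def check_credentials(api_endpoint: str, token: str) -> list[str]:
    """Return a list of problems found in the copied values (empty if none)."""
    problems = []
    if not ENDPOINT_PATTERN.fullmatch(api_endpoint.strip().rstrip("/")):
        problems.append("API endpoint does not match the assumed Astra URL shape")
    # Assumption: application tokens begin with the prefix "AstraCS:".
    if not token.strip().startswith("AstraCS:"):
        problems.append("token does not start with 'AstraCS:'")
    return problems


def build_list_collections_request(
    api_endpoint: str, token: str, keyspace: str = "default_keyspace"
) -> urllib.request.Request:
    """Build, but do not send, a Data API request that lists collections."""
    url = f"{api_endpoint.strip().rstrip('/')}/api/json/v1/{keyspace}"
    body = json.dumps({"findCollections": {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Token": token.strip(), "Content-Type": "application/json"},
    )
```

If check_credentials returns an empty list, passing the built request to urllib.request.urlopen would send it; whether the response lists any collections depends on your database.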
Step 2: Create a RAG pipeline
Set Up Pipeline and Vector Database
Open the Vectorize application console.
From the dashboard, click + New RAG Pipeline under the "RAG Pipelines" section.
Enter a name for your pipeline. For example, you can name it quickstart-pipeline.
Click + New Vector DB to create a new vector database.
Select DataStax Astra from the list of vector databases.
In the DataStax configuration screen:
Enter a descriptive name for your DataStax integration. It can be the same as the database name in Astra but doesn't need to be.
Enter your DataStax API Endpoint.
Enter your DataStax Application Token.
In the DataStax Astra section, provide a name for your collection.
This is the collection where Vectorize will write your vector data.
Configure AI Platform
Click + New AI Platform.
Select OpenAI from the AI platform options.
In the OpenAI configuration screen:
Enter a descriptive name for your OpenAI integration.
Enter your OpenAI API Key.
Leave the default values for embedding model, chunk size, and chunk overlap for the quickstart.
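To make the chunk size and chunk overlap settings concrete, here is a minimal sketch of overlapping chunking. It is not Vectorize's implementation, which this guide does not describe: it splits on raw character counts, and the function name and the default values of 500 and 50 are hypothetical, chosen only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and less than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

For example, chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2) yields "abcd", "cdef", "efgh", and "ghij". A larger overlap preserves more context across boundaries at the cost of producing more chunks, and therefore more vectors.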
Add Source Connector
Click + Add source connector to add a source connector to your pipeline.
Choose Amazon S3 from the list of source connector options.
In the Amazon S3 configuration screen:
Name your integration. It can be the same as your bucket name, but it doesn't have to be.
Enter your Bucket Name exactly as it appears in AWS.
Provide the Access Key and Secret Key for your AWS IAM user.
Accept the default values for file extensions and other options.
Click Save Configuration.
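Because the bucket name must match AWS exactly, a quick local check can catch copy-and-paste slips such as stray whitespace or capital letters. The sketch below covers only the main naming rules for general purpose S3 buckets (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit; no adjacent dots; not formatted like an IP address). It does not cover every AWS rule, and the function name is hypothetical.

```python
import re

# 3 to 63 characters, lowercase letters, digits, dots and hyphens,
# beginning and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
# Names formatted like an IPv4 address (for example 192.168.1.1) are not allowed.
_IP_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if name passes the basic S3 bucket naming checks above."""
    if _BUCKET_RE.fullmatch(name) is None:
        return False  # wrong length, bad characters, or surrounding whitespace
    if ".." in name:
        return False  # adjacent dots are not allowed
    if _IP_RE.fullmatch(name) is not None:
        return False
    return True
```

A name that passes this check can still be wrong, for example if it refers to a different bucket, so treat it as a typo filter rather than proof that the connector will work.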
Finalize Pipeline Creation
After configuring the S3 integration, you should see it listed under Source Connectors.
Click Next: Schedule Pipeline to continue.
Review the schedule type and frequency for the pipeline.
For this quickstart, leave the default values for the pipeline schedule.
Click Create RAG Pipeline.
Monitor Pipeline Creation and Backfilling
After clicking Create RAG Pipeline, you will see the pipeline creation progress.
The stages include:
Creating pipeline
Deploying pipeline
Starting backfilling process
Once the pipeline is created and deployed, it will begin the backfilling process.
You can monitor the pipeline status and view the progress of document ingestion and vector creation.
If your S3 bucket is empty, the pipeline will show 0 Documents, 0 Chunks, and 0 Vectors.
Step 3: Upload Files to Your S3 Bucket
Prepare Sample Data
Download the friends-scripts.zip file from the following location:
This archive contains text files of scripts from the TV show Friends, which we can use as a sample data set.
After downloading the friends-scripts.zip file, extract it to a location on your local machine.
On most operating systems, you can do this by right-clicking the zip file and selecting Extract All or Unzip.
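If you prefer to extract the archive from a script, Python's standard zipfile module can do it. This is a minimal sketch; the function name is hypothetical, and the destination directory is whatever you choose.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path: str, dest_dir: str) -> list[Path]:
    """Extract zip_path into dest_dir and return the extracted files, sorted."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target directory if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Return only regular files, including any inside subdirectories.
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

For example, extract_archive("friends-scripts.zip", "friends-scripts") unpacks the download into a friends-scripts directory next to it, and the length of the returned list tells you how many files to expect after the upload.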
Upload Files to S3
Log in to your AWS account, open the S3 console, and navigate to the Buckets section.
Filter to find your bucket by typing its name in the search bar.
Click on your bucket name to open the detailed bucket view.
Click on the Upload button in the top right corner of the bucket's detail view.
You can either drag and drop the extracted files from the friends-scripts directory into the upload area, or click on Add files to browse your local machine and select them manually.
After adding the files, you should see them listed under the Files and folders section of the upload screen.
Once you've confirmed that all the files are listed, click on the Upload button at the bottom of the screen to start the upload process.
Your files will now be uploaded to your S3 bucket.
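The console upload above can also be scripted. The sketch below assumes you pass in an S3 client object that has an upload_file(filename, bucket, key) method, which is the call that a boto3 S3 client provides; boto3 itself is not imported here, and the function name and the prefix parameter are hypothetical.

```python
from pathlib import Path


def upload_directory(s3_client, directory: str, bucket: str, prefix: str = "") -> list[str]:
    """Upload every file under directory and return the object keys used.

    s3_client is assumed to expose upload_file(filename, bucket, key),
    as a boto3 S3 client does.
    """
    root = Path(directory)
    keys = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # skip directories
        # Use the path relative to the root as the object key, with "/" separators.
        key = prefix + path.relative_to(root).as_posix()
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys
```

With boto3 installed and AWS credentials configured, a call might look like upload_directory(boto3.client("s3"), "friends-scripts", "<your-bucket-name>"), where the bucket name is the one you entered in the source connector.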
Verify Pipeline Processing
Within a few seconds after the upload is complete, you should see the content of your files start to populate in the RAG pipeline.
The backfilling process will show progress as it reads and processes the documents from your S3 bucket.
Total Documents and Total Chunks will increase as the documents are embedded and processed.
You can track the number of documents being embedded and vectors being written.
After a minute or two of processing, you should see the total number of uploaded documents reflected in the pipeline's statistics.
If you used the Friends Scripts documents as recommended, you will see 228 documents displayed in the Total Documents field.
Step 4: Play with Your Data in the RAG Sandbox
Access the RAG Sandbox
From the main pipeline overview, click on the RAG Pipelines menu item to view your active pipelines.
Find your pipeline in the list of pipelines.
Click on the magnifying glass icon under the RAG Sandbox column to open the sandbox for your selected pipeline.
Query Your Data
In the sandbox, you can ask questions about the data you've ingested.
Type a question related to your dataset in the Question field. For example, "What characteristics define the relationship between Ross and Monica?" if you're working with the Friends TV show scripts.
Click Submit to send the question.
Review Results
After submitting your question, the sandbox will retrieve relevant chunks from your vector database and display them in the Retrieved Context section.
The response from the large language model (LLM) will be displayed in the LLM Response section.
The Retrieved Context section shows the chunks that were matched with your question.
The LLM Response section provides the final output based on the retrieved chunks.
You can continue to ask different questions or refine your queries to explore your dataset further.
The sandbox allows for dynamic interactions with the data stored in your vector database.
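The retrieve-then-answer flow described above can be sketched in a few lines. This is a conceptual illustration, not the sandbox's implementation: the guide does not say which similarity metric, how many chunks, or what prompt the sandbox uses. Cosine similarity, the top_k value, the prompt wording, and all function names here are assumptions, and the vectors are toy values rather than real OpenAI embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector: list[float],
             chunks: list[tuple[str, list[float]]],
             top_k: int = 3) -> list[str]:
    """Return the texts of the top_k chunks most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the retrieved chunks and the question into one LLM prompt."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```

In these terms, the Retrieved Context section corresponds to the output of retrieve, and the LLM Response section corresponds to what a language model returns for the output of build_prompt.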
That's it! You're now able to explore your data using the RAG Sandbox.