Make Your AI Smarter with Metadata
In this guide, you'll learn how to enhance your RAG pipeline with metadata to create more precise, context-aware AI applications. By adding metadata to your documents, you can filter search results, improve relevance, and build smarter retrieval workflows.
Prerequisites
Before you begin, you'll need:
- A Vectorize account
- An API access token (how to create one)
- Your organization ID (see below)
Finding your Organization ID
Your organization ID is in the Vectorize platform URL:
https://platform.vectorize.io/organization/[YOUR-ORG-ID]
For example, if your URL is:
https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
Your organization ID is: ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
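If you're scripting your setup, you can pull the ID straight out of the platform URL rather than copying it by hand. A minimal sketch (the URL is the example above; the helper name is illustrative):

```python
from urllib.parse import urlparse

def org_id_from_url(url: str) -> str:
    """Extract the organization ID from a Vectorize platform URL."""
    # The ID is the path segment that follows "organization"
    parts = urlparse(url).path.strip("/").split("/")
    return parts[parts.index("organization") + 1]

url = "https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb"
print(org_id_from_url(url))  # ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
```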
API Client Setup
- Python
- Node.js
import vectorize_client as v
import os

# Get credentials from environment variables
organization_id = os.environ.get("VECTORIZE_ORGANIZATION_ID")
api_key = os.environ.get("VECTORIZE_API_KEY")
if not organization_id or not api_key:
    raise ValueError("Please set VECTORIZE_ORGANIZATION_ID and VECTORIZE_API_KEY environment variables")

# Initialize the API client
configuration = v.Configuration(
    host="https://api.vectorize.io",
    api_key={"ApiKeyAuth": api_key}
)
api = v.ApiClient(configuration)
print(f"✅ API client initialized for organization: {organization_id}")
const v = require('@vectorize-io/vectorize-client');

// Get credentials from environment variables
const organizationId = process.env.VECTORIZE_ORGANIZATION_ID;
const apiKey = process.env.VECTORIZE_API_KEY;
if (!organizationId || !apiKey) {
    throw new Error("Please set VECTORIZE_ORGANIZATION_ID and VECTORIZE_API_KEY environment variables");
}

// Initialize the API client
const configuration = new v.Configuration({
    basePath: 'https://api.vectorize.io',
    accessToken: apiKey
});
const apiClient = new v.ApiClient(configuration);
console.log(`✅ API client initialized for organization: ${organizationId}`);
What You'll Build
You'll create a RAG pipeline that processes documents with rich metadata, enabling you to:
- Filter searches by department, document type, or any custom field
- Build context-aware retrieval that understands document relationships
- Create role-based or category-specific search experiences
Understanding Metadata in RAG
Metadata is additional information about your documents that helps retrieval systems understand context. Think of it like labels on file folders — they help you find exactly what you need without opening every folder.
For example, a technical document might have metadata like:
- Department: Engineering
- Document Type: Requirements
- Status: Approved
- Last Updated: 2025-01-15
With metadata, your queries can answer things like:
- "What are the engineering requirements?" (filters by department)
- "Show me approved marketing strategies" (filters by status and department)
Step 1: Create a File Upload Connector
First, create a connector to upload your documents:
# Create the connectors API client
connectors_api = v.SourceConnectorsApi(api)

try:
    # Create a file upload connector
    file_upload = v.FileUpload(
        name="metadata-enhanced-documents",
        type="FILE_UPLOAD",
        config={}
    )
    request = v.CreateSourceConnectorRequest(file_upload)
    response = connectors_api.create_source_connector(
        organization_id,
        request
    )
    source_connector_id = response.connector.id
    print(f"✅ Created file upload connector: {source_connector_id}")
except Exception as e:
    print(f"❌ Error creating connector: {e}")
    raise
Step 2: Upload Documents with Metadata
Now upload documents while attaching metadata as a JSON string:
import json
import urllib3

# Create uploads API client
uploads_api = v.UploadsApi(api)
http = urllib3.PoolManager()

# Example: Upload a document with its metadata
# Download sample files from: /files/metadata-sample-docs.zip
file_path = "product_requirements.txt"
metadata = {
    "department": "engineering",
    "document_type": "requirements",
    "project": "ai-search",
    "status": "draft",
    "created_date": "2024-01-15",
    "priority": "high"
}

try:
    # Convert metadata to JSON string
    metadata_json = json.dumps(metadata)

    # Step 1: Get upload URL with metadata
    upload_request = v.StartFileUploadToConnectorRequest(
        name=file_path,
        content_type="text/plain",
        metadata=metadata_json  # Metadata as JSON string
    )
    start_response = uploads_api.start_file_upload_to_connector(
        organization_id,
        source_connector_id,
        start_file_upload_to_connector_request=upload_request
    )

    # Step 2: Upload the file to the signed URL
    with open(file_path, "rb") as f:
        response = http.request(
            "PUT",
            start_response.upload_url,
            body=f,
            headers={
                "Content-Type": "text/plain",
                "Content-Length": str(os.path.getsize(file_path))
            }
        )
    if response.status == 200:
        print(f"✅ Uploaded: {file_path}")
        print(f"   Metadata: {list(metadata.keys())}")
except Exception as e:
    print(f"❌ Error uploading {file_path}: {e}")

# Repeat for other documents with different metadata
# See the sample files for more examples
How Metadata Works
When you upload a document with metadata:
- The metadata is stored as a JSON string during upload.
- Vectorize preserves this metadata with each chunk of your document.
- You can filter searches using these metadata fields.
Note: Metadata values should use consistent types across documents (e.g., "year": "2024" as a string everywhere).
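Type drift is easy to introduce when metadata is assembled by different scripts. A quick pre-upload check can catch it; this helper is a sketch of ours, not part of the SDK:

```python
def check_metadata_types(metadata_list):
    """Report fields whose value type differs across documents."""
    seen = {}            # field name -> first type observed
    mismatches = set()
    for metadata in metadata_list:
        for key, value in metadata.items():
            if key in seen and type(value) is not seen[key]:
                mismatches.add(key)
            seen.setdefault(key, type(value))
    return sorted(mismatches)

docs_metadata = [
    {"department": "engineering", "year": "2024"},  # year as string
    {"department": "marketing", "year": 2024},      # year as int -- flagged
]
print(check_metadata_types(docs_metadata))  # ['year']
```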
Step 3: Create Your Pipeline
Create a pipeline just like in the first guide. No special configuration is required for user-defined metadata — Vectorize automatically handles it:
# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Create pipeline - metadata handling is automatic
    pipeline_config = v.PipelineConfigurationSchema(
        pipeline_name="Metadata-Enhanced RAG Pipeline",
        source_connectors=[
            v.PipelineSourceConnectorSchema(
                id=source_connector_id,
                type="FILE_UPLOAD",
                config={}  # No special config needed for metadata
            )
        ],
        ai_platform_connector=v.PipelineAIPlatformConnectorSchema(
            id=ai_platform_connector_id,
            type="VECTORIZE",
            config={}
        ),
        destination_connector=v.PipelineDestinationConnectorSchema(
            id=destination_connector_id,
            type="VECTORIZE",
            config={}
        ),
        schedule=v.ScheduleSchema(type="manual")
    )

    # Create the pipeline
    response = pipelines_api.create_pipeline(
        organization_id,
        pipeline_config
    )
    pipeline_id = response.data.id
    print(f"✅ Created pipeline: {pipeline_id}")
    print("   Documents with metadata will be automatically processed")
    print("   Metadata will be preserved and searchable")
except Exception as e:
    print(f"❌ Error creating pipeline: {e}")
    raise
For automatic metadata extraction, refer to Automatic Metadata Extraction.
Step 4: Wait for Processing
Monitor your pipeline until it's ready:
import time

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

print("Waiting for metadata extraction and indexing...")
max_wait_time = 300
start_time = time.time()

while True:
    try:
        pipeline = pipelines_api.get_pipeline(organization_id, pipeline_id)
        status = pipeline.data.status
        if status == "LISTENING":
            print("✅ Pipeline ready with metadata indexes!")
            break
        elif status in ["ERROR_DEPLOYING", "SHUTDOWN"]:
            print(f"❌ Pipeline error: {status}")
            break
        if time.time() - start_time > max_wait_time:
            print("⏰ Timeout waiting for pipeline")
            break
        time.sleep(10)
    except Exception as e:
        print(f"❌ Error checking status: {e}")
        break
Step 5: Query Without Metadata Filters
First, query without any filters to see baseline behavior:
# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Query without any metadata filters
    response = pipelines_api.retrieve_documents(
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(
            question="What are the technical requirements for the AI search?",
            num_results=5
        )
    )

    # Display results
    print("Query: 'What are the technical requirements for the AI search?'")
    print("Results without filtering (searches all documents):\n")
    for i, doc in enumerate(response.documents, 1):
        print(f"Result {i}:")
        print(f"  Content: {doc.text[:150]}...")
        print(f"  Relevance Score: {doc.relevancy}")
        print(f"  Document ID: {doc.id}")
        # Show metadata if available
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"  Department: {doc.metadata.get('department', 'N/A')}")
            print(f"  Document Type: {doc.metadata.get('document_type', 'N/A')}")
        print()
except Exception as e:
    print(f"❌ Error querying pipeline: {e}")
    raise
Without filters, your search might return:
- Marketing documents when you wanted technical specs.
- Draft content mixed with approved versions.
- Results from all departments.
Step 6: Query With Metadata Filters
Now query using metadata filters for precise results:
# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Query with metadata filters for engineering documents only
    response = pipelines_api.retrieve_documents(
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(
            question="What are the technical requirements for the AI search?",
            num_results=5,
            metadata_filters=[
                {
                    "metadata.department": ["engineering"]
                },
                {
                    "metadata.document_type": ["requirements", "architecture"]
                }
            ]
        )
    )

    # Display filtered results
    print("Query: 'What are the technical requirements for the AI search?'")
    print("Filters: department=engineering AND type IN (requirements, architecture)")
    print("Results (engineering docs only):\n")
    for i, doc in enumerate(response.documents, 1):
        print(f"Result {i}:")
        print(f"  Content: {doc.text[:150]}...")
        print(f"  Relevance Score: {doc.relevancy}")
        print(f"  Document ID: {doc.id}")
        # Show metadata to confirm filtering worked
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"  Department: {doc.metadata.get('department', 'N/A')}")
            print(f"  Document Type: {doc.metadata.get('document_type', 'N/A')}")
            print(f"  Project: {doc.metadata.get('project', 'N/A')}")
        print()
except Exception as e:
    print(f"❌ Error querying pipeline: {e}")
    raise
Metadata Filter Syntax
- Use the metadata. prefix for user-defined fields.
- Provide values as arrays (even for single values).
- Multiple values for the same key use OR logic.
- Different keys use AND logic.
Example filter structure:
[
{ "metadata.department": ["engineering", "product"] },
{ "metadata.status": ["approved"] }
]
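To make the AND/OR semantics concrete, here is what that example filter means when evaluated against a single chunk's metadata, written out in plain Python. This mirrors the documented behavior for illustration only; the actual matching happens server-side:

```python
def matches(chunk_metadata, filters):
    """filters: a list of {"metadata.<field>": [allowed values]} dicts.
    Values within one key are ORed; separate filter entries are ANDed."""
    for clause in filters:
        for key, allowed in clause.items():
            field = key.removeprefix("metadata.")
            # OR within a key: the chunk's value must be one of the allowed values
            if chunk_metadata.get(field) not in allowed:
                return False  # AND across keys: any failed clause rejects the chunk
    return True

filters = [
    {"metadata.department": ["engineering", "product"]},
    {"metadata.status": ["approved"]},
]
print(matches({"department": "engineering", "status": "approved"}, filters))  # True
print(matches({"department": "engineering", "status": "draft"}, filters))     # False
```

So a chunk passes this filter if its department is engineering OR product, AND its status is approved.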