Make Your AI Smarter with Metadata
In this guide, you'll learn how to enhance your retrieval pipeline with metadata to create more precise, context-aware results. By adding metadata to your documents, you can filter search results, improve relevance, and give your connected LLM or application richer context.
Prerequisites
Before you begin, you'll need:
- A Vectorize account
- An API access token (how to create one)
- Your organization ID (see below)
Finding your Organization ID
Your organization ID is in the Vectorize platform URL:
https://platform.vectorize.io/organization/[YOUR-ORG-ID]
For example, if your URL is:
https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
Your organization ID is: ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
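If you're scripting against the API, you can also pull the ID out of that URL instead of copying it by hand. A minimal Python sketch (the helper name is ours, not part of the SDK):

from urllib.parse import urlparse

def org_id_from_url(url: str) -> str:
    """Extract the organization ID from a Vectorize platform URL."""
    # The ID is the path segment immediately after "organization"
    parts = urlparse(url).path.strip("/").split("/")
    return parts[parts.index("organization") + 1]

print(org_id_from_url(
    "https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb"
))
# -> ecf3fa1d-30d0-4df1-8af6-f4852bc851cb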
API Client Setup
- Python
- Node.js
import vectorize_client as v
import os

# Get credentials from environment variables
organization_id = os.environ.get("VECTORIZE_ORGANIZATION_ID")
api_key = os.environ.get("VECTORIZE_API_KEY")

if not organization_id or not api_key:
    raise ValueError("Please set VECTORIZE_ORGANIZATION_ID and VECTORIZE_API_KEY environment variables")

# Initialize the API client
configuration = v.Configuration(
    host="https://api.vectorize.io",
    api_key={"ApiKeyAuth": api_key}
)
api = v.ApiClient(configuration)
print(f"✅ API client initialized for organization: {organization_id}")
const v = require('@vectorize-io/vectorize-client');

// Get credentials from environment variables
const organizationId = process.env.VECTORIZE_ORGANIZATION_ID;
const apiKey = process.env.VECTORIZE_API_KEY;

if (!organizationId || !apiKey) {
    throw new Error("Please set VECTORIZE_ORGANIZATION_ID and VECTORIZE_API_KEY environment variables");
}

// Initialize the API client
const configuration = new v.Configuration({
    basePath: 'https://api.vectorize.io',
    accessToken: apiKey
});
const apiClient = new v.ApiClient(configuration);
console.log(`✅ API client initialized for organization: ${organizationId}`);
What You'll Build
You'll create a RAG pipeline that processes documents with rich metadata, enabling you to:
- Filter searches by project, document type, or any custom field
- Build a context-aware retrieval pipeline that understands document relationships
- Create project-based or category-specific search experiences
Understanding Metadata in RAG
Metadata is additional information about your documents that helps retrieval systems understand context. Think of it like labels on file folders - they help you find exactly what you need without opening every folder.
For example, a technical document might have metadata like:
- Project: apollo
- Document Type: requirements
- Status: approved
- Year: 2024
- Priority: high
With metadata, your queries can answer things like:
- "What are the apollo project requirements?" (filters by project)
- "Show me approved documents from 2024" (filters by status and year)
Step 1: Create a File Upload Connector
First, create a connector to upload your documents:
- Python
- Node.js
import vectorize_client as v

# Create the connectors API client
connectors_api = v.SourceConnectorsApi(api)

try:
    # Create a file upload connector
    file_upload = v.FileUpload(
        name="metadata-enhanced-documents",
        type="FILE_UPLOAD",
        config={}
    )
    request = v.CreateSourceConnectorRequest(file_upload)
    response = connectors_api.create_source_connector(
        organization_id,
        request
    )
    source_connector_id = response.connector.id
    print(f"✅ Created file upload connector: {source_connector_id}")
except Exception as e:
    print(f"❌ Error creating connector: {e}")
    raise
// Create the connectors API client
const connectorsApi = new v.SourceConnectorsApi(apiClient);

try {
    // Create a file upload connector
    const fileUpload = new v.FileUpload({
        name: "metadata-enhanced-documents",
        type: "FILE_UPLOAD",
        config: {}
    });
    const request = new v.CreateSourceConnectorRequest({ fileUpload });
    const response = await connectorsApi.createSourceConnector(
        organizationId,
        request
    );
    const sourceConnectorId = response.connector.id;
    console.log(`✅ Created file upload connector: ${sourceConnectorId}`);
} catch (error) {
    console.error(`❌ Error creating connector: ${error}`);
    throw error;
}
Step 2: Upload Documents with Metadata
Now upload documents while attaching metadata as a JSON string:
- Python
- Node.js
import vectorize_client as v
import os
import json
import urllib3

# Create uploads API client
uploads_api = v.UploadsApi(api)
http = urllib3.PoolManager()

# Example: Upload a document with its metadata
# Download sample files from: /files/metadata-sample-docs.zip
file_path = "product_requirements.txt"

metadata = {
    "document_type": "requirements",
    "project": "apollo",
    "year": "2024",
    "status": "approved",
    "created_date": "2024-01-15",
    "priority": "high"
}

try:
    # Convert metadata to JSON string
    metadata_json = json.dumps(metadata)

    # Step 1: Get upload URL with metadata
    upload_request = v.StartFileUploadToConnectorRequest(
        name=file_path,
        content_type="text/plain",
        metadata=metadata_json  # Metadata as JSON string
    )
    start_response = uploads_api.start_file_upload_to_connector(
        organization_id,
        source_connector_id,
        start_file_upload_to_connector_request=upload_request
    )

    # Step 2: Upload the file to the signed URL
    with open(file_path, "rb") as f:
        response = http.request(
            "PUT",
            start_response.upload_url,
            body=f,
            headers={
                "Content-Type": "text/plain",
                "Content-Length": str(os.path.getsize(file_path))
            }
        )
    if response.status == 200:
        print(f"✅ Uploaded: {file_path}")
        print(f"   Metadata: {list(metadata.keys())}")
    else:
        print(f"❌ Upload failed: {response.status}")
except Exception as e:
    print(f"❌ Error uploading {file_path}: {e}")

# Repeat for other documents with different metadata
# See the sample files for more examples
const fetch = require('node-fetch');
const fs = require('fs');

// Create uploads API client
const uploadsApi = new v.UploadsApi(apiClient);

// Example: Upload a document with its metadata
// Download sample files from: /files/metadata-sample-docs.zip

// The file to upload
const filePath = "product_requirements.txt";
const fileName = "product_requirements.txt";

// Metadata for this document
const metadata = {
    document_type: "requirements",
    project: "apollo",
    year: "2024",
    status: "approved",
    created_date: "2024-01-15",
    priority: "high"
};

try {
    // Convert metadata to JSON string
    const metadataJson = JSON.stringify(metadata);

    // Step 1: Get upload URL with metadata
    const uploadRequest = new v.StartFileUploadToConnectorRequest({
        name: fileName,
        contentType: "text/plain",
        metadata: metadataJson // Metadata as JSON string
    });
    const startResponse = await uploadsApi.startFileUploadToConnector(
        organizationId,
        sourceConnectorId,
        { startFileUploadToConnectorRequest: uploadRequest }
    );

    // Step 2: Upload the file to the signed URL
    const fileContent = fs.readFileSync(filePath);
    const uploadResponse = await fetch(startResponse.uploadUrl, {
        method: 'PUT',
        body: fileContent,
        headers: {
            'Content-Type': 'text/plain',
            'Content-Length': String(fs.statSync(filePath).size)
        }
    });
    if (uploadResponse.ok) {
        console.log(`✅ Uploaded: ${fileName}`);
        console.log(`   Metadata: ${Object.keys(metadata).join(', ')}`);
    } else {
        console.log(`❌ Upload failed: ${uploadResponse.status}`);
    }
} catch (error) {
    console.error(`❌ Error uploading ${fileName}: ${error}`);
}

// Repeat for other documents with different metadata
// See the sample files for more examples
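In practice you'll usually upload a batch of documents, each with its own metadata. Here's a sketch of that loop using the Python client objects from above; the file names and metadata values are illustrative:

# Illustrative file names and metadata - substitute your own documents
documents = [
    ("product_requirements.txt", {"project": "apollo", "document_type": "requirements",
                                  "status": "approved", "year": "2024"}),
    ("design_spec.txt", {"project": "mercury", "document_type": "design",
                         "status": "draft", "year": "2025"}),
]

for file_path, metadata in documents:
    # Request a signed upload URL, attaching the metadata as a JSON string
    upload_request = v.StartFileUploadToConnectorRequest(
        name=file_path,
        content_type="text/plain",
        metadata=json.dumps(metadata)
    )
    start_response = uploads_api.start_file_upload_to_connector(
        organization_id,
        source_connector_id,
        start_file_upload_to_connector_request=upload_request
    )
    # PUT the file contents to the signed URL
    with open(file_path, "rb") as f:
        http.request(
            "PUT",
            start_response.upload_url,
            body=f,
            headers={
                "Content-Type": "text/plain",
                "Content-Length": str(os.path.getsize(file_path))
            }
        )
    print(f"✅ Uploaded {file_path} with metadata {list(metadata.keys())}")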
How Metadata Works
Vectorize stores your metadata with each chunk of your document, so filters are applied before retrieval returns results to your model.
When you upload a document with metadata:
- The metadata is stored as a JSON string during upload.
- Vectorize preserves this metadata with each chunk of your document.
- You can filter searches using these metadata fields.
Note: Metadata values should use consistent types across documents (e.g., "year": "2024" as a string everywhere).
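The exact storage format is internal to Vectorize, but conceptually you can picture each indexed chunk as a record that pairs a slice of text with your metadata:

# Illustrative only - not Vectorize's actual storage format
chunk = {
    "text": "The search service must support metadata filtering...",
    "metadata": {
        "project": "apollo",
        "document_type": "requirements",
        "status": "approved",
        "year": "2024"
    }
}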
Metadata Best Practices
- Use consistent field names and types - Always use the same casing and data types
- Keep values simple - Stick to strings and numbers for maximum compatibility
- Plan your schema before uploading - Design your metadata structure upfront
- Include only fields you'll use in queries - Avoid metadata bloat that won't be filtered on
- Document your schema - Keep a reference of allowed fields and values for your team
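One lightweight way to enforce these practices is to validate metadata before every upload. A minimal sketch; the schema and helper below are our own convention, not part of the Vectorize SDK:

# Hypothetical schema - adapt the fields and allowed values to your documents
ALLOWED_VALUES = {
    "project": {"apollo", "mercury"},
    "status": {"draft", "approved", "published"},
    "document_type": {"requirements", "design", "faq"},
}

def validate_metadata(metadata: dict) -> None:
    """Raise if a metadata dict violates the team's schema."""
    for key, value in metadata.items():
        if not isinstance(value, str):
            raise TypeError(f"{key} must be a string, got {type(value).__name__}")
        if key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            raise ValueError(f"{key}={value!r} is not one of {sorted(ALLOWED_VALUES[key])}")

validate_metadata({"project": "apollo", "status": "approved", "year": "2024"})  # passes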
Step 3: Create Your Pipeline
Create a pipeline just like in the first guide. No special configuration is required for user-defined metadata - Vectorize automatically handles it:
- Python
- Node.js
import vectorize_client as v

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Create pipeline - metadata handling is automatic
    pipeline_config = v.PipelineConfigurationSchema(
        pipeline_name="Metadata-Enhanced RAG Pipeline",
        source_connectors=[
            v.PipelineSourceConnectorSchema(
                id=source_connector_id,
                type="FILE_UPLOAD",
                config={}  # No special config needed for metadata
            )
        ],
        ai_platform_connector=v.PipelineAIPlatformConnectorSchema(
            id=ai_platform_connector_id,
            type="VECTORIZE",
            config={}
        ),
        destination_connector=v.PipelineDestinationConnectorSchema(
            id=destination_connector_id,
            type="VECTORIZE",
            config={}
        ),
        schedule=v.ScheduleSchema(type="manual")
    )

    # Create the pipeline
    response = pipelines_api.create_pipeline(
        organization_id,
        pipeline_config
    )
    pipeline_id = response.data.id
    print(f"✅ Created pipeline: {pipeline_id}")
    print("   Documents with metadata will be automatically processed")
    print("   Metadata will be preserved and searchable")
except Exception as e:
    print(f"❌ Error creating pipeline: {e}")
    raise
// Create pipelines API client
const pipelinesApi = new v.PipelinesApi(apiClient);

try {
    // Create pipeline - metadata handling is automatic
    const pipelineConfig = new v.PipelineConfigurationSchema({
        pipelineName: "Metadata-Enhanced RAG Pipeline",
        sourceConnectors: [
            new v.PipelineSourceConnectorSchema({
                id: sourceConnectorId,
                type: "FILE_UPLOAD",
                config: {} // No special config needed for metadata
            })
        ],
        aiPlatformConnector: new v.PipelineAIPlatformConnectorSchema({
            id: aiPlatformConnectorId,
            type: "VECTORIZE",
            config: {}
        }),
        destinationConnector: new v.PipelineDestinationConnectorSchema({
            id: destinationConnectorId,
            type: "VECTORIZE",
            config: {}
        }),
        schedule: new v.ScheduleSchema({ type: "manual" })
    });

    // Create the pipeline
    const response = await pipelinesApi.createPipeline(
        organizationId,
        pipelineConfig
    );
    const pipelineId = response.data.id;
    console.log(`✅ Created pipeline: ${pipelineId}`);
    console.log(`   Documents with metadata will be automatically processed`);
    console.log(`   Metadata will be preserved and searchable`);
} catch (error) {
    console.error(`❌ Error creating pipeline: ${error}`);
    throw error;
}
For automatic metadata extraction, refer to Automatic Metadata Extraction.
Step 4: Wait for Processing
Monitor your pipeline until it's ready:
- Python
- Node.js
import vectorize_client as v
import time

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

print("Waiting for metadata extraction and indexing...")
max_wait_time = 300  # seconds
start_time = time.time()

while True:
    try:
        pipeline = pipelines_api.get_pipeline(organization_id, pipeline_id)
        status = pipeline.data.status
        if status == "LISTENING":
            print("✅ Pipeline ready with metadata indexes!")
            break
        elif status in ["ERROR_DEPLOYING", "SHUTDOWN"]:
            print(f"❌ Pipeline error: {status}")
            break
        if time.time() - start_time > max_wait_time:
            print("⏰ Timeout waiting for pipeline")
            break
        time.sleep(10)
    except Exception as e:
        print(f"❌ Error checking status: {e}")
        break
// Create pipelines API client
const pipelinesApi = new v.PipelinesApi(apiClient);

console.log("Waiting for metadata extraction and indexing...");
const maxWaitTime = 300; // seconds
const startTime = Date.now();

while (true) {
    try {
        const pipeline = await pipelinesApi.getPipeline(organizationId, pipelineId);
        const status = pipeline.data.status;
        if (status === "LISTENING") {
            console.log("✅ Pipeline ready with metadata indexes!");
            break;
        } else if (["ERROR_DEPLOYING", "SHUTDOWN"].includes(status)) {
            console.log(`❌ Pipeline error: ${status}`);
            break;
        }
        if ((Date.now() - startTime) / 1000 > maxWaitTime) {
            console.log("⏰ Timeout waiting for pipeline");
            break;
        }
        await new Promise(resolve => setTimeout(resolve, 10000)); // Wait 10 seconds
    } catch (error) {
        console.error(`❌ Error checking status: ${error}`);
        break;
    }
}
Step 5: Query Without Metadata Filters
First, query without any filters to see baseline behavior:
- Python
- Node.js
import vectorize_client as v

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Query without any metadata filters
    response = pipelines_api.retrieve_documents(
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(
            question="What are the technical requirements for the AI search?",
            num_results=5
        )
    )

    # Display results
    print("Query: 'What are the technical requirements for the AI search?'")
    print("Results without filtering (searches all documents):\n")
    for i, doc in enumerate(response.documents, 1):
        print(f"Result {i}:")
        print(f"  Content: {doc.text[:150]}...")
        print(f"  Relevance Score: {doc.relevancy}")
        print(f"  Document ID: {doc.id}")
        # Show metadata if available
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"  Project: {doc.metadata.get('project', 'N/A')}")
            print(f"  Year: {doc.metadata.get('year', 'N/A')}")
            print(f"  Status: {doc.metadata.get('status', 'N/A')}")
        print()
except Exception as e:
    print(f"❌ Error querying pipeline: {e}")
    raise
// Create pipelines API client
const pipelinesApi = new v.PipelinesApi(apiClient);

try {
    // Query without any metadata filters
    const response = await pipelinesApi.retrieveDocuments(
        organizationId,
        pipelineId,
        new v.RetrieveDocumentsRequest({
            question: "What are the technical requirements for the AI search?",
            numResults: 5
        })
    );

    // Display results
    console.log("Query: 'What are the technical requirements for the AI search?'");
    console.log("Results without filtering (searches all documents):\n");
    response.documents.forEach((doc, i) => {
        console.log(`Result ${i + 1}:`);
        console.log(`  Content: ${doc.text.substring(0, 150)}...`);
        console.log(`  Relevance Score: ${doc.relevancy}`);
        console.log(`  Document ID: ${doc.id}`);
        // Show metadata if available
        if (doc.metadata) {
            console.log(`  Project: ${doc.metadata.project || 'N/A'}`);
            console.log(`  Year: ${doc.metadata.year || 'N/A'}`);
            console.log(`  Status: ${doc.metadata.status || 'N/A'}`);
        }
        console.log();
    });
} catch (error) {
    console.error(`❌ Error querying pipeline: ${error}`);
    throw error;
}
Without filters, your search might return:
- Marketing documents when you wanted technical specs.
- Draft content mixed with approved versions.
- Results from all departments.
Step 6: Query With Metadata Filters
Now query using metadata filters for precise results:
- Python
- Node.js
import vectorize_client as v

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Query with project and date-based filters
    response = pipelines_api.retrieve_documents(
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(
            question="What are the technical requirements for the AI search?",
            num_results=5,
            metadata_filters=[
                {
                    "metadata.project": ["apollo", "mercury"]  # Project tags
                },
                {
                    "metadata.year": ["2024", "2025"]  # Target years
                },
                {
                    "metadata.status": ["approved", "published"]  # Only finalized docs
                }
            ]
        )
    )

    # Display filtered results
    print("Query: 'What are the technical requirements for the AI search?'")
    print("Filters: project IN (apollo, mercury) AND year IN (2024, 2025) AND status IN (approved, published)")
    print("Results (recent approved docs from specific projects):\n")
    for i, doc in enumerate(response.documents, 1):
        print(f"Result {i}:")
        print(f"  Content: {doc.text[:150]}...")
        print(f"  Relevance Score: {doc.relevancy}")
        print(f"  Document ID: {doc.id}")
        # Show metadata to confirm filtering worked
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"  Project: {doc.metadata.get('project', 'N/A')}")
            print(f"  Year: {doc.metadata.get('year', 'N/A')}")
            print(f"  Status: {doc.metadata.get('status', 'N/A')}")
            print(f"  Document Type: {doc.metadata.get('document_type', 'N/A')}")
        print()
except Exception as e:
    print(f"❌ Error querying pipeline: {e}")
    raise
// Create pipelines API client
const pipelinesApi = new v.PipelinesApi(apiClient);

try {
    // Query with project and date-based filters
    const response = await pipelinesApi.retrieveDocuments(
        organizationId,
        pipelineId,
        new v.RetrieveDocumentsRequest({
            question: "What are the technical requirements for the AI search?",
            numResults: 5,
            metadataFilters: [
                {
                    "metadata.project": ["apollo", "mercury"] // Project tags
                },
                {
                    "metadata.year": ["2024", "2025"] // Target years
                },
                {
                    "metadata.status": ["approved", "published"] // Only finalized docs
                }
            ]
        })
    );

    // Display filtered results
    console.log("Query: 'What are the technical requirements for the AI search?'");
    console.log("Filters: project IN (apollo, mercury) AND year IN (2024, 2025) AND status IN (approved, published)");
    console.log("Results (recent approved docs from specific projects):\n");
    response.documents.forEach((doc, i) => {
        console.log(`Result ${i + 1}:`);
        console.log(`  Content: ${doc.text.substring(0, 150)}...`);
        console.log(`  Relevance Score: ${doc.relevancy}`);
        console.log(`  Document ID: ${doc.id}`);
        // Show metadata to confirm filtering worked
        if (doc.metadata) {
            console.log(`  Project: ${doc.metadata.project || 'N/A'}`);
            console.log(`  Year: ${doc.metadata.year || 'N/A'}`);
            console.log(`  Status: ${doc.metadata.status || 'N/A'}`);
            console.log(`  Document Type: ${doc.metadata.document_type || 'N/A'}`);
        }
        console.log();
    });
} catch (error) {
    console.error(`❌ Error querying pipeline: ${error}`);
    throw error;
}
Metadata Filter Syntax
- Use the metadata. prefix for user-defined fields.
- Provide values as arrays (even for single values).
- Multiple values for the same key use OR logic.
- Different keys use AND logic.
Example filter structure:
[
    { "metadata.project": ["apollo", "mercury"] },
    { "metadata.status": ["approved"] }
]
This filter returns documents from either apollo OR mercury projects that are also approved.
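Because each dictionary in the list is ANDed while each value array is ORed, you can generate filters from a plain mapping. A small helper sketch (the function is ours, not part of the SDK):

def build_metadata_filters(criteria: dict) -> list:
    """Turn {"project": ["apollo"], "status": "approved"} into the filter
    list Vectorize expects: one dict per key (AND), values as arrays (OR)."""
    return [
        {f"metadata.{key}": values if isinstance(values, list) else [values]}
        for key, values in criteria.items()
    ]

filters = build_metadata_filters({
    "project": ["apollo", "mercury"],  # apollo OR mercury
    "status": "approved",              # single value gets wrapped in a list
})
# -> [{"metadata.project": ["apollo", "mercury"]}, {"metadata.status": ["approved"]}]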
Step 7: Compare the Impact of Metadata
Let's see the dramatic difference metadata filtering makes. Here's what you might see:
Without Metadata Filters
Query: "What are our API rate limits?"
Results might include:
- Marketing blog post mentioning API limits
- Draft engineering spec with outdated limits
- Customer FAQ about rate limits
- Internal discussion about changing limits
With Metadata Filters
Query: "What are our API rate limits?"
Filters: project=apollo, status=approved
Results now include only:
- Approved apollo project documentation with current rate limits
- Official API specification documents from the apollo project
The filtered results are more accurate, authoritative, and relevant to your needs.
Using Visual Schema Editor (Optional)
For pipelines that use the Iris model, Vectorize includes a Visual Schema Editor that can automatically extract metadata based on defined schemas. This is especially useful when you have consistent document structures.
When to Use Automatic Metadata Extraction
- Structured documents: Technical specs, contracts, reports with consistent sections
- Standardized formats: Documents following templates
- Large volumes: When manual metadata tagging isn't practical
Enabling Automatic Extraction
- Navigate to your pipeline in the Vectorize platform
- Click on the Schema tab
- Use the Visual Schema Editor to define extraction rules
- Save and redeploy your pipeline
The schema editor allows you to:
- Define metadata fields to extract
- Set extraction rules based on document structure
- Preview extraction results
- Combine with manual metadata
Best Practices for Metadata
1. Keep It Consistent
# Good: Consistent types and values
metadata = {
    "project": "apollo",    # Always lowercase
    "year": "2024",         # Always a string
    "status": "approved"    # Consistent values
}

# Bad: Inconsistent types and formats
metadata = {
    "project": "Apollo",    # Sometimes capitalized (BAD!)
    "year": 2024,           # Sometimes a number
    "status": "Approved"    # Inconsistent casing
}
2. Plan Your Schema
Before uploading documents, decide on:
- Essential metadata fields (3-7 is usually optimal)
- Allowed values for each field
- Naming conventions (use lowercase with underscores)
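One way to document that schema is to keep it in code where every upload script can import it. An illustrative sketch (the fields and values are examples, not requirements):

# Hypothetical team-wide schema reference - keep it next to your upload scripts
METADATA_SCHEMA = {
    "project": ["apollo", "mercury"],                   # lowercase codenames
    "document_type": ["requirements", "design", "faq"],
    "status": ["draft", "approved", "published"],
    "year": ["2024", "2025"],                           # four-digit years, always strings
}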
3. Use Metadata for Business Logic
# Filter for recent, approved documents in selected projects
filters = [
    {"metadata.project": ["apollo", "mercury"]},
    {"metadata.year": ["2024", "2025"]},
    {"metadata.status": ["approved", "published"]}
]
What's Next?
You've now built a metadata-enhanced RAG pipeline that can:
- Process documents with rich context
- Filter results based on business needs
- Provide more accurate, relevant answers
Next Steps
- For simple use cases: You're ready to deploy! Start uploading your documents with metadata.
- For complex scenarios: Explore automatic metadata extraction for large document sets.
Quick Tips
- Start with 3-5 metadata fields and expand as needed
- Test your metadata filters with diverse queries
- Monitor which metadata fields provide the most value
- Consider combining manual and automatic metadata extraction
Congratulations! You've learned how to make your AI significantly smarter with metadata. Your RAG pipeline can now provide contextual, filtered responses that match your specific business needs.