Make Your AI Smarter with Metadata
In this guide, you'll learn how to enhance your RAG pipeline with metadata to create more precise, context-aware AI applications. By adding metadata to your documents, you can filter search results, improve relevance, and build smarter AI experiences.
What You'll Build
You'll create a RAG pipeline that processes documents with rich metadata, enabling you to:
- Filter searches by department, document type, or any custom field
- Build context-aware AI that understands document relationships
- Create role-based or category-specific search experiences
Prerequisites
Before starting, make sure you have:
- Completed Build Your First RAG App
- A Vectorize account with API access
- Python 3.8+ or Node.js 16+ installed
- The Vectorize Python client (`pip install vectorize-client`) or JavaScript client (`npm install @vectorize/client`)
Understanding Metadata in RAG
Metadata is additional information about your documents that helps your AI understand context. Think of it like labels on file folders - they help you find exactly what you need without opening every folder.
For example, a technical document might have metadata like:
- Department: Engineering
- Document Type: Requirements
- Status: Approved
- Last Updated: 2024-01-15
With metadata, your AI can answer questions like:
- "What are the engineering requirements?" (filters to engineering docs only)
- "Show me approved marketing strategies" (filters by status AND department)
- "Find recent product updates" (filters by date and type)
Step 1: Create a File Upload Connector
First, create a File Upload connector for your documents, just as in Level 1. A minimal sketch follows; for the exact calls, refer back to the Level 1 guide.
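The sketch below shows the general shape in Python. The class, method, and response names here (SourceConnectorsApi, create_source_connector, response.connectors) are assumptions modeled on the pipeline client's naming pattern, not confirmed API — mirror the working code from Level 1.

```python
import vectorize_client as v

# Assumed API surface -- use the exact calls from the Level 1 guide
connectors_api = v.SourceConnectorsApi(api)  # assumed class name

try:
    response = connectors_api.create_source_connector(  # assumed method
        organization_id,
        [v.CreateSourceConnector(name="metadata-demo-upload", type="FILE_UPLOAD")]
    )
    source_connector_id = response.connectors[0].id  # assumed response shape
    print(f"Connector created! ID: {source_connector_id}")
except Exception as e:
    print(f"Error creating connector: {e}")
    raise
```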
Step 2: Upload Documents with Metadata
Now comes the key difference - when uploading documents, you'll attach metadata as a JSON string. This metadata will be stored alongside your document chunks in the vector database.
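A minimal Python sketch of an upload with metadata attached as a JSON string. The upload API surface shown here (UploadsApi, start_file_upload, and the metadata field on the request) is an assumption — take the working upload code from Level 1 and add the JSON-encoded metadata to it.

```python
import json

metadata = {
    "department": "engineering",
    "document_type": "requirements",
    "status": "approved",
    "last_updated": "2024-01-15"
}

uploads_api = v.UploadsApi(api)  # assumed class name

try:
    # start_file_upload and its metadata parameter are assumptions
    upload = uploads_api.start_file_upload(
        organization_id,
        source_connector_id,
        v.StartFileUploadRequest(
            name="requirements.pdf",
            content_type="application/pdf",
            # Metadata must be a JSON *string*, not a dict
            metadata=json.dumps(metadata)
        )
    )
    # Then PUT the file bytes to the returned upload URL, as in Level 1
except Exception as e:
    print(f"Error starting upload: {e}")
    raise
```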
How Metadata Works
When you upload a document with metadata:
- The metadata is stored as a JSON string during upload
- Vectorize preserves this metadata with each chunk of your document
- You can later filter searches using these metadata fields
Important: Metadata values should have consistent types across documents. For example, if "year" is a string in one document, it should be a string in all documents.
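For instance, building metadata as a dict and serializing it with json.dumps makes the string form explicit and keeps types consistent:

```python
import json

# "year" is a string in every document -- consistent types across uploads
doc_a_metadata = json.dumps({"department": "engineering", "year": "2024"})
doc_b_metadata = json.dumps({"department": "marketing", "year": "2023"})
```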
Step 3: Create Your Pipeline
Create a pipeline just like in Level 1. No special configuration is needed - Vectorize automatically handles metadata from your uploaded files:
Python:

```python
# Create pipelines client
pipelines_api = v.PipelinesApi(api)

# Define your pipeline configuration
pipeline_configuration = v.PipelineConfigurationSchema(
    pipeline_name=pipeline_name,
    source_connectors=[
        v.PipelineSourceConnectorSchema(
            id=source_connector_id,
            type="FILE_UPLOAD",
            config={}
        )
    ],
    ai_platform_connector=v.PipelineAIPlatformConnectorSchema(
        id=ai_platform_connector_id,
        type="VECTORIZE",
        config={}
    ),
    destination_connector=v.PipelineDestinationConnectorSchema(
        id=destination_connector_id,
        type="VECTORIZE",
        config={}
    ),
    schedule=v.ScheduleSchema(type="manual")
)

# Create the pipeline
try:
    response = pipelines_api.create_pipeline(
        organization_id,
        pipeline_configuration
    )
    pipeline_id = response.data.id
    print(f"Pipeline created successfully! ID: {pipeline_id}")
except Exception as e:
    print(f"Error creating pipeline: {e}")
    raise
```

Node.js:

```javascript
const { PipelinesApi } = vectorize;

// Create pipelines client
const pipelinesApi = new PipelinesApi(apiConfig);

// Define your pipeline configuration
const pipelineConfiguration = {
  pipelineName: pipelineName,
  sourceConnectors: [
    {
      id: sourceConnectorId,
      type: "FILE_UPLOAD",
      config: {}
    }
  ],
  destinationConnector: {
    id: destinationConnectorId,
    type: "VECTORIZE",
    config: {}
  },
  aiPlatformConnector: {
    id: aiPlatformConnectorId,
    type: "VECTORIZE",
    config: {}
  },
  schedule: { type: "manual" }
};

// Create the pipeline
let pipelineId;
try {
  const response = await pipelinesApi.createPipeline({
    organizationId: organizationId,
    pipelineConfigurationSchema: pipelineConfiguration
  });
  pipelineId = response.data.id;
  console.log(`Pipeline created successfully! ID: ${pipelineId}`);
} catch (error) {
  console.log(`Error creating pipeline: ${error.message}`);
  throw error;
}
```
Step 4: Wait for Processing
Monitor your pipeline until it's ready:
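A minimal polling sketch in Python. The get_pipeline call, the response shape, and the "ready" status values shown are assumptions — check the client docs for the actual status enum.

```python
import time

# Poll until the pipeline has finished processing (assumed API surface)
while True:
    pipeline = pipelines_api.get_pipeline(organization_id, pipeline_id)  # assumed call
    status = pipeline.data.status  # assumed response shape
    print(f"Pipeline status: {status}")
    if status in ("LISTENING", "IDLE"):  # assumed "ready" states
        break
    time.sleep(5)
```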
Step 5: Query Without Metadata Filters
First, let's see what happens when you search without any filters. This searches across all documents regardless of their metadata:
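A minimal Python sketch of an unfiltered query. The RetrievalApi class, retrieve_documents method, and request/response shapes are assumptions modeled on the client's naming pattern.

```python
retrieval_api = v.RetrievalApi(api)  # assumed class name

try:
    response = retrieval_api.retrieve_documents(  # assumed method
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(  # assumed request schema
            question="What are the current requirements?",
            num_results=5
        )
    )
    for doc in response.documents:  # assumed response shape
        print(doc.text[:100], doc.metadata)
except Exception as e:
    print(f"Error retrieving documents: {e}")
    raise
```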
Without filters, your search might return:
- Marketing strategies when you wanted technical specs
- Draft documents mixed with approved ones
- Results from all departments and time periods
Step 6: Query With Metadata Filters
Now let's use metadata filters to get precise results. The retrieval endpoint supports exact-match filtering combined with AND/OR logic:
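A minimal Python sketch of the same query with filters applied. The filter shape follows the syntax documented below; the metadata_filters parameter name on the request schema is an assumption.

```python
try:
    response = retrieval_api.retrieve_documents(
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(
            question="What are the engineering requirements?",
            num_results=5,
            # Filter shape from the "Metadata Filter Syntax" section below;
            # the parameter name on the request schema is an assumption
            metadata_filters=[
                {"metadata.department": ["engineering"]},
                {"metadata.status": ["approved"]}
            ]
        )
    )
    for doc in response.documents:
        print(doc.text[:100], doc.metadata)
except Exception as e:
    print(f"Error retrieving documents: {e}")
    raise
```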
Metadata Filter Syntax
The metadata filtering syntax is straightforward:
- Use the `metadata.` prefix for user-defined metadata fields
- Provide values as arrays (even for single values)
- Multiple values for the same key use OR logic
- Different keys use AND logic
Example filter combinations:
```json
{
  "metadata-filters": [
    { "metadata.department": ["engineering", "product"] },  // engineering OR product
    { "metadata.status": ["approved"] }                     // AND status = approved
  ]
}
```
Step 7: Understanding the Impact
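The clearest way to see the impact is to run the same question with and without filters and compare what comes back. A minimal sketch, reusing the assumed retrieval calls from Steps 5 and 6:

```python
question = "What are the current requirements?"

# Same question, no filters: results come from every department
unfiltered = retrieval_api.retrieve_documents(
    organization_id, pipeline_id,
    v.RetrieveDocumentsRequest(question=question, num_results=5)
)

# Same question, filtered: engineering documents only
filtered = retrieval_api.retrieve_documents(
    organization_id, pipeline_id,
    v.RetrieveDocumentsRequest(
        question=question,
        num_results=5,
        metadata_filters=[{"metadata.department": ["engineering"]}]  # assumed parameter
    )
)

print(f"Unfiltered: {len(unfiltered.documents)} results across all departments")
print(f"Filtered:   {len(filtered.documents)} results from engineering only")
```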
Best Practices
Designing Effective Metadata
- Keep it consistent: Use the same field names and types across all documents
- Think about queries: Design metadata based on how users will search
- Don't overdo it: 5-10 well-chosen fields are better than 50 rarely-used ones
Common Metadata Patterns
Document Classification:
```json
{
  "category": "technical",
  "subcategory": "api-docs",
  "version": "2.1"
}
```

Organizational Structure:

```json
{
  "department": "engineering",
  "team": "backend",
  "project": "search-enhancement"
}
```

Temporal Information:

```json
{
  "created_date": "2024-01-15",
  "quarter": "Q1-2024",
  "fiscal_year": "2024"
}
```

Access Control:

```json
{
  "access_level": "public",
  "audience": "developers",
  "region": "north-america"
}
```
Limitations to Keep in Mind
Vectorize's retrieval endpoint currently supports:
- ✅ Exact match filtering only
- ✅ AND logic between different metadata keys
- ✅ OR logic between values for the same key
- ❌ No range queries (like date > "2024-01-01")
- ❌ No partial matches or wildcards
- ❌ No complex nested queries
If you need advanced filtering, consider using a bring-your-own vector database with native query capabilities.
What's Next?
Now that you've mastered metadata filtering, you're ready to build more sophisticated AI applications:
- Level 3: Build a Schema-Aware AI Agent - Create an agent that understands your data structures
- Understanding Metadata in RAG - Deep dive into all metadata types
- Automatic Metadata Extraction - Let AI extract metadata from your documents
Complete Code Example
Here's a complete example that ties everything together:
```python
import vectorize_client as v
import json
import time

# Initialize the client (see Level 1 for the full setup)
api_client = ...  # your initialized vectorize_client API client
organization_id = "your-org-id"

# Create connectors and pipeline (see the snippets above)
# ...

# Documents to upload, each with its own metadata
documents = [
    {
        "filename": "q4_report.pdf",
        "metadata": {
            "department": "finance",
            "document_type": "report",
            "quarter": "Q4-2023",
            "status": "final"
        }
    },
    {
        "filename": "api_guide.md",
        "metadata": {
            "department": "engineering",
            "document_type": "documentation",
            "version": "2.1",
            "status": "published"
        }
    }
]

# Upload each document with its metadata serialized as a JSON string
for doc in documents:
    metadata_json = json.dumps(doc["metadata"])
    # See the upload snippet in Step 2 for the full implementation
    pass

# Query with filters
filters = [
    {"metadata.department": ["engineering"]},
    {"metadata.status": ["published", "final"]}
]
# This will return only engineering documents that are published or final
```
Troubleshooting
Metadata not appearing in results?
- Ensure metadata is passed as a JSON string during upload
- Check that your pipeline has finished processing
- Verify field names match exactly (case-sensitive)
Filters not working as expected?
- Remember to use the `metadata.` prefix for user-defined fields
- Use exact matches only; wildcards and partial matches aren't supported
- Check your AND/OR logic between filters
Getting too many/few results?
- Review your filter logic (AND between keys, OR within values)
- Consider if your metadata design matches your query patterns
- Test with fewer filters first, then add more
Congratulations! You've learned how to enhance your RAG pipeline with metadata for more intelligent, context-aware AI applications. Continue to Level 3 to build agents that can understand and work with structured data.