Make Your AI Smarter with Metadata

In this guide, you'll learn how to enhance your RAG pipeline with metadata to create more precise, context-aware AI applications. By adding metadata to your documents, you can filter search results, improve relevance, and build smarter AI experiences.

What You'll Build

You'll create a RAG pipeline that processes documents with rich metadata, enabling you to:

  • Filter searches by department, document type, or any custom field
  • Build context-aware AI that understands document relationships
  • Create role-based or category-specific search experiences

Prerequisites

Before starting, make sure you have:

  • Completed Build Your First RAG App
  • A Vectorize account with API access
  • Python 3.8+ or Node.js 16+ installed
  • The Vectorize Python client (pip install vectorize-client) or JavaScript client (npm install @vectorize/client)

Understanding Metadata in RAG

Metadata is additional information about your documents that helps your AI understand context. Think of it like labels on file folders - they help you find exactly what you need without opening every folder.

For example, a technical document might have metadata like:

  • Department: Engineering
  • Document Type: Requirements
  • Status: Approved
  • Last Updated: 2024-01-15

With metadata, your AI can answer questions like:

  • "What are the engineering requirements?" (filters to engineering docs only)
  • "Show me approved marketing strategies" (filters by status AND department)
  • "Find recent product updates" (filters by date and type)

Step 1: Create a File Upload Connector

First, create a connector to upload your documents, just as in Level 1.

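Below is a minimal sketch of creating the connector. The class and method names here (ConnectorsApi, create_source_connector, CreateSourceConnector) are assumptions based on the client's naming conventions; the Level 1 guide has the exact snippet for your client version.

# Create a File Upload source connector (names assumed; see Level 1)
connectors_api = v.ConnectorsApi(api)

try:
    response = connectors_api.create_source_connector(
        organization_id,
        [v.CreateSourceConnector(
            name="metadata-tutorial-files",  # hypothetical connector name
            type="FILE_UPLOAD",
            config={}
        )]
    )
    source_connector_id = response.connectors[0].id
    print(f"Connector created successfully! ID: {source_connector_id}")
except Exception as e:
    print(f"Error creating connector: {e}")
    raise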

Step 2: Upload Documents with Metadata

Now comes the key difference - when uploading documents, you'll attach metadata as a JSON string. This metadata will be stored alongside your document chunks in the vector database.

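Here is a minimal sketch of the upload flow, assuming the client exposes an UploadsApi whose start_file_upload_to_connector call returns a pre-signed upload URL; those names may differ in your client version. The essential point is that the metadata dictionary is serialized with json.dumps and passed as a string:

import json
import requests  # used to PUT the file bytes to the returned upload URL

uploads_api = v.UploadsApi(api)

file_path = "api_guide.md"
metadata = {
    "department": "engineering",
    "document_type": "documentation",
    "status": "published"
}

# Request an upload URL, attaching the metadata as a JSON string
start_response = uploads_api.start_file_upload_to_connector(
    organization_id,
    source_connector_id,
    v.StartFileUploadToConnectorRequest(
        name=file_path,
        content_type="text/markdown",
        metadata=json.dumps(metadata)  # the metadata travels as a JSON string
    )
)

# Upload the file to the pre-signed URL
with open(file_path, "rb") as f:
    upload = requests.put(
        start_response.upload_url,
        data=f,
        headers={"Content-Type": "text/markdown"}
    )
upload.raise_for_status()
print(f"Uploaded {file_path} with metadata attached")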

How Metadata Works

When you upload a document with metadata:

  1. The metadata is stored as a JSON string during upload
  2. Vectorize preserves this metadata with each chunk of your document
  3. You can later filter searches using these metadata fields

Important: Metadata values should have consistent types across documents. For example, if "year" is a string in one document, it should be a string in all documents.
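For example, a minimal sketch of preparing metadata for upload (the field names are illustrative):

import json

# Keep field names and value types identical across documents
metadata = {"department": "engineering", "status": "approved", "year": "2024"}
metadata_json = json.dumps(metadata)  # pass this string when uploading the file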

Step 3: Create Your Pipeline

Create a pipeline just like in Level 1. No special configuration is needed - Vectorize automatically handles metadata from your uploaded files:

# Create pipelines client
pipelines_api = v.PipelinesApi(api)

# Define your pipeline configuration
pipeline_configuration = v.PipelineConfigurationSchema(
    pipeline_name=pipeline_name,
    source_connectors=[
        v.PipelineSourceConnectorSchema(
            id=source_connector_id,
            type="FILE_UPLOAD",
            config={}
        )
    ],
    ai_platform_connector=v.PipelineAIPlatformConnectorSchema(
        id=ai_platform_connector_id,
        type="VECTORIZE",
        config={}
    ),
    destination_connector=v.PipelineDestinationConnectorSchema(
        id=destination_connector_id,
        type="VECTORIZE",
        config={}
    ),
    schedule=v.ScheduleSchema(type="manual")
)

# Create the pipeline
try:
    response = pipelines_api.create_pipeline(
        organization_id,
        pipeline_configuration
    )

    pipeline_id = response.data.id
    print(f"Pipeline created successfully! ID: {pipeline_id}")

except Exception as e:
    print(f"Error creating pipeline: {e}")
    raise

Step 4: Wait for Processing

Monitor your pipeline until it has finished processing your documents.

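A minimal polling sketch, assuming get_pipeline returns the pipeline's current status and that a manual-schedule pipeline reports LISTENING once processing is complete; check the status values your pipeline actually reports before relying on them:

import time

# Poll until the pipeline reports a ready state (status values are assumptions)
while True:
    pipeline = pipelines_api.get_pipeline(organization_id, pipeline_id)
    status = pipeline.data.status
    print(f"Pipeline status: {status}")
    if status == "LISTENING":  # assumed "finished processing" state
        break
    time.sleep(10)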

Step 5: Query Without Metadata Filters

First, let's see what happens when you search without any filters. This searches across all documents, regardless of their metadata.

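A minimal retrieval sketch, assuming the pipelines client exposes a retrieve_documents call with a RetrieveDocumentsRequest model; the names may differ in your client version:

retrieval_response = pipelines_api.retrieve_documents(
    organization_id,
    pipeline_id,
    v.RetrieveDocumentsRequest(
        question="What are the key requirements?",
        num_results=5
    )
)

for doc in retrieval_response.documents:
    # each chunk carries its text plus the metadata stored at upload time
    print(doc.text)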

Without filters, your search might return:

  • Marketing strategies when you wanted technical specs
  • Draft documents mixed with approved ones
  • Results from all departments and time periods

Step 6: Query With Metadata Filters

Now let's use metadata filters to get precise results. The retrieval endpoint supports exact-match filtering combined with AND/OR logic.

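The same retrieval call, now with filters. Assuming the Python request model accepts the filters as metadata_filters (mirroring the metadata-filters field in the JSON shape shown in the next section), each entry maps a metadata.-prefixed key to a list of allowed values:

retrieval_response = pipelines_api.retrieve_documents(
    organization_id,
    pipeline_id,
    v.RetrieveDocumentsRequest(
        question="What are the key requirements?",
        num_results=5,
        metadata_filters=[
            {"metadata.department": ["engineering"]},  # department = engineering
            {"metadata.status": ["approved"]}          # AND status = approved
        ]
    )
)

for doc in retrieval_response.documents:
    print(doc.text)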

Metadata Filter Syntax

The metadata filtering syntax is straightforward:

  • Use the metadata. prefix for user-defined metadata fields
  • Provide values as arrays (even for single values)
  • Multiple values for the same key use OR logic
  • Different keys use AND logic

Example filter combinations:

{
  "metadata-filters": [
    { "metadata.department": ["engineering", "product"] },  // engineering OR product
    { "metadata.status": ["approved"] }                      // AND status = approved
  ]
}

Step 7: Understanding the Impact

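To see the difference metadata makes, run the same question with and without filters and compare what comes back. A sketch, reusing the assumed retrieve_documents call from the previous steps:

def retrieve(question, filters=None):
    # thin wrapper around the assumed retrieve_documents call from Steps 5 and 6
    request = v.RetrieveDocumentsRequest(
        question=question,
        num_results=5,
        metadata_filters=filters
    )
    return pipelines_api.retrieve_documents(organization_id, pipeline_id, request).documents

question = "What are the key requirements?"

unfiltered = retrieve(question)
filtered = retrieve(question, [{"metadata.department": ["engineering"]}])

print(f"Unfiltered: {len(unfiltered)} chunks from any department or status")
print(f"Filtered:   {len(filtered)} chunks, all from engineering")

If the filtered call returns fewer but more relevant chunks, your metadata design is doing its job.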

Best Practices

Designing Effective Metadata

  1. Keep it consistent: Use the same field names and types across all documents
  2. Think about queries: Design metadata based on how users will search
  3. Don't overdo it: 5-10 well-chosen fields are better than 50 rarely-used ones

Common Metadata Patterns

Document Classification:

{
  "category": "technical",
  "subcategory": "api-docs",
  "version": "2.1"
}

Organizational Structure:

{
  "department": "engineering",
  "team": "backend",
  "project": "search-enhancement"
}

Temporal Information:

{
  "created_date": "2024-01-15",
  "quarter": "Q1-2024",
  "fiscal_year": "2024"
}

Access Control:

{
  "access_level": "public",
  "audience": "developers",
  "region": "north-america"
}

Limitations to Keep in Mind

Vectorize's retrieval endpoint currently supports:

  • ✅ Exact match filtering only
  • ✅ AND logic between different metadata keys
  • ✅ OR logic between values for the same key
  • ❌ No range queries (like date > "2024-01-01")
  • ❌ No partial matches or wildcards
  • ❌ No complex nested queries

If you need advanced filtering, consider using a bring-your-own vector database with native query capabilities.
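One workaround within exact-match filtering is to encode time as discrete buckets, as in the quarter field above, and OR together the buckets you need:

{
  "metadata-filters": [
    { "metadata.quarter": ["Q3-2024", "Q4-2024"] }  // approximates "second half of 2024"
  ]
}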

What's Next?

Now that you've mastered metadata filtering, you're ready to build more sophisticated AI applications.

Complete Code Example

Here's a complete example that ties everything together:

import vectorize_client as v
import json
import time

# Initialize the client (see Level 1 for the full setup)
api_client = ...  # your initialization here
organization_id = "your-org-id"

# Create connectors and pipeline (see snippets above)
# ...

# Upload documents with metadata
documents = [
    {
        "filename": "q4_report.pdf",
        "metadata": {
            "department": "finance",
            "document_type": "report",
            "quarter": "Q4-2023",
            "status": "final"
        }
    },
    {
        "filename": "api_guide.md",
        "metadata": {
            "department": "engineering",
            "document_type": "documentation",
            "version": "2.1",
            "status": "published"
        }
    }
]

# Upload each document with its metadata
for doc in documents:
    # See the upload snippet in Step 2 for the full implementation
    pass

# Query with filters
filters = [
    { "metadata.department": ["engineering"] },
    { "metadata.status": ["published", "final"] }
]

# This will return only engineering documents that are published or final

Troubleshooting

Metadata not appearing in results?

  • Ensure metadata is passed as a JSON string during upload
  • Check that your pipeline has finished processing
  • Verify field names match exactly (case-sensitive)

Filters not working as expected?

  • Remember to use the metadata. prefix for user-defined fields
  • Use exact matches only - no wildcards or partial matches
  • Check your AND/OR logic between filters

Getting too many/few results?

  • Review your filter logic (AND between keys, OR within values)
  • Consider if your metadata design matches your query patterns
  • Test with fewer filters first, then add more

Congratulations! You've learned how to enhance your RAG pipeline with metadata for more intelligent, context-aware AI applications. Continue to Level 3 to build agents that can understand and work with structured data.
