
Make Your AI Smarter with Metadata

In this guide, you'll learn how to enhance your retrieval pipeline with metadata to create more precise, context-aware results. By adding metadata to your documents, you can filter search results, improve relevance, and supply richer context to your connected LLM or application.

Prerequisites

Before you begin, you'll need:

  1. A Vectorize account
  2. An API access token (how to create one)
  3. Your organization ID (see below)

Finding your Organization ID

Your organization ID is in the Vectorize platform URL:

https://platform.vectorize.io/organization/[YOUR-ORG-ID]

For example, if your URL is:

https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb

Your organization ID is: ecf3fa1d-30d0-4df1-8af6-f4852bc851cb

API Client Setup

import vectorize_client as v
import os

# Get credentials from environment variables
organization_id = os.environ.get("VECTORIZE_ORGANIZATION_ID")
api_key = os.environ.get("VECTORIZE_API_KEY")

if not organization_id or not api_key:
    raise ValueError("Please set VECTORIZE_ORGANIZATION_ID and VECTORIZE_API_KEY environment variables")

# Initialize the API client
configuration = v.Configuration(
    host="https://api.vectorize.io",
    api_key={"ApiKeyAuth": api_key}
)
api = v.ApiClient(configuration)

print(f"✅ API client initialized for organization: {organization_id}")

What You'll Build

You'll create a RAG pipeline that processes documents with rich metadata, enabling you to:

  • Filter searches by project, document type, or any custom field
  • Build a context-aware retrieval pipeline that understands document relationships
  • Create project-based or category-specific search experiences

Understanding Metadata in RAG

Metadata is additional information about your documents that helps retrieval systems understand context. Think of it like labels on file folders - they help you find exactly what you need without opening every folder.

For example, a technical document might have metadata like:

  • Project: apollo
  • Document Type: requirements
  • Status: approved
  • Year: 2024
  • Priority: high

With metadata, your queries can answer things like:

  • "What are the apollo project requirements?" (filters by project)
  • "Show me approved documents from 2024" (filters by status and year)

Step 1: Create a File Upload Connector

First, create a connector to upload your documents:

import vectorize_client as v

# Create the connectors API client
connectors_api = v.SourceConnectorsApi(api)

try:
    # Create a file upload connector
    file_upload = v.FileUpload(
        name="metadata-enhanced-documents",
        type="FILE_UPLOAD",
        config={}
    )

    request = v.CreateSourceConnectorRequest(file_upload)
    response = connectors_api.create_source_connector(
        organization_id,
        request
    )

    source_connector_id = response.connector.id
    print(f"✅ Created file upload connector: {source_connector_id}")

except Exception as e:
    print(f"❌ Error creating connector: {e}")
    raise

Step 2: Upload Documents with Metadata

Now upload documents while attaching metadata as a JSON string:

import vectorize_client as v
import os
import json
import urllib3

# Create uploads API client
uploads_api = v.UploadsApi(api)
http = urllib3.PoolManager()

# Example: Upload a document with its metadata
# Download sample files from: /files/metadata-sample-docs.zip

file_path = "product_requirements.txt"
metadata = {
    "document_type": "requirements",
    "project": "apollo",
    "year": "2024",
    "status": "approved",
    "created_date": "2024-01-15",
    "priority": "high"
}

try:
    # Convert metadata to JSON string
    metadata_json = json.dumps(metadata)

    # Step 1: Get upload URL with metadata
    upload_request = v.StartFileUploadToConnectorRequest(
        name=file_path,
        content_type="text/plain",
        metadata=metadata_json  # Metadata as JSON string
    )

    start_response = uploads_api.start_file_upload_to_connector(
        organization_id,
        source_connector_id,
        start_file_upload_to_connector_request=upload_request
    )

    # Step 2: Upload the file to the signed URL
    with open(file_path, "rb") as f:
        response = http.request(
            "PUT",
            start_response.upload_url,
            body=f,
            headers={
                "Content-Type": "text/plain",
                "Content-Length": str(os.path.getsize(file_path))
            }
        )

    if response.status == 200:
        print(f"✅ Uploaded: {file_path}")
        print(f"   Metadata: {list(metadata.keys())}")

except Exception as e:
    print(f"❌ Error uploading {file_path}: {e}")

# Repeat for other documents with different metadata
# See the sample files for more examples
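For instance, one pattern is to loop over several files, each with its own metadata. This sketch reuses the client objects from the block above; the second file name and its metadata values are illustrative:

documents = [
    ("product_requirements.txt",
     {"project": "apollo", "document_type": "requirements",
      "year": "2024", "status": "approved"}),
    ("design_overview.txt",
     {"project": "mercury", "document_type": "design",
      "year": "2025", "status": "draft"}),
]

for file_path, metadata in documents:
    # Request an upload URL, attaching the metadata as a JSON string
    upload_request = v.StartFileUploadToConnectorRequest(
        name=file_path,
        content_type="text/plain",
        metadata=json.dumps(metadata)
    )
    start_response = uploads_api.start_file_upload_to_connector(
        organization_id,
        source_connector_id,
        start_file_upload_to_connector_request=upload_request
    )

    # Upload the file contents to the signed URL
    with open(file_path, "rb") as f:
        http.request(
            "PUT",
            start_response.upload_url,
            body=f,
            headers={
                "Content-Type": "text/plain",
                "Content-Length": str(os.path.getsize(file_path))
            }
        )
    print(f"✅ Uploaded {file_path} with metadata: {list(metadata.keys())}")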

How Metadata Works

Vectorize stores your metadata with each chunk of your document, so filters are applied before retrieval returns results to your model.

When you upload a document with metadata:

  1. The metadata is stored as a JSON string during upload.
  2. Vectorize preserves this metadata with each chunk of your document.
  3. You can filter searches using these metadata fields.

Note: Metadata values should use consistent types across documents (e.g., "year": "2024" as a string everywhere).
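A small helper can enforce that consistency by coercing every value to a string before serialization. This is an illustrative sketch, not part of the Vectorize client:

import json

def normalize_metadata(metadata: dict) -> str:
    """Coerce all metadata values to strings so types stay consistent across documents."""
    return json.dumps({key: str(value) for key, value in metadata.items()})

# Example: the integer 2024 becomes the string "2024"
metadata_json = normalize_metadata({"project": "apollo", "year": 2024})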

Metadata Best Practices

  • Use consistent field names and types - Always use the same casing and data types
  • Keep values simple - Stick to strings and numbers for maximum compatibility
  • Plan your schema before uploading - Design your metadata structure upfront
  • Include only fields you'll use in queries - Avoid metadata bloat that won't be filtered on
  • Document your schema - Keep a reference of allowed fields and values for your team
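One lightweight way to keep that reference is in code alongside your upload scripts. The fields and allowed values below are illustrative:

# Team-wide metadata schema reference (illustrative values)
METADATA_SCHEMA = {
    "project": ["apollo", "mercury"],
    "document_type": ["requirements", "design", "faq"],
    "status": ["draft", "approved", "published"],
    "year": None,  # any four-digit year, stored as a string
    "priority": ["low", "medium", "high"],
}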

Step 3: Create Your Pipeline

Create a pipeline just like in the first guide. No special configuration is required for user-defined metadata - Vectorize automatically handles it:

import vectorize_client as v

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Create pipeline - metadata handling is automatic
    pipeline_config = v.PipelineConfigurationSchema(
        pipeline_name="Metadata-Enhanced RAG Pipeline",
        source_connectors=[
            v.PipelineSourceConnectorSchema(
                id=source_connector_id,
                type="FILE_UPLOAD",
                config={}  # No special config needed for metadata
            )
        ],
        ai_platform_connector=v.PipelineAIPlatformConnectorSchema(
            id=ai_platform_connector_id,
            type="VECTORIZE",
            config={}
        ),
        destination_connector=v.PipelineDestinationConnectorSchema(
            id=destination_connector_id,
            type="VECTORIZE",
            config={}
        ),
        schedule=v.ScheduleSchema(type="manual")
    )

    # Create the pipeline
    response = pipelines_api.create_pipeline(
        organization_id,
        pipeline_config
    )

    pipeline_id = response.data.id
    print(f"✅ Created pipeline: {pipeline_id}")
    print("   Documents with metadata will be automatically processed")
    print("   Metadata will be preserved and searchable")

except Exception as e:
    print(f"❌ Error creating pipeline: {e}")
    raise

For automatic metadata extraction, refer to Automatic Metadata Extraction.

Step 4: Wait for Processing

Monitor your pipeline until it's ready:

import vectorize_client as v
import time

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

print("Waiting for metadata extraction and indexing...")
max_wait_time = 300  # seconds
start_time = time.time()

while True:
    try:
        pipeline = pipelines_api.get_pipeline(organization_id, pipeline_id)
        status = pipeline.data.status

        if status == "LISTENING":
            print("✅ Pipeline ready with metadata indexes!")
            break
        elif status in ["ERROR_DEPLOYING", "SHUTDOWN"]:
            print(f"❌ Pipeline error: {status}")
            break

        if time.time() - start_time > max_wait_time:
            print("⏰ Timeout waiting for pipeline")
            break

        time.sleep(10)

    except Exception as e:
        print(f"❌ Error checking status: {e}")
        break

Step 5: Query Without Metadata Filters

First, query without any filters to see baseline behavior:

import vectorize_client as v

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Query without any metadata filters
    response = pipelines_api.retrieve_documents(
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(
            question="What are the technical requirements for the AI search?",
            num_results=5
        )
    )

    # Display results
    print("Query: 'What are the technical requirements for the AI search?'")
    print("Results without filtering (searches all documents):\n")

    for i, doc in enumerate(response.documents, 1):
        print(f"Result {i}:")
        print(f"   Content: {doc.text[:150]}...")
        print(f"   Relevance Score: {doc.relevancy}")
        print(f"   Document ID: {doc.id}")
        # Show metadata if available
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"   Project: {doc.metadata.get('project', 'N/A')}")
            print(f"   Year: {doc.metadata.get('year', 'N/A')}")
            print(f"   Status: {doc.metadata.get('status', 'N/A')}")
        print()

except Exception as e:
    print(f"❌ Error querying pipeline: {e}")
    raise

Without filters, your search might return:

  • Marketing documents when you wanted technical specs.
  • Draft content mixed with approved versions.
  • Results from all departments.

Step 6: Query With Metadata Filters

Now query using metadata filters for precise results:

import vectorize_client as v

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Query with project and date-based filters
    response = pipelines_api.retrieve_documents(
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(
            question="What are the technical requirements for the AI search?",
            num_results=5,
            metadata_filters=[
                {
                    "metadata.project": ["apollo", "mercury"]  # Project tags
                },
                {
                    "metadata.year": ["2024", "2025"]  # Target years
                },
                {
                    "metadata.status": ["approved", "published"]  # Only finalized docs
                }
            ]
        )
    )

    # Display filtered results
    print("Query: 'What are the technical requirements for the AI search?'")
    print("Filters: project IN (apollo, mercury) AND year IN (2024, 2025) AND status IN (approved, published)")
    print("Results (recent approved docs from specific projects):\n")

    for i, doc in enumerate(response.documents, 1):
        print(f"Result {i}:")
        print(f"   Content: {doc.text[:150]}...")
        print(f"   Relevance Score: {doc.relevancy}")
        print(f"   Document ID: {doc.id}")
        # Show metadata to confirm filtering worked
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"   Project: {doc.metadata.get('project', 'N/A')}")
            print(f"   Year: {doc.metadata.get('year', 'N/A')}")
            print(f"   Status: {doc.metadata.get('status', 'N/A')}")
            print(f"   Document Type: {doc.metadata.get('document_type', 'N/A')}")
        print()

except Exception as e:
    print(f"❌ Error querying pipeline: {e}")
    raise

Metadata Filter Syntax

  • Use the metadata. prefix for user-defined fields.
  • Provide values as arrays (even for single values).
  • Multiple values for the same key use OR logic.
  • Different keys use AND logic.

Example filter structure:

[
    { "metadata.project": ["apollo", "mercury"] },
    { "metadata.status": ["approved"] }
]

This filter returns documents from either apollo OR mercury projects that are also approved.

Step 7: Compare the Impact of Metadata

Let's see the dramatic difference metadata filtering makes. Here's what you might see:

Without Metadata Filters

Query: "What are our API rate limits?"

Results might include:
- Marketing blog post mentioning API limits
- Draft engineering spec with outdated limits
- Customer FAQ about rate limits
- Internal discussion about changing limits

With Metadata Filters

Query: "What are our API rate limits?"
Filters: project=apollo, status=approved

Results now include only:
- Approved apollo project documentation with current rate limits
- Official API specification documents from the apollo project

The filtered results are more accurate, authoritative, and relevant to your needs.

Using Visual Schema Editor (Optional)

For pipelines that use the Iris model, Vectorize includes a Visual Schema Editor that can automatically extract metadata based on defined schemas. This is especially useful when you have consistent document structures.

When to Use Automatic Metadata Extraction

  • Structured documents: Technical specs, contracts, reports with consistent sections
  • Standardized formats: Documents following templates
  • Large volumes: When manual metadata tagging isn't practical

Enabling Automatic Extraction

  1. Navigate to your pipeline in the Vectorize platform
  2. Click on the Schema tab
  3. Use the Visual Schema Editor to define extraction rules
  4. Save and redeploy your pipeline

The schema editor allows you to:

  • Define metadata fields to extract
  • Set extraction rules based on document structure
  • Preview extraction results
  • Combine with manual metadata

Best Practices for Metadata

1. Keep It Consistent

# Good: Consistent types and values
metadata = {
    "project": "apollo",    # Always lowercase
    "year": "2024",         # Always string
    "status": "approved"    # Consistent values
}

# Bad: Inconsistent types and formats
metadata = {
    "project": "Apollo",    # Sometimes capital (BAD!)
    "year": 2024,           # Sometimes number
    "status": "Approved"    # Inconsistent casing
}

2. Plan Your Schema

Before uploading documents, decide on:

  • Essential metadata fields (3-7 is usually optimal)
  • Allowed values for each field
  • Naming conventions (use lowercase with underscores)
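A small validation helper can enforce those decisions before anything is uploaded. This sketch assumes a schema dictionary like the METADATA_SCHEMA example shown earlier:

def validate_metadata(metadata: dict, schema: dict) -> None:
    """Raise ValueError if metadata uses unknown fields or disallowed values."""
    for key, value in metadata.items():
        if key not in schema:
            raise ValueError(f"Unknown metadata field: {key}")
        allowed = schema[key]
        # A schema entry of None means any string value is accepted
        if allowed is not None and value not in allowed:
            raise ValueError(f"Invalid value {value!r} for field {key!r}")

# Passes silently; raises if a field or value falls outside the schema
validate_metadata({"project": "apollo", "status": "approved"}, METADATA_SCHEMA)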

3. Use Metadata for Business Logic

# Filter for recent, approved documents in selected projects
filters = [
    {"metadata.project": ["apollo", "mercury"]},
    {"metadata.year": ["2024", "2025"]},
    {"metadata.status": ["approved", "published"]}
]

What's Next?

You've now built a metadata-enhanced RAG pipeline that can:

  • Process documents with rich context
  • Filter results based on business needs
  • Provide more accurate, relevant answers

Next Steps

  • For simple use cases: You're ready to deploy! Start uploading your documents with metadata.
  • For complex scenarios: Explore automatic metadata extraction for large document sets.

Quick Tips

  1. Start with 3-5 metadata fields and expand as needed
  2. Test your metadata filters with diverse queries
  3. Monitor which metadata fields provide the most value
  4. Consider combining manual and automatic metadata extraction
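For tip 2, a quick harness that runs the same question with and without filters makes it easy to spot where filtering changes the results. This sketch reuses the pipelines_api client and pipeline_id from Steps 5 and 6; the test questions and filters are illustrative:

test_cases = [
    ("What are the apollo project requirements?",
     [{"metadata.project": ["apollo"]}]),
    ("Show me approved documents from 2024",
     [{"metadata.status": ["approved"]}, {"metadata.year": ["2024"]}]),
]

for question, filters in test_cases:
    # Same question, once unfiltered and once with metadata filters
    unfiltered = pipelines_api.retrieve_documents(
        organization_id, pipeline_id,
        v.RetrieveDocumentsRequest(question=question, num_results=3)
    )
    filtered = pipelines_api.retrieve_documents(
        organization_id, pipeline_id,
        v.RetrieveDocumentsRequest(question=question, num_results=3,
                                   metadata_filters=filters)
    )
    print(f"{question}: {len(unfiltered.documents)} unfiltered "
          f"vs {len(filtered.documents)} filtered results")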

Congratulations! You've learned how to make your AI significantly smarter with metadata. Your RAG pipeline can now provide contextual, filtered responses that match your specific business needs.
