
Make Your AI Smarter with Metadata

In this guide, you'll learn how to enhance your RAG pipeline with metadata to create more precise, context-aware AI applications. By adding metadata to your documents, you can filter search results, improve relevance, and build smarter retrieval workflows.

Prerequisites

Before you begin, you'll need:

  1. A Vectorize account
  2. An API access token (how to create one)
  3. Your organization ID (see below)

Finding your Organization ID

Your organization ID is in the Vectorize platform URL:

https://platform.vectorize.io/organization/[YOUR-ORG-ID]

For example, if your URL is:

https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb

Your organization ID is: ecf3fa1d-30d0-4df1-8af6-f4852bc851cb

API Client Setup

import vectorize_client as v
import os

# Get credentials from environment variables
organization_id = os.environ.get("VECTORIZE_ORGANIZATION_ID")
api_key = os.environ.get("VECTORIZE_API_KEY")

if not organization_id or not api_key:
    raise ValueError("Please set VECTORIZE_ORGANIZATION_ID and VECTORIZE_API_KEY environment variables")

# Initialize the API client
configuration = v.Configuration(
    host="https://api.vectorize.io",
    api_key={"ApiKeyAuth": api_key}
)
api = v.ApiClient(configuration)

print(f"✅ API client initialized for organization: {organization_id}")

What You'll Build

You'll create a RAG pipeline that processes documents with rich metadata, enabling you to:

  • Filter searches by department, document type, or any custom field
  • Build context-aware retrieval that understands document relationships
  • Create role-based or category-specific search experiences

Understanding Metadata in RAG

Metadata is additional information about your documents that helps retrieval systems understand context. Think of it like labels on file folders — they help you find exactly what you need without opening every folder.

For example, a technical document might have metadata like:

  • Department: Engineering
  • Document Type: Requirements
  • Status: Approved
  • Last Updated: 2025-01-15
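Expressed as a Python dict attached at upload time, that example might look like this (a sketch; the field names are illustrative):

# Illustrative metadata for the technical document above
example_metadata = {
    "department": "Engineering",
    "document_type": "Requirements",
    "status": "Approved",
    "last_updated": "2025-01-15"  # keep dates as strings for consistent filtering
}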

With metadata, your queries can answer things like:

  • "What are the engineering requirements?" (filters by department)
  • "Show me approved marketing strategies" (filters by status and department)

Step 1: Create a File Upload Connector

First, create a connector to upload your documents:

# Create the connectors API client
connectors_api = v.SourceConnectorsApi(api)

try:
    # Create a file upload connector
    file_upload = v.FileUpload(
        name="metadata-enhanced-documents",
        type="FILE_UPLOAD",
        config={}
    )

    request = v.CreateSourceConnectorRequest(file_upload)
    response = connectors_api.create_source_connector(
        organization_id,
        request
    )

    # Later steps refer to this connector as source_connector_id
    source_connector_id = response.connector.id
    print(f"✅ Created file upload connector: {source_connector_id}")

except Exception as e:
    print(f"❌ Error creating connector: {e}")
    raise

Step 2: Upload Documents with Metadata

Now upload documents while attaching metadata as a JSON string:

import json
import os
import urllib3

# Create uploads API client
uploads_api = v.UploadsApi(api)
http = urllib3.PoolManager()

# Example: Upload a document with its metadata
# Download sample files from: /files/metadata-sample-docs.zip

file_path = "product_requirements.txt"
metadata = {
    "department": "engineering",
    "document_type": "requirements",
    "project": "ai-search",
    "status": "draft",
    "created_date": "2024-01-15",
    "priority": "high"
}

try:
    # Convert metadata to JSON string
    metadata_json = json.dumps(metadata)

    # Step 1: Get upload URL with metadata
    upload_request = v.StartFileUploadToConnectorRequest(
        name=file_path,
        content_type="text/plain",
        metadata=metadata_json  # Metadata as JSON string
    )

    start_response = uploads_api.start_file_upload_to_connector(
        organization_id,
        source_connector_id,
        start_file_upload_to_connector_request=upload_request
    )

    # Step 2: Upload the file to the returned URL
    with open(file_path, "rb") as f:
        response = http.request(
            "PUT",
            start_response.upload_url,
            body=f,
            headers={
                "Content-Type": "text/plain",
                "Content-Length": str(os.path.getsize(file_path))
            }
        )

    if response.status == 200:
        print(f"✅ Uploaded: {file_path}")
        print(f"   Metadata: {list(metadata.keys())}")

except Exception as e:
    print(f"❌ Error uploading {file_path}: {e}")

# Repeat for other documents with different metadata
# See the sample files for more examples
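If you downloaded the sample files, a simple loop repeats the same two-step upload for each document with its own metadata. This is a sketch; the file names and metadata values below are illustrative:

# Illustrative batch upload: file names and metadata values are examples only
documents = [
    ("marketing_strategy.txt", {"department": "marketing", "document_type": "strategy", "status": "approved"}),
    ("api_architecture.txt", {"department": "engineering", "document_type": "architecture", "status": "approved"}),
]

for path, meta in documents:
    try:
        # Step 1: Get an upload URL, passing the metadata as a JSON string
        upload_request = v.StartFileUploadToConnectorRequest(
            name=path,
            content_type="text/plain",
            metadata=json.dumps(meta)
        )
        start_response = uploads_api.start_file_upload_to_connector(
            organization_id,
            source_connector_id,
            start_file_upload_to_connector_request=upload_request
        )

        # Step 2: PUT the file contents to the returned URL
        with open(path, "rb") as f:
            r = http.request(
                "PUT",
                start_response.upload_url,
                body=f,
                headers={"Content-Type": "text/plain", "Content-Length": str(os.path.getsize(path))}
            )
        print(f"{'✅' if r.status == 200 else '❌'} {path} (HTTP {r.status})")
    except Exception as e:
        print(f"❌ Error uploading {path}: {e}")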

How Metadata Works

When you upload a document with metadata:

  1. The metadata is stored as a JSON string during upload.
  2. Vectorize preserves this metadata with each chunk of your document.
  3. You can filter searches using these metadata fields.

Note: Metadata values should use consistent types across documents (e.g., "year": "2024" as a string everywhere).
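For example, if one document stores a year as a string, store it as a string everywhere (field names below are illustrative):

# Consistent: "year" is a string in both documents
json.dumps({"department": "finance", "year": "2024"})
json.dumps({"department": "engineering", "year": "2025"})

# Avoid mixing "year": 2024 (number) in one document with "year": "2024"
# (string) in another; inconsistent types make filtering less predictable.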

Step 3: Create Your Pipeline

Create a pipeline just like in the first guide. No special configuration is required for user-defined metadata — Vectorize automatically handles it:

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Create pipeline - metadata handling is automatic
    # ai_platform_connector_id and destination_connector_id come from the
    # connectors you created in the first guide
    pipeline_config = v.PipelineConfigurationSchema(
        pipeline_name="Metadata-Enhanced RAG Pipeline",
        source_connectors=[
            v.PipelineSourceConnectorSchema(
                id=source_connector_id,
                type="FILE_UPLOAD",
                config={}  # No special config needed for metadata
            )
        ],
        ai_platform_connector=v.PipelineAIPlatformConnectorSchema(
            id=ai_platform_connector_id,
            type="VECTORIZE",
            config={}
        ),
        destination_connector=v.PipelineDestinationConnectorSchema(
            id=destination_connector_id,
            type="VECTORIZE",
            config={}
        ),
        schedule=v.ScheduleSchema(type="manual")
    )

    # Create the pipeline
    response = pipelines_api.create_pipeline(
        organization_id,
        pipeline_config
    )

    pipeline_id = response.data.id
    print(f"✅ Created pipeline: {pipeline_id}")
    print("   Documents with metadata will be automatically processed")
    print("   Metadata will be preserved and searchable")

except Exception as e:
    print(f"❌ Error creating pipeline: {e}")
    raise

For automatic metadata extraction, refer to Automatic Metadata Extraction.

Step 4: Wait for Processing

Monitor your pipeline until it's ready:

import time

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

print("Waiting for metadata extraction and indexing...")
max_wait_time = 300  # seconds
start_time = time.time()

while True:
    try:
        pipeline = pipelines_api.get_pipeline(organization_id, pipeline_id)
        status = pipeline.data.status

        if status == "LISTENING":
            print("✅ Pipeline ready with metadata indexes!")
            break
        elif status in ["ERROR_DEPLOYING", "SHUTDOWN"]:
            print(f"❌ Pipeline error: {status}")
            break

        if time.time() - start_time > max_wait_time:
            print("⏰ Timeout waiting for pipeline")
            break

        time.sleep(10)

    except Exception as e:
        print(f"❌ Error checking status: {e}")
        break

Step 5: Query Without Metadata Filters

First, query without any filters to see baseline behavior:

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Query without any metadata filters
    response = pipelines_api.retrieve_documents(
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(
            question="What are the technical requirements for the AI search?",
            num_results=5
        )
    )

    # Display results
    print("Query: 'What are the technical requirements for the AI search?'")
    print("Results without filtering (searches all documents):\n")

    for i, doc in enumerate(response.documents, 1):
        print(f"Result {i}:")
        print(f"  Content: {doc.text[:150]}...")
        print(f"  Relevance Score: {doc.relevancy}")
        print(f"  Document ID: {doc.id}")
        # Show metadata if available
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"  Department: {doc.metadata.get('department', 'N/A')}")
            print(f"  Document Type: {doc.metadata.get('document_type', 'N/A')}")
        print()

except Exception as e:
    print(f"❌ Error querying pipeline: {e}")
    raise

Without filters, your search might return:

  • Marketing documents when you wanted technical specs.
  • Draft content mixed with approved versions.
  • Results from all departments.

Step 6: Query With Metadata Filters

Now query using metadata filters for precise results:

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Query with metadata filters for engineering documents only
    response = pipelines_api.retrieve_documents(
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(
            question="What are the technical requirements for the AI search?",
            num_results=5,
            metadata_filters=[
                {
                    "metadata.department": ["engineering"]
                },
                {
                    "metadata.document_type": ["requirements", "architecture"]
                }
            ]
        )
    )

    # Display filtered results
    print("Query: 'What are the technical requirements for the AI search?'")
    print("Filters: department=engineering AND type IN (requirements, architecture)")
    print("Results (engineering docs only):\n")

    for i, doc in enumerate(response.documents, 1):
        print(f"Result {i}:")
        print(f"  Content: {doc.text[:150]}...")
        print(f"  Relevance Score: {doc.relevancy}")
        print(f"  Document ID: {doc.id}")
        # Show metadata to confirm filtering worked
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"  Department: {doc.metadata.get('department', 'N/A')}")
            print(f"  Document Type: {doc.metadata.get('document_type', 'N/A')}")
            print(f"  Project: {doc.metadata.get('project', 'N/A')}")
        print()

except Exception as e:
    print(f"❌ Error querying pipeline: {e}")
    raise

Metadata Filter Syntax

  • Use the metadata. prefix for user-defined fields.
  • Provide values as arrays (even for single values).
  • Multiple values for the same key use OR logic.
  • Different keys use AND logic.

Example filter structure:

[
  { "metadata.department": ["engineering", "product"] },
  { "metadata.status": ["approved"] }
]
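Passed to the retrieval call, the filter above matches documents whose department is engineering OR product, AND whose status is approved. A minimal sketch reusing the request shape from Step 6 (the question text is illustrative):

# (department = engineering OR product) AND (status = approved)
request = v.RetrieveDocumentsRequest(
    question="Show me approved plans",  # illustrative question
    num_results=5,
    metadata_filters=[
        {"metadata.department": ["engineering", "product"]},  # OR within a key
        {"metadata.status": ["approved"]}                      # AND across keys
    ]
)
response = pipelines_api.retrieve_documents(organization_id, pipeline_id, request)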
