
Make Your AI Smarter with Metadata

In this guide, you'll learn how to enhance your retrieval pipeline with metadata to create more precise, context-aware results. By adding metadata to your documents, you can filter search results, improve relevance, and supply richer context to your connected LLM or application.

Prerequisites

Before you begin, you'll need:

  1. A Vectorize account
  2. An API access token (how to create one)
  3. Your organization ID (see below)

Finding your Organization ID

Your organization ID is in the Vectorize platform URL:

https://platform.vectorize.io/organization/[YOUR-ORG-ID]

For example, if your URL is:

https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb

Your organization ID is: ecf3fa1d-30d0-4df1-8af6-f4852bc851cb

API Client Setup

import vectorize_client as v
import os

# Get credentials from environment variables
organization_id = os.environ.get("VECTORIZE_ORGANIZATION_ID")
api_key = os.environ.get("VECTORIZE_API_KEY")

if not organization_id or not api_key:
    raise ValueError("Please set VECTORIZE_ORGANIZATION_ID and VECTORIZE_API_KEY environment variables")

# Initialize the API client
configuration = v.Configuration(
    host="https://api.vectorize.io",
    api_key={"ApiKeyAuth": api_key}
)
api = v.ApiClient(configuration)

print(f"✅ API client initialized for organization: {organization_id}")

What You'll Build

You'll create a RAG pipeline that processes documents with rich metadata, enabling you to:

  • Filter searches by project, document type, or any custom field
  • Build a context-aware retrieval pipeline that understands document relationships
  • Create project-based or category-specific search experiences

Understanding Metadata in RAG

Metadata is additional information about your documents that helps retrieval systems understand context. Think of it like labels on file folders - they help you find exactly what you need without opening every folder.

For example, a technical document might have metadata like:

  • Project: apollo
  • Document Type: requirements
  • Status: approved
  • Year: 2024
  • Priority: high

With metadata, your queries can answer things like:

  • "What are the apollo project requirements?" (filters by project)
  • "Show me approved documents from 2024" (filters by status and year)

Step 1: Create a File Upload Connector

First, create a connector to upload your documents:

import vectorize_client as v

# Create the connectors API client
connectors_api = v.SourceConnectorsApi(api)

try:
    # Create a file upload connector
    file_upload = v.FileUpload(
        name="metadata-enhanced-documents",
        type="FILE_UPLOAD",
        config={}
    )

    request = v.CreateSourceConnectorRequest(file_upload)
    response = connectors_api.create_source_connector(
        organization_id,
        request
    )

    source_connector_id = response.connector.id
    print(f"✅ Created file upload connector: {source_connector_id}")

except Exception as e:
    print(f"❌ Error creating connector: {e}")
    raise

Step 2: Upload Documents with Metadata

Now upload documents while attaching metadata as a JSON string:

import vectorize_client as v
import os
import json
import urllib3

# Create uploads API client
uploads_api = v.UploadsApi(api)
http = urllib3.PoolManager()

# Example: Upload a document with its metadata
# Download sample files from: /files/metadata-sample-docs.zip

file_path = "product_requirements.txt"
metadata = {
    "document_type": "requirements",
    "project": "apollo",
    "year": "2024",
    "status": "approved",
    "created_date": "2024-01-15",
    "priority": "high"
}

try:
    # Convert metadata to JSON string
    metadata_json = json.dumps(metadata)

    # Step 1: Get upload URL with metadata
    upload_request = v.StartFileUploadToConnectorRequest(
        name=file_path,
        content_type="text/plain",
        metadata=metadata_json  # Metadata as JSON string
    )

    start_response = uploads_api.start_file_upload_to_connector(
        organization_id,
        source_connector_id,
        start_file_upload_to_connector_request=upload_request
    )

    # Step 2: Upload the file to the signed URL
    with open(file_path, "rb") as f:
        response = http.request(
            "PUT",
            start_response.upload_url,
            body=f,
            headers={
                "Content-Type": "text/plain",
                "Content-Length": str(os.path.getsize(file_path))
            }
        )

    if response.status == 200:
        print(f"✅ Uploaded: {file_path}")
        print(f"   Metadata: {list(metadata.keys())}")

except Exception as e:
    print(f"❌ Error uploading {file_path}: {e}")

# Repeat for other documents with different metadata
# See the sample files for more examples
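For instance, one pattern is to loop over several files, each with its own metadata. This sketch reuses the client objects from the block above; the second file name and its metadata values are illustrative:

documents = [
    ("product_requirements.txt",
     {"project": "apollo", "document_type": "requirements",
      "year": "2024", "status": "approved"}),
    ("design_overview.txt",
     {"project": "mercury", "document_type": "design",
      "year": "2025", "status": "draft"}),
]

for file_path, metadata in documents:
    # Request an upload URL, attaching the metadata as a JSON string
    upload_request = v.StartFileUploadToConnectorRequest(
        name=file_path,
        content_type="text/plain",
        metadata=json.dumps(metadata)
    )
    start_response = uploads_api.start_file_upload_to_connector(
        organization_id,
        source_connector_id,
        start_file_upload_to_connector_request=upload_request
    )

    # Upload the file contents to the signed URL
    with open(file_path, "rb") as f:
        http.request(
            "PUT",
            start_response.upload_url,
            body=f,
            headers={
                "Content-Type": "text/plain",
                "Content-Length": str(os.path.getsize(file_path))
            }
        )
    print(f"✅ Uploaded {file_path} with metadata: {list(metadata.keys())}")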

How Metadata Works

Vectorize stores your metadata with each chunk of your document, so filters are applied before retrieval returns results to your model.

When you upload a document with metadata:

  1. The metadata is stored as a JSON string during upload.
  2. Vectorize preserves this metadata with each chunk of your document.
  3. You can filter searches using these metadata fields.

Note: Metadata values should use consistent types across documents (e.g., "year": "2024" as a string everywhere).
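A small helper can enforce that consistency by coercing every value to a string before serialization. This is an illustrative sketch, not part of the Vectorize client:

import json

def normalize_metadata(metadata: dict) -> str:
    """Coerce all metadata values to strings so types stay consistent across documents."""
    return json.dumps({key: str(value) for key, value in metadata.items()})

# Example: the integer 2024 becomes the string "2024"
metadata_json = normalize_metadata({"project": "apollo", "year": 2024})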

Metadata Best Practices

  • Use consistent field names and types - Always use the same casing and data types
  • Keep values simple - Stick to strings and numbers for maximum compatibility
  • Plan your schema before uploading - Design your metadata structure upfront
  • Include only fields you'll use in queries - Avoid metadata bloat that won't be filtered on
  • Document your schema - Keep a reference of allowed fields and values for your team
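One lightweight way to keep that reference is in code alongside your upload scripts. The fields and allowed values below are illustrative:

# Team-wide metadata schema reference (illustrative values)
METADATA_SCHEMA = {
    "project": ["apollo", "mercury"],
    "document_type": ["requirements", "design", "faq"],
    "status": ["draft", "approved", "published"],
    "year": None,  # any four-digit year, stored as a string
    "priority": ["low", "medium", "high"],
}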

Step 3: Create Your Pipeline

Create a pipeline just like in the first guide. No special configuration is required for user-defined metadata - Vectorize automatically handles it:

import vectorize_client as v

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Create pipeline - metadata handling is automatic
    pipeline_config = v.PipelineConfigurationSchema(
        pipeline_name="Metadata-Enhanced RAG Pipeline",
        source_connectors=[
            v.PipelineSourceConnectorSchema(
                id=source_connector_id,
                type="FILE_UPLOAD",
                config={}  # No special config needed for metadata
            )
        ],
        ai_platform_connector=v.PipelineAIPlatformConnectorSchema(
            id=ai_platform_connector_id,
            type="VECTORIZE",
            config={}
        ),
        destination_connector=v.PipelineDestinationConnectorSchema(
            id=destination_connector_id,
            type="VECTORIZE",
            config={}
        ),
        schedule=v.ScheduleSchema(type="manual")
    )

    # Create the pipeline
    response = pipelines_api.create_pipeline(
        organization_id,
        pipeline_config
    )

    pipeline_id = response.data.id
    print(f"✅ Created pipeline: {pipeline_id}")
    print("   Documents with metadata will be automatically processed")
    print("   Metadata will be preserved and searchable")

except Exception as e:
    print(f"❌ Error creating pipeline: {e}")
    raise

For automatic metadata extraction, refer to Automatic Metadata Extraction.

Step 4: Wait for Processing

Monitor your pipeline until it's ready:

import vectorize_client as v
import time

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

print("Waiting for metadata extraction and indexing...")
max_wait_time = 300  # seconds
start_time = time.time()

while True:
    try:
        pipeline = pipelines_api.get_pipeline(organization_id, pipeline_id)
        status = pipeline.data.status

        if status == "LISTENING":
            print("✅ Pipeline ready with metadata indexes!")
            break
        elif status in ["ERROR_DEPLOYING", "SHUTDOWN"]:
            print(f"❌ Pipeline error: {status}")
            break

        if time.time() - start_time > max_wait_time:
            print("⏰ Timeout waiting for pipeline")
            break

        time.sleep(10)

    except Exception as e:
        print(f"❌ Error checking status: {e}")
        break

Step 5: Query Without Metadata Filters

First, query without any filters to see baseline behavior:

import vectorize_client as v

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Query without any metadata filters
    response = pipelines_api.retrieve_documents(
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(
            question="What are the technical requirements for the AI search?",
            num_results=5
        )
    )

    # Display results
    print("Query: 'What are the technical requirements for the AI search?'")
    print("Results without filtering (searches all documents):\n")

    for i, doc in enumerate(response.documents, 1):
        print(f"Result {i}:")
        print(f"   Content: {doc.text[:150]}...")
        print(f"   Relevance Score: {doc.relevancy}")
        print(f"   Document ID: {doc.id}")
        # Show metadata if available
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"   Project: {doc.metadata.get('project', 'N/A')}")
            print(f"   Year: {doc.metadata.get('year', 'N/A')}")
            print(f"   Status: {doc.metadata.get('status', 'N/A')}")
        print()

except Exception as e:
    print(f"❌ Error querying pipeline: {e}")
    raise

Without filters, your search might return:

  • Marketing documents when you wanted technical specs.
  • Draft content mixed with approved versions.
  • Results from all departments.

Step 6: Query With Metadata Filters

Now query using metadata filters for precise results:

import vectorize_client as v

# Create pipelines API client
pipelines_api = v.PipelinesApi(api)

try:
    # Query with project and date-based filters
    response = pipelines_api.retrieve_documents(
        organization_id,
        pipeline_id,
        v.RetrieveDocumentsRequest(
            question="What are the technical requirements for the AI search?",
            num_results=5,
            metadata_filters=[
                {
                    "metadata.project": ["apollo", "mercury"]  # Project tags
                },
                {
                    "metadata.year": ["2024", "2025"]  # Target years
                },
                {
                    "metadata.status": ["approved", "published"]  # Only finalized docs
                }
            ]
        )
    )

    # Display filtered results
    print("Query: 'What are the technical requirements for the AI search?'")
    print("Filters: project IN (apollo, mercury) AND year IN (2024, 2025) AND status IN (approved, published)")
    print("Results (recent approved docs from specific projects):\n")

    for i, doc in enumerate(response.documents, 1):
        print(f"Result {i}:")
        print(f"   Content: {doc.text[:150]}...")
        print(f"   Relevance Score: {doc.relevancy}")
        print(f"   Document ID: {doc.id}")
        # Show metadata to confirm filtering worked
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"   Project: {doc.metadata.get('project', 'N/A')}")
            print(f"   Year: {doc.metadata.get('year', 'N/A')}")
            print(f"   Status: {doc.metadata.get('status', 'N/A')}")
            print(f"   Document Type: {doc.metadata.get('document_type', 'N/A')}")
        print()

except Exception as e:
    print(f"❌ Error querying pipeline: {e}")
    raise

Metadata Filter Syntax

  • Use the metadata. prefix for user-defined fields.
  • Provide values as arrays (even for single values).
  • Multiple values for the same key use OR logic.
  • Different keys use AND logic.

Example filter structure:

[
    { "metadata.project": ["apollo", "mercury"] },
    { "metadata.status": ["approved"] }
]

This filter returns documents from either apollo OR mercury projects that are also approved.

Step 7: Compare the Impact of Metadata

Let's see the dramatic difference metadata filtering makes. Here's what you might see:

Without Metadata Filters

Query: "What are our API rate limits?"

Results might include:
- Marketing blog post mentioning API limits
- Draft engineering spec with outdated limits
- Customer FAQ about rate limits
- Internal discussion about changing limits

With Metadata Filters

Query: "What are our API rate limits?"
Filters: project=apollo, status=approved

Results now include only:
- Approved apollo project documentation with current rate limits
- Official API specification documents from the apollo project

The filtered results are more accurate, authoritative, and relevant to your needs.

Using Visual Schema Editor (Optional)

For pipelines that use the Iris model, Vectorize includes a Visual Schema Editor that can automatically extract metadata based on defined schemas. This is especially useful when you have consistent document structures.

When to Use Automatic Metadata Extraction

  • Structured documents: Technical specs, contracts, reports with consistent sections
  • Standardized formats: Documents following templates
  • Large volumes: When manual metadata tagging isn't practical

Enabling Automatic Extraction

  1. Navigate to your pipeline in the Vectorize platform
  2. Click on the Schema tab
  3. Use the Visual Schema Editor to define extraction rules
  4. Save and redeploy your pipeline

The schema editor allows you to:

  • Define metadata fields to extract
  • Set extraction rules based on document structure
  • Preview extraction results
  • Combine with manual metadata

Best Practices for Metadata

1. Keep It Consistent

# Good: Consistent types and values
metadata = {
    "project": "apollo",    # Always lowercase
    "year": "2024",         # Always string
    "status": "approved"    # Consistent values
}

# Bad: Inconsistent types and formats
metadata = {
    "project": "Apollo",    # Sometimes capital (BAD!)
    "year": 2024,           # Sometimes number
    "status": "Approved"    # Inconsistent casing
}

2. Plan Your Schema

Before uploading documents, decide on:

  • Essential metadata fields (3-7 is usually optimal)
  • Allowed values for each field
  • Naming conventions (use lowercase with underscores)
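A small validation helper can enforce those decisions before anything is uploaded. This sketch assumes a schema dictionary like the METADATA_SCHEMA example shown earlier:

def validate_metadata(metadata: dict, schema: dict) -> None:
    """Raise ValueError if metadata uses unknown fields or disallowed values."""
    for key, value in metadata.items():
        if key not in schema:
            raise ValueError(f"Unknown metadata field: {key}")
        allowed = schema[key]
        # A schema entry of None means any string value is accepted
        if allowed is not None and value not in allowed:
            raise ValueError(f"Invalid value {value!r} for field {key!r}")

# Passes silently; raises if a field or value falls outside the schema
validate_metadata({"project": "apollo", "status": "approved"}, METADATA_SCHEMA)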

3. Use Metadata for Business Logic

# Filter for recent, approved documents in selected projects
filters = [
    {"metadata.project": ["apollo", "mercury"]},
    {"metadata.year": ["2024", "2025"]},
    {"metadata.status": ["approved", "published"]}
]

What's Next?

You've now built a metadata-enhanced RAG pipeline that can:

  • Process documents with rich context
  • Filter results based on business needs
  • Provide more accurate, relevant answers

Next Steps

  • For simple use cases: You're ready to deploy! Start uploading your documents with metadata.
  • For complex scenarios: Explore automatic metadata extraction for large document sets.

Quick Tips

  1. Start with 3-5 metadata fields and expand as needed
  2. Test your metadata filters with diverse queries
  3. Monitor which metadata fields provide the most value
  4. Consider combining manual and automatic metadata extraction
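For tip 2, a quick harness that runs the same question with and without filters makes it easy to spot where filtering changes the results. This sketch reuses the pipelines_api client and pipeline_id from Steps 5 and 6; the test questions and filters are illustrative:

test_cases = [
    ("What are the apollo project requirements?",
     [{"metadata.project": ["apollo"]}]),
    ("Show me approved documents from 2024",
     [{"metadata.status": ["approved"]}, {"metadata.year": ["2024"]}]),
]

for question, filters in test_cases:
    # Same question, once unfiltered and once with metadata filters
    unfiltered = pipelines_api.retrieve_documents(
        organization_id, pipeline_id,
        v.RetrieveDocumentsRequest(question=question, num_results=3)
    )
    filtered = pipelines_api.retrieve_documents(
        organization_id, pipeline_id,
        v.RetrieveDocumentsRequest(question=question, num_results=3,
                                   metadata_filters=filters)
    )
    print(f"{question}: {len(unfiltered.documents)} unfiltered "
          f"vs {len(filtered.documents)} filtered results")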

Congratulations! You've learned how to make your AI significantly smarter with metadata. Your RAG pipeline can now provide contextual, filtered responses that match your specific business needs.
