Understanding Metadata in RAG

Metadata is the additional information that describes and categorizes your documents, like authors, dates, or topics. When stored alongside vector embeddings in your database, metadata enables powerful filtering and improves search relevance. Vectorize supports system metadata (automatically provided), user-defined metadata, and automatic metadata extracted from your documents, all of which we'll explore in detail below.

For example, if you have a technical document, its user-defined metadata might include:

Document type: Technical specification
Author: Engineering team
Last updated: January 2025
Product area: Authentication

Why Use Metadata?

When your RAG pipeline processes documents, details such as authors, dates, or categories help narrow down search results and tell your application where each result came from.

For example, imagine you're building an AI assistant that needs to access:

Product documentation
Support tickets
Release notes
API documentation
Discord conversations
Intercom chat history

Without metadata, a search for "how to reset API key" would return semantically similar results across all these sources. With metadata, your assistant can filter for:

Content type (documentation vs. support conversation)
Status (resolved vs. unresolved issues)
User role (admin vs. regular user questions)
Product feature (authentication, billing, API)
Issue priority (high, medium, low)
Response rating (high vs. low customer satisfaction)

This means you can build precise queries like "find recent, resolved tickets about API key resets" or "show me highly-rated support responses about authentication for admin users."

Metadata types

Vectorize supports three types of metadata:

System metadata: fields which are always provided in each document record in your vector index
User-defined metadata: user-provided metadata associated with documents
Automatic metadata: structured information extracted from documents using defined schemas

System metadata

The following metadata fields are provided in each document record in your vector index.

Field	Description
text	The actual content being embedded
chunk_id	Identifier for this specific chunk of text (e.g. "20" out of 181 total chunks).
total_chunks	Total number of chunks this document was split into
unique_source	Unique identifier combining origin_id and a hash of the content
origin_id	UUID for the original document uploaded.
filename	Original filename of the document
source	Full path/identifier of the document in storage
source_display_name	Human-readable source path
origin	Method of document ingestion. This corresponds to the source connector used to extract data in your RAG pipeline. Examples: google-drive, file-upload, web-crawler, etc.
metadata	User-defined metadata attributes
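
As a rough illustration of how these fields can be used together, the sketch below groups retrieved chunks by origin_id and orders them by chunk_id to rebuild each document's text. The records are hypothetical: the field names come from the table above, but the values are invented, and treating chunk_id as a numeric string is an assumption based on the "20" example.

```python
from collections import defaultdict
from typing import Any

# Hypothetical records: field names follow the table above, values are invented.
records: list[dict[str, Any]] = [
    {"origin_id": "doc-a", "chunk_id": "2", "total_chunks": 2, "text": "second part"},
    {"origin_id": "doc-b", "chunk_id": "1", "total_chunks": 1, "text": "only part"},
    {"origin_id": "doc-a", "chunk_id": "1", "total_chunks": 2, "text": "first part"},
]


def reassemble(records: list[dict[str, Any]]) -> dict[str, str]:
    """Group chunks by origin_id and join their text in chunk_id order."""
    by_origin: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for record in records:
        by_origin[record["origin_id"]].append(record)

    documents: dict[str, str] = {}
    for origin_id, chunks in by_origin.items():
        # Assumption: chunk_id is a numeric string such as "20".
        chunks.sort(key=lambda chunk: int(chunk["chunk_id"]))
        documents[origin_id] = " ".join(chunk["text"] for chunk in chunks)
    return documents


documents = reassemble(records)
```

Comparing the number of chunks collected for an origin_id against total_chunks is one way to tell whether a query returned the whole document or only part of it.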

User-defined metadata

For each document you process, you can optionally provide an associated metadata file containing content to store in your vector database. When querying, you can combine semantic search with metadata filters to get exactly the results you need.

Let's look at a simple example using a collection of documents containing Shakespeare's plays.

Automatic metadata

Automatic metadata extraction uses Vectorize's Iris model to analyze documents and extract structured information based on defined schemas. This allows you to automatically extract metadata without manually creating metadata files.

There are two types of automatic metadata:

Document metadata: High-level information extracted from the entire document (title, author, document type, etc.)
Section metadata: Specific information extracted from individual chunks (part numbers, prices, technical details, etc.)

Automatic metadata is stored in the vector database record under:

document_metadata field for document-level metadata
chunk_metadata field for section-level metadata
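
A minimal sketch of reading both fields from a single record follows. It assumes each field holds a flat dictionary, which this page does not confirm; the keys inside them (title, part_number, and so on) are hypothetical and depend on the schema you define.

```python
from typing import Any


def automatic_metadata(record: dict[str, Any]) -> dict[str, Any]:
    """Merge document-level and section-level automatic metadata.

    Section-level values win when both levels define the same key.
    """
    merged = dict(record.get("document_metadata") or {})
    merged.update(record.get("chunk_metadata") or {})
    return merged


# Hypothetical record: only the two field names come from the list above.
record = {
    "text": "Pump P-100 is priced at 19.99.",
    "document_metadata": {"title": "Pump Catalog", "document_type": "catalog"},
    "chunk_metadata": {"part_number": "P-100", "price": 19.99},
}

combined = automatic_metadata(record)
```

Letting section-level values override document-level ones is a design choice made for this sketch. Keep the two dictionaries separate if your schemas reuse key names with different meanings.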

For more details, see Automatic Metadata Extraction.

Returning to the Shakespeare example for user-defined metadata: for each play, we'll create a metadata file whose JSON describes it. For example:

    {
        "metadata" : {
            "title": "A Midsummer Night's Dream",
            "genre": "comedy",
            "year_written": "1595",
            "setting": ["Athens", "Nearby Forest"],
            "characters": ["Puck", "Oberon", "Titania", "Lysander", "Hermia", "Demetrius", "Helena", "Bottom"],
            "number_of_acts": 5,
            "themes": ["love", "magic", "nature", "dreams", "mischief"],
            "famous_quotes": [
                "Lord, what fools these mortals be!",
                "The course of true love never did run smooth"
            ],
            "length": {
                "words": 16087,
                "lines": 2349
            }
        }
    }

In this example, the metadata allows you to build more precise queries for your AI application, like:

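Find all tragedies written before 1600 (the date condition is a range query, which the retrieval endpoint does not support; see Metadata Query Limitations)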
Search for plays with themes of revenge
Look for specific characters or settings
Filter by genre while doing semantic search
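
If you have many plays, you may prefer to generate metadata files like the one in this example rather than write them by hand. The sketch below uses only Python's standard library and wraps the fields in the top-level "metadata" key shown above. The file name pattern is a placeholder chosen for this sketch, not a documented Vectorize convention.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def write_metadata_file(directory: Path, document_name: str, fields: dict[str, Any]) -> Path:
    """Write the fields under a top-level "metadata" key and return the file path."""
    # Placeholder naming scheme; use whatever your source connector expects.
    path = directory / f"{document_name}.metadata.json"
    path.write_text(json.dumps({"metadata": fields}, indent=4), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = write_metadata_file(
        Path(tmp),
        "a-midsummer-nights-dream",
        {
            "title": "A Midsummer Night's Dream",
            "genre": "comedy",
            "year_written": "1595",
            "themes": ["love", "magic", "nature", "dreams", "mischief"],
        },
    )
    # Read the file back to confirm it round-trips as valid JSON.
    loaded = json.loads(path.read_text(encoding="utf-8"))
```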

Metadata Query Limitations

When querying metadata in Vectorize, it's important to understand the available filtering capabilities:

Retrieval Endpoint Limitations

The retrieval endpoint in Vectorize only supports exact match filtering for metadata. This means:

You can filter for documents where a metadata field equals a specific value
You can filter for documents where a metadata field matches any value in a list (using OR logic)
You cannot use range queries, partial matches, or complex operators
Filtering is case-sensitive and requires precise matching

For example, you can filter for documents where genre = "comedy" or where genre is either "comedy" OR "tragedy" using a list of values.

When using multiple filter keys with lists of values, the logical operation is:

AND logic between different keys
OR logic between values for the same key

For example, with filters like:

    "metadata-filters": [
      { "key1": ["a", "b"] },
      { "key2": ["c", "d"] }
    ]

This translates to: (key1 == "a" OR key1 == "b") AND (key2 == "c" OR key2 == "d")
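
To make these semantics concrete, here is a small local model of the matching rules in plain Python. It does not call Vectorize. build_filters and matches are hypothetical helpers that reproduce the AND-between-keys, OR-within-a-key behavior described above, using exact, case-sensitive equality.

```python
from typing import Any


def build_filters(criteria: dict[str, list[Any]]) -> list[dict[str, list[Any]]]:
    """Turn {key: [values]} into a list of single-key objects, as in the example above."""
    return [{key: list(values)} for key, values in criteria.items()]


def matches(metadata: dict[str, Any], filters: list[dict[str, list[Any]]]) -> bool:
    """Return True when the metadata satisfies every filter entry.

    AND across entries, OR across the values listed for a key.
    Comparison is exact and case-sensitive.
    """
    return all(
        key in metadata and metadata[key] in values
        for entry in filters
        for key, values in entry.items()
    )


filters = build_filters({
    "genre": ["comedy", "tragedy"],
    "title": ["Hamlet", "A Midsummer Night's Dream"],
})

exact = matches({"genre": "comedy", "title": "A Midsummer Night's Dream"}, filters)
wrong_case = matches({"genre": "Comedy", "title": "A Midsummer Night's Dream"}, filters)
wrong_title = matches({"genre": "comedy", "title": "Macbeth"}, filters)
```

Here exact is True, while wrong_case and wrong_title are both False: the first fails on letter case, the second because no listed title matches. How the endpoint treats list-valued metadata fields such as themes is not covered by this sketch.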

However, you cannot filter for documents where year_written > 1600 as range queries are not supported.
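
One possible workaround, if you stay on the retrieval endpoint, is to apply the range condition in your own code after the results come back. The sketch below assumes each result carries its user-defined attributes under the metadata field and that year_written is stored as a string, as in the Shakespeare example. The result list is hypothetical and its values are illustrative.

```python
from typing import Any


def written_after(results: list[dict[str, Any]], year: int) -> list[dict[str, Any]]:
    """Keep results whose year_written is a number greater than `year`."""
    kept = []
    for result in results:
        raw = result.get("metadata", {}).get("year_written")
        try:
            if int(raw) > year:
                kept.append(result)
        except (TypeError, ValueError):
            # Missing or non-numeric year: the condition cannot be checked, so skip it.
            continue
    return kept


# Hypothetical results, shaped like records with a user-defined "metadata" field.
results = [
    {"text": "chunk from a comedy", "metadata": {"title": "A Midsummer Night's Dream", "year_written": "1595"}},
    {"text": "chunk from a romance", "metadata": {"title": "The Tempest", "year_written": "1611"}},
    {"text": "chunk with no year", "metadata": {"title": "Untitled fragment"}},
]

later_plays = written_after(results, 1600)
```

Because this filter runs after retrieval, it can leave you with fewer results than you requested, so you may need to ask for more results up front.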

Built-in Vector Database vs. Bring Your Own Vector Database

Vectorize offers two options for vector storage:

Built-in Vector Database:
- Limited to the retrieval endpoint's capabilities
- Only supports exact match filtering for metadata
- Simplifies deployment and management

Bring Your Own Vector Database (BYO):
- When using the retrieval endpoint, has the same exact match filtering limitation
- Allows direct native querying of the vector database with more advanced options
- Advanced query capabilities vary by database type (e.g., Pinecone, Weaviate, Qdrant)

If you need more advanced metadata filtering capabilities (like range queries, fuzzy matching, or complex operators), you can:

Use a BYO vector database option
Query your vector database directly using its native API
Take advantage of the specific advanced filtering capabilities of your chosen database

What's next?

Learn how to configure user-defined metadata for your indexes in Using User-defined Metadata, then use these fields for filtering in your AI application.
Explore how to automatically extract structured metadata from your documents in Automatic Metadata Extraction.
If you'd like to learn how to use a pipeline's retrieval endpoint, go to Using the Retrieval Endpoint for a RAG Pipeline.
