Understanding Metadata in RAG
Metadata is the additional information that describes and categorizes your documents, like authors, dates, or topics. When stored alongside vector embeddings in your database, metadata enables powerful filtering and improves search relevance. Vectorize supports both system metadata (automatically provided) and user-defined metadata, which we'll explore in detail below.
For example, if you have a technical document, its user-defined metadata might include:
- Document type: Technical specification
- Author: Engineering team
- Last updated: January 2025
- Product area: Authentication
Why Use Metadata?
The documents in your RAG pipeline carry contextual details such as authors, dates, and categories. Stored as metadata, these details help narrow search results to the most relevant sources.
For example, imagine you're building an AI assistant that needs to access:
- Product documentation
- Support tickets
- Release notes
- API documentation
- Discord conversations
- Intercom chat history
Without metadata, a search for "how to reset API key" would return semantically similar results across all these sources. With metadata, your assistant can filter for:
- Content type (documentation vs. support conversation)
- Status (resolved vs. unresolved issues)
- User role (admin vs. regular user questions)
- Product feature (authentication, billing, API)
- Issue priority (high, medium, low)
- Response rating (high vs. low customer satisfaction)
This means you can build precise queries like "find recent, resolved tickets about API key resets" or "show me highly-rated support responses about authentication for admin users."
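To make the idea concrete, here is a minimal Python sketch that filters semantic search hits by metadata, roughly matching "find recent, resolved tickets about API key resets". It runs on in-memory data only. The field names (`content_type`, `status`, `updated`), the scores, the sample hits, and the `filter_hits` helper are all hypothetical and not part of any Vectorize API; in practice the filter would typically be applied by the vector database itself.

```python
from datetime import date

# Hypothetical search hits: a similarity score plus metadata per chunk.
hits = [
    {"score": 0.91, "text": "Resetting your API key from the dashboard",
     "metadata": {"content_type": "documentation",
                  "updated": date(2024, 11, 2)}},
    {"score": 0.88, "text": "Customer could not reset API key",
     "metadata": {"content_type": "support_ticket", "status": "resolved",
                  "updated": date(2025, 1, 10)}},
    {"score": 0.86, "text": "API key reset request still failing",
     "metadata": {"content_type": "support_ticket", "status": "unresolved",
                  "updated": date(2025, 1, 12)}},
    {"score": 0.80, "text": "Old ticket about key rotation",
     "metadata": {"content_type": "support_ticket", "status": "resolved",
                  "updated": date(2023, 3, 5)}},
]


def filter_hits(hits, *, content_type, status, updated_since):
    """Keep hits whose metadata matches every condition, best score first."""
    kept = [
        h for h in hits
        if h["metadata"].get("content_type") == content_type
        and h["metadata"].get("status") == status
        and h["metadata"].get("updated", date.min) >= updated_since
    ]
    return sorted(kept, key=lambda h: h["score"], reverse=True)


# "Recent" is an assumption here: anything updated since July 2024.
recent_resolved = filter_hits(
    hits,
    content_type="support_ticket",
    status="resolved",
    updated_since=date(2024, 7, 1),
)
```

Only the resolved ticket from January 2025 survives the filter: the documentation page has the wrong content type, one ticket is unresolved, and the other is too old.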
Metadata types
Vectorize supports two types of metadata:
- System metadata: fields which are always provided in each document record in your vector index
- User-defined metadata: user-provided metadata associated with documents
System metadata
The following metadata fields are provided in each document record in your vector index.
Field | Description |
---|---|
text | The actual content being embedded |
chunk_id | Identifier for this specific chunk of text (e.g. "20" out of 181 total chunks). |
total_chunks | Total number of chunks this document was split into |
unique_source | Unique identifier combining origin_id and a hash of the content |
origin_id | UUID for the original document uploaded. |
filename | Original filename of the document |
source | Full path/identifier of the document in storage |
source_display_name | Human-readable source path |
origin | Method of document ingestion. This corresponds to the source connector used to extract data in your RAG pipeline. Examples: google-drive, file-upload, web-crawler, etc. |
metadata | User-defined metadata attributes |
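As a rough illustration, one record in the index might be modeled in Python as follows. The field names come from the table above, but the value types (for example, `chunk_id` as a string and `total_chunks` as an integer), the `chunk_position` helper, and every example value are assumptions made for this sketch; the actual formats in your vector database may differ.

```python
from typing import Any, TypedDict


class DocumentRecord(TypedDict):
    """Assumed shape of one chunk record, based on the system metadata table."""
    text: str
    chunk_id: str
    total_chunks: int
    unique_source: str
    origin_id: str
    filename: str
    source: str
    source_display_name: str
    origin: str
    metadata: dict[str, Any]


# Placeholder values for illustration only; none come from a real index.
record: DocumentRecord = {
    "text": "Example chunk text",
    "chunk_id": "20",
    "total_chunks": 181,
    "unique_source": "<origin_id plus content hash>",
    "origin_id": "<uuid of the uploaded document>",
    "filename": "example.pdf",
    "source": "<full storage path>",
    "source_display_name": "<human-readable path>",
    "origin": "file-upload",
    "metadata": {"genre": "comedy"},
}


def chunk_position(rec: DocumentRecord) -> str:
    """Describe where this chunk sits within its source document."""
    return f"chunk {rec['chunk_id']} of {rec['total_chunks']}"
```

Keeping user-defined attributes under the nested `metadata` key, as the table indicates, separates them from the system fields.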
User-defined metadata
For each document you process, you can optionally provide an associated metadata file containing content to store in your vector database. When querying, you can combine semantic search with metadata filters to get exactly the results you need.
Let's look at a simple example using a collection of documents containing Shakespeare's plays.
For each document, we'll create a JSON metadata file that describes the play. For example:
{
  "metadata": {
    "title": "A Midsummer Night's Dream",
    "genre": "comedy",
    "year_written": "1595",
    "setting": ["Athens", "Nearby Forest"],
    "characters": ["Puck", "Oberon", "Titania", "Lysander", "Hermia", "Demetrius", "Helena", "Bottom"],
    "number_of_acts": 5,
    "themes": ["love", "magic", "nature", "dreams", "mischief"],
    "famous_quotes": [
      "Lord, what fools these mortals be!",
      "The course of true love never did run smooth"
    ],
    "length": {
      "words": 16087,
      "lines": 2349
    }
  }
}
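A file like this could be generated with a short script. The sketch below uses only Python's standard library; the `build_metadata` helper and the output filename are hypothetical, and how the file is associated with its document depends on your pipeline configuration.

```python
import json
import tempfile
from pathlib import Path


def build_metadata(**fields):
    """Wrap user-defined fields under the top-level "metadata" key."""
    return {"metadata": fields}


doc = build_metadata(
    title="A Midsummer Night's Dream",
    genre="comedy",
    year_written="1595",
    themes=["love", "magic", "nature", "dreams", "mischief"],
    length={"words": 16087, "lines": 2349},
)

# Hypothetical filename, written to the system temp directory for the demo.
out_path = Path(tempfile.gettempdir()) / "midsummer.metadata.json"
out_path.write_text(json.dumps(doc, indent=2), encoding="utf-8")

# Read the file back to confirm it round-trips as valid JSON.
loaded = json.loads(out_path.read_text(encoding="utf-8"))
```

Reading the file back is a cheap check that the JSON is well formed before it enters the pipeline.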
In this example, the metadata allows you to build more precise queries for your AI application, like:
- Find all tragedies written before 1600
- Search for plays with themes of revenge
- Look for specific characters or settings
- Filter by genre while doing semantic search
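The first two queries above could be expressed as plain Python predicates over the metadata. This is only an in-memory sketch: the two extra plays and their approximate dates are abbreviated illustrations, the helper names are hypothetical, and the real filter syntax depends on the vector database behind your pipeline. Note that `year_written` is stored as a string in the example file, so it needs converting before a numeric comparison.

```python
# Abbreviated, illustrative metadata; dates for the extra plays are approximate.
plays = [
    {"title": "A Midsummer Night's Dream", "genre": "comedy",
     "year_written": "1595",
     "themes": ["love", "magic", "nature", "dreams", "mischief"]},
    {"title": "Romeo and Juliet", "genre": "tragedy",
     "year_written": "1595", "themes": ["love", "fate"]},
    {"title": "Hamlet", "genre": "tragedy",
     "year_written": "1600", "themes": ["revenge", "madness"]},
]


def tragedies_before(plays, year):
    """Titles of tragedies written strictly before the given year."""
    return [
        p["title"] for p in plays
        if p["genre"] == "tragedy" and int(p["year_written"]) < year
    ]


def with_theme(plays, theme):
    """Titles of plays whose themes list contains the given theme."""
    return [p["title"] for p in plays if theme in p["themes"]]


early_tragedies = tragedies_before(plays, 1600)
revenge_plays = with_theme(plays, "revenge")
```

Storing the year as a number rather than a string in the metadata file would avoid the `int()` conversion and may make range filters easier to express.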
What's next?
- Learn how to configure user-defined metadata for your indexes in Using User-defined Metadata, then use these fields for filtering in your AI application.
- Or, if you'd like to learn how to use a pipeline's retrieval endpoint, go to Using the Retrieval Endpoint for a RAG Pipeline.