Using User-defined Metadata
User-defined metadata enables you to associate custom fields of information with your documents by providing metadata files alongside your source documents.
Supported source connectors
Amazon S3
Azure Blob Storage
Dropbox
File Upload
Google Cloud Storage
Google Drive
Create User-defined Metadata
For each document you'd like to associate user-defined metadata with, create a file in valid JSON format containing the metadata.
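The metadata fields must be nested inside the top-level "metadata" key:
The filename must be exactly the same as the name of the document, with the addition of .metadata.json at the end. For example, the metadata file for a document named report.pdf would be named report.pdf.metadata.json.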
A different schema can be used for each document.
If metadata files are malformed, or if their names don't follow the naming format, the document (not just the metadata file) will be considered unprocessable and will be skipped.
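As a minimal sketch of the two rules above, the following Python helpers derive the metadata filename from a document's name and write the fields under the "metadata" key. The function names and the sample field are hypothetical, and the sketch assumes the document's full name, including its extension, is kept before .metadata.json.

```python
import json
from pathlib import Path


def metadata_path_for(document: Path) -> Path:
    """Return the metadata file path: the full document name plus '.metadata.json'."""
    return document.with_name(document.name + ".metadata.json")


def write_metadata(document: Path, fields: dict) -> Path:
    """Write the fields under the top-level "metadata" key as valid JSON."""
    target = metadata_path_for(document)
    # json.dumps always produces well-formed JSON, which avoids malformed files.
    target.write_text(json.dumps({"metadata": fields}, indent=2), encoding="utf-8")
    return target
```

Generating the files this way, rather than writing them by hand, is one way to reduce the risk of a malformed file causing its document to be skipped.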
Database-specific Formatting
All vector databases fully support JSON datatypes including nested JSON, except for Pinecone.
Pinecone only supports metadata fields of the following types:
Boolean
String
Numbers
Lists of strings
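If you're targeting Pinecone, it can help to check your fields against this list before uploading. The sketch below is a hypothetical local check, not part of Vectorize or the Pinecone client; it simply flags any field whose value is not a boolean, string, number, or list of strings.

```python
def pinecone_incompatible_fields(fields: dict) -> list[str]:
    """Return the names of fields whose values fall outside the supported types."""
    bad = []
    for name, value in fields.items():
        # Booleans, strings, and numbers are accepted as-is.
        if isinstance(value, (bool, str, int, float)):
            continue
        # Lists are accepted only when every item is a string.
        if isinstance(value, list) and all(isinstance(item, str) for item in value):
            continue
        bad.append(name)
    return bad
```

For example, a nested object such as {"author": {"name": "..."}} would be reported, so you could flatten it into separate top-level fields before using it with Pinecone.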
User-defined Metadata Example
Let's look at a simple example using a collection of Shakespeare's plays.
Suppose you have these files:
For each file, we'll create a JSON metadata file that describes the play. For example:
In this example, the metadata allows you to build more precise queries for your AI application, like:
Find all tragedies written before 1600
Search for plays with themes of revenge
Look for specific characters or settings
Filter by genre while doing semantic search
The filename must be exactly the same as the name of the document, with the addition of .metadata.json at the end.
Our resulting file list looks like this:
All of these files, both the original documents and the JSON files, must be provided to your RAG pipeline.
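The steps in this example can be sketched in Python as follows. The file names, field names, and values here are hypothetical stand-ins chosen for illustration; they are not the exact files or schema used in this guide.

```python
import json

# Hypothetical documents and metadata fields, for illustration only.
plays = {
    "hamlet.txt": {
        "title": "Hamlet",
        "genre": "tragedy",
        "setting": "Denmark",
        "themes": ["revenge", "madness"],
    },
    "macbeth.txt": {
        "title": "Macbeth",
        "genre": "tragedy",
        "setting": "Scotland",
        "themes": ["ambition", "guilt"],
    },
}


def metadata_files(documents: dict) -> dict:
    """Map each metadata filename to its JSON text, wrapped in the "metadata" key."""
    return {
        f"{name}.metadata.json": json.dumps({"metadata": fields}, indent=2)
        for name, fields in documents.items()
    }


# Everything that has to be provided to the pipeline: documents plus metadata files.
files_to_provide = sorted(list(plays) + list(metadata_files(plays)))
```

Note that every value above is a string or a list of strings, which keeps the metadata compatible with Pinecone.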
Walkthrough Using Vectorize
We'll walk through this example using the File Upload connector. If you're not familiar with RAG pipelines, check out Pipeline Basics.
Create a new RAG pipeline. For this example, we'll use Pinecone and OpenAI.
Specify the metadata fields you'd like to filter on using your Vectorize retrieval endpoint when configuring your vector database. Make sure to enter "metadata." before each field. For more information about retrieval endpoints, see Using the Retrieval Endpoint.
Add a File Upload connector to your pipeline, and specify both the plays and the metadata files. Click Confirm Selection when done.
Schedule your RAG pipeline as normal.
Once your pipeline populates your database, you'll be able to see the metadata in your index. This screenshot shows metadata we provided in a document record in a Pinecone index.
To access user-defined metadata using your retrieval endpoint, filter the same way you would for other types of metadata, making sure to enter "metadata." before each field. For example:
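As a minimal sketch of the naming rule only, the helper below adds the "metadata." prefix to each field name in a dictionary of filters. The function name and the dictionary shape are assumptions made for illustration; the actual request format of the retrieval endpoint is described in Using the Retrieval Endpoint and is not reproduced here.

```python
def prefixed_filter(filters: dict) -> dict:
    """Prefix each user-defined field name with 'metadata.' unless it already has it."""
    prefix = "metadata."
    return {
        (name if name.startswith(prefix) else prefix + name): value
        for name, value in filters.items()
    }
```

With this helper, a field you defined as genre would be referenced as metadata.genre when filtering.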
Note that adding a metadata file for a document that has already been processed triggers reprocessing of that document so that the metadata can be added.
Troubleshooting
If the metadata you specified is not showing up in your vector index:
Is the filename correct? Each metadata filename must be exactly the same as the name of the document, with the addition of .metadata.json at the end.
Is the JSON properly formatted?
Check the document/chunk in your vector database to ensure it's formatted as expected.
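The first two checks can be automated before you upload anything. The following is a hypothetical local script, not a Vectorize tool; it assumes your documents and metadata files sit together in one folder, and it reports metadata files that have no matching document, contain invalid JSON, or lack the top-level "metadata" key.

```python
import json
from pathlib import Path

SUFFIX = ".metadata.json"


def check_metadata_files(folder: Path) -> list[str]:
    """Return human-readable problems found in the folder's metadata files."""
    problems = []
    names = {p.name for p in folder.iterdir() if p.is_file()}
    for name in sorted(names):
        if not name.endswith(SUFFIX):
            continue
        # The document name is the metadata filename minus the suffix.
        document = name[: -len(SUFFIX)]
        if document not in names:
            problems.append(f"{name}: no document named {document!r}")
        try:
            parsed = json.loads((folder / name).read_text(encoding="utf-8"))
        except json.JSONDecodeError as err:
            problems.append(f"{name}: invalid JSON ({err.msg})")
            continue
        if not isinstance(parsed, dict) or "metadata" not in parsed:
            problems.append(f'{name}: missing top-level "metadata" key')
    return problems
```

An empty list means the naming and formatting checks passed; the last step, inspecting the stored document or chunk, still has to be done in your vector database.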