Using User-defined Metadata
User-defined metadata enables you to associate custom fields of information with your documents by providing metadata files alongside your source documents.
Supported source connectors
User-defined metadata is available for the following source connectors:
- Amazon S3
- Azure Blob Storage
- Dropbox
- File Upload
- Google Cloud Storage
- Google Drive
Create User-defined Metadata
For each document you'd like to associate user-defined metadata with, create a file in valid JSON format containing the metadata.
The metadata fields must be nested under a top-level "metadata" key:
{
  "metadata": {
    "field-1": "value",
    "field-2": ["value1", "value2"]
  }
}
Note: For a particular metadata field name, the value should have a consistent type across all documents. For example, a metadata field named "id" should consistently be either an integer or a string, but not a mixture of both. Mixing types like 0001 and b7dc682c-7ef4-44b1-b922-4d43044b7f52 can cause issues with filtering and querying in some vector databases.
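As a rough illustration of the consistency rule in the note above, the following Python sketch reports fields whose JSON type differs between documents. The function names and the sample data are hypothetical and not part of Vectorize; the sketch assumes the metadata has already been parsed into dictionaries.

```python
from collections import defaultdict


def json_type(value):
    """Map a parsed Python value to the name of its JSON type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    return "null"


def find_mixed_type_fields(metadata_by_doc):
    """Return {field: sorted JSON type names} for fields whose type varies."""
    types_seen = defaultdict(set)
    for metadata in metadata_by_doc.values():
        for field, value in metadata.items():
            types_seen[field].add(json_type(value))
    return {f: sorted(t) for f, t in types_seen.items() if len(t) > 1}


# Hypothetical sample: "id" is a number in one document and a string in another.
docs = {
    "a.pdf": {"id": 1, "genre": "comedy"},
    "b.pdf": {"id": "b7dc682c-7ef4-44b1-b922-4d43044b7f52", "genre": "tragedy"},
}
print(find_mixed_type_fields(docs))  # {'id': ['number', 'string']}
```

Running a check like this before uploading makes it easier to spot a field that would be stored with inconsistent types.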
The filename must be exactly the same as the name of the document, with the addition of .metadata.json at the end. For example:
document.pdf → document.pdf.metadata.json
example.txt → example.txt.metadata.json
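The naming rule can be expressed as a small helper. This is a minimal sketch; `metadata_path_for` is a hypothetical name, and the only behaviour taken from this page is appending .metadata.json to the full document name, extension included.

```python
from pathlib import Path

METADATA_SUFFIX = ".metadata.json"


def metadata_path_for(document):
    """Return the metadata file path: the full document name plus .metadata.json."""
    document = Path(document)
    # Keep the original extension; the suffix is appended, not substituted.
    return document.with_name(document.name + METADATA_SUFFIX)


print(metadata_path_for("document.pdf").name)  # document.pdf.metadata.json
print(metadata_path_for("example.txt").name)   # example.txt.metadata.json
```

Note that `Path.with_suffix` would be the wrong tool here, because it replaces the existing extension rather than keeping it.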
Note: All metadata.json files must be in plain-text format. Binary or otherwise encoded files will not be processed correctly.
A different schema can be used for each document.
If metadata files are malformed, or if their names don't follow the naming format, the document (not just the metadata file) will be considered unprocessable and will be skipped.
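Because a malformed metadata file causes the whole document to be skipped, it can be worth checking files locally before upload. The sketch below is one possible pre-flight check, not Vectorize's own validation logic. `check_metadata_text` is a hypothetical helper, and treating "metadata" as a required JSON object is an assumption based on the format shown earlier.

```python
import json


def check_metadata_text(text):
    """Return a list of problems found in a metadata file's text (empty if none)."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(parsed, dict) or "metadata" not in parsed:
        return ['missing top-level "metadata" key']
    # Assumption: the value under "metadata" should be an object of fields.
    if not isinstance(parsed["metadata"], dict):
        return ['"metadata" must be a JSON object']
    return []


print(check_metadata_text('{"metadata": {"genre": "comedy"}}'))  # []
print(check_metadata_text('{"genre": "comedy"}'))
```

To check a file on disk, read it as UTF-8 text and pass the contents to the function.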
Database-specific Formatting
All supported vector databases except Pinecone fully support JSON data types, including nested JSON.
Pinecone only supports metadata fields of the following types:
- Boolean
- String
- Number
- Lists of strings
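To see which fields would fall outside the four types listed above, a check along these lines may help. It is a sketch based only on that list; `pinecone_incompatible_fields` is a hypothetical helper, and the sample metadata is made up for illustration.

```python
def pinecone_incompatible_fields(metadata):
    """Return names of fields whose values are outside the listed types."""

    def is_supported(value):
        # Boolean, string, and number values are accepted directly.
        if isinstance(value, (bool, str, int, float)):
            return True
        # A list is accepted only if every element is a string.
        return isinstance(value, list) and all(isinstance(v, str) for v in value)

    return [name for name, value in metadata.items() if not is_supported(value)]


# Hypothetical sample: the nested "length" object is not one of the listed types.
play = {
    "genre": "comedy",
    "number_of_acts": 5,
    "themes": ["love", "magic"],
    "length": {"words": 16087, "lines": 2349},
}
print(pinecone_incompatible_fields(play))  # ['length']
```

One common workaround for a flagged nested object is to flatten it into separate top-level fields, such as a word count field and a line count field.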
User-defined Metadata Example
Let's look at a simple example using a collection of Shakespeare's plays.
Suppose you have these files:
plays/
├── a-midsummer-nights-dream.pdf
├── hamlet.pdf
├── king-lear.pdf
├── macbeth.pdf
└── the-taming-of-the-shrew.pdf
For each file, we'll create a JSON metadata file with information about the play. For example:
{
  "metadata": {
    "title": "A Midsummer Night's Dream",
    "genre": "comedy",
    "year_written": "1595",
    "setting": ["Athens", "Nearby Forest"],
    "characters": ["Puck", "Oberon", "Titania", "Lysander", "Hermia", "Demetrius", "Helena", "Bottom"],
    "number_of_acts": 5,
    "themes": ["love", "magic", "nature", "dreams", "mischief"],
    "famous_quotes": [
      "Lord, what fools these mortals be!",
      "The course of true love never did run smooth"
    ],
    "length": {
      "words": 16087,
      "lines": 2349
    }
  }
}
In this example, the metadata allows you to build more precise queries for your AI application, like:
- Find all tragedies written before 1600
- Search for plays with themes of revenge
- Look for specific characters or settings
- Filter by genre while doing semantic search
The filename must be exactly the same as the name of the document, with the addition of .metadata.json at the end.
Our resulting file list looks like this:
plays/
├── a-midsummer-nights-dream.pdf
├── a-midsummer-nights-dream.pdf.metadata.json
├── hamlet.pdf
├── hamlet.pdf.metadata.json
├── king-lear.pdf
├── king-lear.pdf.metadata.json
├── macbeth.pdf
├── macbeth.pdf.metadata.json
├── the-taming-of-the-shrew.pdf
└── the-taming-of-the-shrew.pdf.metadata.json
All of these files, both the original documents and the JSON files, must be provided to your RAG pipeline.
Walkthrough Using Vectorize
We'll walk through this example using the File Upload connector. If you're not familiar with RAG pipelines, check out Pipeline Basics.
- Create a new RAG pipeline. For this example, we'll use Pinecone and OpenAI.
- Add a File Upload connector to your pipeline, and specify both the plays and the metadata files. Click Confirm Selection when done.
- Schedule your RAG pipeline as normal.
- Once your pipeline populates your database, you'll be able to see the metadata in your index. This screenshot shows metadata we provided in a document record in a Pinecone index.
- To access user-defined metadata using your retrieval endpoint, filter the same way you would for other types of metadata, making sure to prefix each field name with "metadata." (for example, "metadata.genre"). For example:
curl -L \
  -H "Content-Type: application/json" \
  -H "Authorization: <TOKEN>" \
  -d '{
    "question": "Which plays are tragedies?",
    "numResults": 5,
    "metadata-filters": [
      {
        "metadata.genre": ["tragedy"]
      }
    ]
  }' \
  "https://client.app.vectorize.io/api/gateways/service/o9bf-57d2db385ba2/p24a5ecc3/retrieve"
Note that if you add a metadata file for a document that has already been processed, the document will be processed again so that the metadata can be added.
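For applications that call the retrieval endpoint from code, the curl request in the walkthrough could be built in Python roughly as follows. This is a sketch that mirrors the request body, headers, and URL shown in that example. `build_retrieval_request` is a hypothetical helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Endpoint copied from the curl example; replace it with your own pipeline's URL.
RETRIEVAL_URL = (
    "https://client.app.vectorize.io/api/gateways/service/"
    "o9bf-57d2db385ba2/p24a5ecc3/retrieve"
)


def build_retrieval_request(question, field, values, token, num_results=5):
    """Build (but do not send) a POST request with one metadata filter."""
    body = {
        "question": question,
        "numResults": num_results,
        # User-defined field names are prefixed with "metadata."
        "metadata-filters": [{f"metadata.{field}": values}],
    }
    return urllib.request.Request(
        RETRIEVAL_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )


request = build_retrieval_request(
    "Which plays are tragedies?", "genre", ["tragedy"], "<TOKEN>"
)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; the shape of the response is not covered on this page, so the sketch stops before that step.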
Troubleshooting
If the metadata you specified is not showing up in your vector index:
- Is the filename correct? Each metadata filename must be exactly the same as the name of the document, with the addition of .metadata.json at the end.
- Is the JSON properly formatted?
- Check the document/chunk in your vector database to ensure it's formatted as expected.
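The filename check in the list above can be automated with a short script. The sketch below takes a flat list of filenames and reports documents without a metadata file as well as metadata files whose name matches no document. `check_pairs` is a hypothetical helper. A document without metadata is not necessarily an error, but a metadata file that matches no document usually points to a naming mistake.

```python
SUFFIX = ".metadata.json"


def check_pairs(filenames):
    """Return (documents without metadata, metadata files without a document)."""
    names = set(filenames)
    metadata_files = {n for n in names if n.endswith(SUFFIX)}
    documents = names - metadata_files
    missing = sorted(d for d in documents if d + SUFFIX not in metadata_files)
    # Strip the suffix to recover the document name each metadata file expects.
    orphaned = sorted(
        m for m in metadata_files if m[: -len(SUFFIX)] not in documents
    )
    return missing, orphaned


# Hypothetical listing: the second metadata file dropped the ".pdf" part.
files = [
    "hamlet.pdf",
    "hamlet.pdf.metadata.json",
    "macbeth.pdf",
    "macbeth.metadata.json",
]
print(check_pairs(files))  # (['macbeth.pdf'], ['macbeth.metadata.json'])
```

For a local folder, the filename list could come from `os.listdir`; for cloud connectors, it would come from whatever listing your storage provider offers.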