
Using User-defined Metadata

User-defined metadata enables you to associate custom fields of information with your documents by providing metadata files alongside your source documents.

Supported source connectors

User-defined metadata is available with the following source connectors:

  • Amazon S3
  • Azure Blob Storage
  • Dropbox
  • File Upload
  • Google Cloud Storage
  • Google Drive

Create User-defined Metadata

For each document you'd like to associate user-defined metadata with, create a file in valid JSON format containing the metadata.

The metadata fields must be nested under the top-level "metadata" key:

    {
      "metadata": {
        "field-1": "value",
        "field-2": ["value1", "value2"]
      }
    }

Note: For a particular metadata field name, the value should have a consistent type across all documents. For example, a metadata field named "id" should consistently be either an integer or a string, but not a mixture of both. Mixing types like 0001 and b7dc682c-7ef4-44b1-b922-4d43044b7f52 can cause issues with filtering and querying in some vector databases.
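One way to catch inconsistent types before uploading is to scan your metadata files locally. The sketch below is a hypothetical helper, not part of Vectorize: `find_mixed_type_fields` is a name invented for illustration, and it assumes the files are UTF-8 JSON and only inspects top-level fields under "metadata".

```python
import json
from collections import defaultdict
from pathlib import Path


def find_mixed_type_fields(metadata_dir):
    """Return {field_name: set_of_type_names} for fields seen with more than one type.

    Hypothetical local check: only top-level fields under "metadata" are inspected.
    """
    seen = defaultdict(set)
    for path in sorted(Path(metadata_dir).rglob("*.metadata.json")):
        with open(path, encoding="utf-8") as f:
            fields = json.load(f)["metadata"]
        for name, value in fields.items():
            # Record the Python type that the JSON value was parsed into.
            seen[name].add(type(value).__name__)
    return {name: types for name, types in seen.items() if len(types) > 1}
```

Note that this check treats `1` and `1.0` as different types (int and float). Whether that distinction matters depends on your vector database.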

The filename must be exactly the same as the name of the document, with the addition of .metadata.json at the end. For example:

document.pdf → document.pdf.metadata.json
example.txt → example.txt.metadata.json
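If you generate metadata files with a script, the naming rule can be applied mechanically. Here is a minimal sketch; the function names `metadata_path_for` and `write_metadata` are assumptions made for this example and are not part of any Vectorize SDK.

```python
import json
from pathlib import Path


def metadata_path_for(document_path):
    """Return the metadata path: the full document name plus '.metadata.json'."""
    document_path = Path(document_path)
    # Append to the complete filename, keeping the original extension.
    return document_path.with_name(document_path.name + ".metadata.json")


def write_metadata(document_path, fields):
    """Write `fields` under the top-level "metadata" key as plain-text JSON."""
    target = metadata_path_for(document_path)
    target.write_text(
        json.dumps({"metadata": fields}, indent=2, ensure_ascii=False),
        encoding="utf-8",  # UTF-8 is an assumption for "plain text"
    )
    return target
```

For example, `metadata_path_for("document.pdf")` yields `document.pdf.metadata.json`, matching the first mapping above.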

Note: All metadata.json files must be in plain-text format. Binary or otherwise encoded files will not be processed correctly.

A different schema can be used for each document.

If metadata files are malformed, or if their names don't follow the naming format, the document (not just the metadata file) will be considered unprocessable and will be skipped.
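Because a bad metadata file causes its document to be skipped, it can be worth checking a folder locally before uploading. The following is a rough sketch under stated assumptions: `check_metadata_files` is a hypothetical helper, it only looks at one directory level, and it treats "plain text" as UTF-8. It does not reproduce the checks the service itself performs.

```python
import json
from pathlib import Path

SUFFIX = ".metadata.json"


def check_metadata_files(directory):
    """Return a list of (filename, problem) tuples for metadata files in `directory`."""
    problems = []
    directory = Path(directory)
    for meta in sorted(directory.glob("*" + SUFFIX)):
        # The document name is the metadata filename with the suffix removed.
        document = directory / meta.name[: -len(SUFFIX)]
        if not document.is_file():
            problems.append((meta.name, "no matching document"))
        try:
            parsed = json.loads(meta.read_text(encoding="utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            problems.append((meta.name, "not valid UTF-8 JSON"))
            continue
        if not isinstance(parsed, dict) or not isinstance(parsed.get("metadata"), dict):
            problems.append((meta.name, 'missing top-level "metadata" object'))
    return problems
```

An empty list means every metadata file has a matching document, parses as JSON, and contains a "metadata" object.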

Database-specific Formatting

Every vector database except Pinecone fully supports JSON data types, including nested JSON.

Pinecone only supports metadata fields of the following types:

  • Boolean
  • String
  • Numbers
  • Lists of strings
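If you target Pinecone, you may want to flag fields that fall outside these four types before running your pipeline. This is a small illustrative sketch; `pinecone_incompatible_fields` is a hypothetical helper that encodes only the list above, not Pinecone's own validation logic.

```python
def pinecone_incompatible_fields(fields):
    """Return the names of fields whose values are outside the four listed types."""

    def is_supported(value):
        # Boolean, string, and number values are accepted as-is.
        if isinstance(value, (bool, str, int, float)):
            return True
        # Lists are accepted only when every item is a string.
        if isinstance(value, list):
            return all(isinstance(item, str) for item in value)
        # Nested objects, nulls, and anything else are flagged.
        return False

    return [name for name, value in fields.items() if not is_supported(value)]


example = {
    "genre": "comedy",
    "number_of_acts": 5,
    "setting": ["Athens", "Nearby Forest"],
    "length": {"words": 16087, "lines": 2349},
}
print(pinecone_incompatible_fields(example))  # ['length']
```

Here the nested "length" object is flagged, while the string, number, and list-of-strings fields pass.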

User-defined Metadata Example

Let's look at a simple example using a collection of Shakespeare's plays.

Suppose you have these files:

plays/
├── a-midsummer-nights-dream.pdf
├── hamlet.pdf
├── king-lear.pdf
├── macbeth.pdf
└── the-taming-of-the-shrew.pdf

For each file, we'll create a JSON metadata file that describes the play. For example:

    {
      "metadata": {
        "title": "A Midsummer Night's Dream",
        "genre": "comedy",
        "year_written": "1595",
        "setting": ["Athens", "Nearby Forest"],
        "characters": ["Puck", "Oberon", "Titania", "Lysander", "Hermia", "Demetrius", "Helena", "Bottom"],
        "number_of_acts": 5,
        "themes": ["love", "magic", "nature", "dreams", "mischief"],
        "famous_quotes": [
          "Lord, what fools these mortals be!",
          "The course of true love never did run smooth"
        ],
        "length": {
          "words": 16087,
          "lines": 2349
        }
      }
    }

In this example, the metadata lets you build more precise queries for your AI application using the retrieval endpoint's exact-match filtering. For instance, you can:

  • Find all plays with the genre "tragedy"
  • Search for plays with the theme "revenge"
  • Look for plays with specific characters like "Hamlet" or "Othello"
  • Filter by multiple genres (e.g., "comedy" OR "history") while doing semantic search

Each metadata file takes the exact name of its document with .metadata.json added at the end, so our resulting file list looks like this:

plays/
├── a-midsummer-nights-dream.pdf
├── a-midsummer-nights-dream.pdf.metadata.json
├── hamlet.pdf
├── hamlet.pdf.metadata.json
├── king-lear.pdf
├── king-lear.pdf.metadata.json
├── macbeth.pdf
├── macbeth.pdf.metadata.json
├── the-taming-of-the-shrew.pdf
└── the-taming-of-the-shrew.pdf.metadata.json

All of these files—the original documents as well as the JSON files—must be provided to your RAG pipeline.

Walkthrough Using Vectorize

We'll walk through this example using the File Upload connector. If you're not familiar with RAG pipelines, check out Pipeline Basics.

  1. Create a new RAG pipeline. For this example, we'll use Pinecone and OpenAI.

Pinecone and OpenAI Configuration

  2. Add a File Upload connector to your pipeline, and select both the plays and their metadata files. Click Confirm Selection when done.

File Upload Configuration

  3. Schedule your RAG pipeline as normal.

  4. Once your pipeline populates your database, you'll be able to see the metadata in your index. This screenshot shows the metadata we provided on a document record in a Pinecone index.

Metadata in Pinecone Index

  5. To access user-defined metadata through your retrieval endpoint, filter the same way you would for other types of metadata, prefixing each field name with "metadata." (as in "metadata.genre"). For example:

    curl -L \
      -H "Content-Type: application/json" \
      -H "Authorization: <TOKEN>" \
      -d '{
        "question": "Which plays are tragedies?",
        "numResults": 5,
        "metadata-filters": [
          {
            "metadata.genre": ["tragedy"]
          }
        ]
      }' \
      "https://client.app.vectorize.io/api/gateways/service/o9bf-57d2db385ba2/p24a5ecc3/retrieve"

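If you'd rather call the endpoint from code, the same request can be assembled with Python's standard library. This is a sketch that mirrors the curl call above and nothing more: `build_retrieval_request` is a hypothetical helper, the endpoint URL and token are whatever your own pipeline provides, and the shape of the response is not covered here.

```python
import json
import urllib.request


def build_retrieval_request(endpoint_url, token, question, filters, num_results=5):
    """Build (but do not send) a POST request with the same body as the curl example."""
    payload = {
        "question": question,
        "numResults": num_results,
        "metadata-filters": filters,
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and read the JSON body from the response. The filter argument for the example above would be `[{"metadata.genre": ["tragedy"]}]`.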
Filtering with Multiple Values

The retrieval endpoint supports exact match filtering with lists of values. When using multiple filter keys with lists of values, the logical operation is:

  • AND logic between different keys
  • OR logic between values for the same key

For example, this filter finds documents that are tragedies or comedies and that were also written in 1595 or 1596:

    "metadata-filters": [
      { "metadata.genre": ["tragedy", "comedy"] },
      { "metadata.year_written": ["1595", "1596"] }
    ]

This translates to: (genre == "tragedy" OR genre == "comedy") AND (year_written == "1595" OR year_written == "1596")
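To make the AND/OR semantics concrete, here is a local re-implementation of the rule described above. It is only an illustration and not the retrieval endpoint's actual code: `matches_filters` is a hypothetical function, and treating a list-valued field (such as "themes") as a match when any of its items is listed is an assumption.

```python
def matches_filters(metadata, filters):
    """Return True if `metadata` satisfies every filter key (AND),
    where a key is satisfied by any one of its listed values (OR)."""
    prefix = "metadata."
    for entry in filters:
        for key, allowed in entry.items():
            field = key[len(prefix):] if key.startswith(prefix) else key
            value = metadata.get(field)
            # Assumption: a list-valued field matches if any item is allowed.
            candidates = value if isinstance(value, list) else [value]
            if not any(candidate in allowed for candidate in candidates):
                return False
    return True


filters = [
    {"metadata.genre": ["tragedy", "comedy"]},
    {"metadata.year_written": ["1595", "1596"]},
]
print(matches_filters({"genre": "comedy", "year_written": "1595"}, filters))   # True
print(matches_filters({"genre": "tragedy", "year_written": "1600"}, filters))  # False
```

Because matching is exact, a document whose "year_written" is stored as the number 1595 would not match the string "1595" in this sketch, which is one reason to keep field types consistent.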

Note that the retrieval endpoint only supports exact match filtering. You cannot use range queries, partial matches, or complex operators.

If you add a metadata file for a document that has already been processed, the document is processed again so that the metadata can be added.

Troubleshooting

If the metadata you specified is not showing up in your vector index:

  • Is the filename correct? Each metadata filename must be exactly the same as the name of the document, with the addition of .metadata.json at the end.
  • Is the JSON properly formatted?
  • Check the document/chunk in your vector database to ensure it's formatted as expected.
