Using User-defined Metadata

User-defined metadata enables you to associate custom fields of information with your documents by providing metadata files alongside your source documents.

Supported source connectors

User-defined metadata is supported for the following source connectors:

  • Amazon S3

  • Azure Blob Storage

  • Dropbox

  • File Upload

  • Google Cloud Storage

  • Google Drive

Create User-defined Metadata

For each document you'd like to associate user-defined metadata with, create a file in valid JSON format containing the metadata.

The metadata fields must be nested inside a top-level "metadata" key:

    {
        "metadata" : {
          "field-1" : "value",
          "field-2": ["value1", "value2"]
        }
    }

The filename must be exactly the same as the name of the document, with the addition of .metadata.json at the end. For example:

document.pdf → document.pdf.metadata.json
example.txt → example.txt.metadata.json

A different schema can be used for each document.

If metadata files are malformed, or if their names don't follow the naming format, the document (not just the metadata file) will be considered unprocessable and will be skipped.
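The naming rule and the required "metadata" wrapper can also be applied programmatically. Here is a minimal Python sketch of that idea. It is not part of Vectorize, and the helper names metadata_path_for and write_metadata are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path


def metadata_path_for(document: Path) -> Path:
    """Return the metadata filename: the full document name plus .metadata.json."""
    return document.with_name(document.name + ".metadata.json")


def write_metadata(document: Path, fields: dict) -> Path:
    """Write the fields under the top-level "metadata" key, next to the document."""
    target = metadata_path_for(document)
    target.write_text(json.dumps({"metadata": fields}, indent=4), encoding="utf-8")
    return target
```

For example, write_metadata(Path("document.pdf"), {"field-1": "value"}) would create document.pdf.metadata.json in the same folder as document.pdf. Because the file is produced by json.dumps, it is always valid JSON.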

Database-specific Formatting

All supported vector databases except Pinecone fully support JSON data types, including nested JSON.

Pinecone only supports metadata fields of the following types:

  • Boolean

  • String

  • Number

  • Lists of strings
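If you plan to use Pinecone, you may want to check your metadata against this list before uploading. The sketch below is one way to do that in Python. It relies only on the four types listed above, and the function names is_pinecone_compatible and incompatible_fields are hypothetical.

```python
def is_pinecone_compatible(value) -> bool:
    """True if the value is a Boolean, string, number, or list of strings."""
    if isinstance(value, (bool, str, int, float)):
        return True
    if isinstance(value, list):
        return all(isinstance(item, str) for item in value)
    return False


def incompatible_fields(fields: dict) -> list:
    """Return the names of fields whose values fall outside the listed types."""
    return [name for name, value in fields.items()
            if not is_pinecone_compatible(value)]


fields = {
    "genre": "comedy",
    "number_of_acts": 5,
    "setting": ["Athens", "Nearby Forest"],
    "length": {"words": 16087, "lines": 2349},
}
print(incompatible_fields(fields))  # ['length']
```

Here the nested "length" object is flagged because nested JSON is not among the listed types, while the string, number, and list-of-strings fields pass.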

User-defined Metadata Example

Let's look at a simple example using a collection of Shakespeare's plays.

Suppose you have these files:

plays/
  ├── a-midsummer-nights-dream.pdf
  ├── hamlet.pdf
  ├── king-lear.pdf
  ├── macbeth.pdf
  └── the-taming-of-the-shrew.pdf

For each file, we'll create a JSON metadata file with information about the play. For example:

    {
        "metadata" : {
            "title": "A Midsummer Night's Dream",
            "genre": "comedy",
            "year_written": "1595",
            "setting": ["Athens", "Nearby Forest"],
            "characters": ["Puck", "Oberon", "Titania", "Lysander", "Hermia", "Demetrius", "Helena", "Bottom"],
            "number_of_acts": 5,
            "themes": ["love", "magic", "nature", "dreams", "mischief"],
            "famous_quotes": [
                "Lord, what fools these mortals be!",
                "The course of true love never did run smooth"
            ],
            "length": {
                "words": 16087,
                "lines": 2349
            }
        }
    }

In this example, the metadata allows you to build more precise queries for your AI application, such as:

  • Find all tragedies written before 1600

  • Search for plays with themes of revenge

  • Look for specific characters or settings

  • Filter by genre while doing semantic search

As described above, each metadata file takes the full name of its document, with .metadata.json added at the end.

Our resulting file list looks like this:

plays/
├── a-midsummer-nights-dream.pdf
├── a-midsummer-nights-dream.pdf.metadata.json
├── hamlet.pdf
├── hamlet.pdf.metadata.json
├── king-lear.pdf
├── king-lear.pdf.metadata.json
├── macbeth.pdf
├── macbeth.pdf.metadata.json
├── the-taming-of-the-shrew.pdf
└── the-taming-of-the-shrew.pdf.metadata.json

All of these files, both the original documents and the JSON files, must be provided to your RAG pipeline.
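Before uploading, it can help to confirm that documents and metadata files pair up as expected. The following Python sketch does this for a single folder. The function name check_pairs and the keys of its result are hypothetical, and it assumes every file that does not end in .metadata.json is a document.

```python
from pathlib import Path

SUFFIX = ".metadata.json"


def check_pairs(directory: Path) -> dict:
    """Report documents without a metadata file and metadata files without a document."""
    names = {p.name for p in directory.iterdir() if p.is_file()}
    metadata = {n for n in names if n.endswith(SUFFIX)}
    documents = names - metadata
    return {
        # Documents that have no <name>.metadata.json next to them.
        "missing_metadata": sorted(
            d for d in documents if d + SUFFIX not in metadata
        ),
        # Metadata files whose name does not match any document exactly.
        "orphan_metadata": sorted(
            m for m in metadata if m[: -len(SUFFIX)] not in documents
        ),
    }
```

Calling check_pairs(Path("plays")) on the folder above should return two empty lists. An entry under "orphan_metadata" usually points to a naming mistake, such as hamlet.metadata.json instead of hamlet.pdf.metadata.json.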

Walkthrough Using Vectorize

We'll walk through this example using the File Upload connector. If you're not familiar with RAG pipelines, check out Pipeline Basics.

  1. Create a new RAG pipeline. For this example, we'll use Pinecone and OpenAI.

  2. When configuring your vector database, specify the metadata fields you'd like to filter on through your Vectorize retrieval endpoint. Make sure to enter metadata. before each field name (for example, metadata.genre). For more information about retrieval endpoints, see Using the Retrieval Endpoint.

  3. Add a File Upload connector to your pipeline, and select both the plays and their metadata files. Click Confirm Selection when done.

  4. Schedule your RAG pipeline as normal.

  5. Once your pipeline populates your database, you'll be able to see the metadata in your index. This screenshot shows metadata we provided in a document record in a Pinecone index.

  6. To access user-defined metadata using your retrieval endpoint, filter the same way you would for other types of metadata, making sure to enter metadata. before each field name. For example:

    curl -L \
    -H "Content-Type: application/json" \
    -H "Authorization: <TOKEN>" \
    -d '{
        "question": "Which plays are tragedies?",
        "numResults": 5,
        "metadata-filters": [
            {
                "metadata.genre": ["tragedy"]
            }
        ]
    }' \
    "https://client.app.vectorize.io/api/gateways/service/o9bf-57d2db385ba2/p24a5ecc3/retrieve"

Note that adding a metadata file for a document that has already been processed triggers reprocessing of that document so that the metadata can be added.
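If you'd rather call the retrieval endpoint from code, the curl request above can be mirrored in Python. This is a minimal sketch using only the standard library. The function name build_retrieval_request is hypothetical, the request body copies the fields shown in the curl example, and the URL and token are values you supply from your own pipeline.

```python
import json
import urllib.request


def build_retrieval_request(url: str, token: str, question: str,
                            filters: list, num_results: int = 5) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    body = {
        "question": question,
        "numResults": num_results,
        # Each filter key carries the metadata. prefix, e.g. "metadata.genre".
        "metadata-filters": filters,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen, for example with filters set to [{"metadata.genre": ["tragedy"]}]. Building the request separately from sending it makes it easy to inspect the JSON body first.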

Troubleshooting

If the metadata you specified is not showing up in your vector index:

  • Is the filename correct? Each metadata filename must be exactly the same as the name of the document, with the addition of .metadata.json at the end.

  • Is the JSON properly formatted?

  • Check the document/chunk in your vector database to ensure it's formatted as expected.
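The first two checks can be automated for a single metadata file. The Python sketch below is one possible approach. The function name check_metadata_file is hypothetical, and it covers only the rules described on this page (the filename suffix, valid JSON, and the top-level "metadata" key).

```python
import json
from pathlib import Path


def check_metadata_file(path: Path) -> list:
    """Return a list of problems found in one metadata file (empty if none)."""
    problems = []
    if not path.name.endswith(".metadata.json"):
        problems.append("filename does not end with .metadata.json")
    try:
        content = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as error:
        problems.append(f"invalid JSON: {error}")
        return problems
    if not isinstance(content, dict) or not isinstance(content.get("metadata"), dict):
        problems.append('top-level "metadata" object is missing')
    return problems
```

An empty list means the file passed these checks. It does not confirm that a matching document exists or that the values suit your vector database.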
