Automatic Metadata Extraction

Automatic metadata extraction lets you define schemas for extracting structured information from your documents. The extracted metadata improves retrieval by providing additional context and filtering options.

Understanding Automatic Metadata Extraction

Automatic metadata extraction uses Vectorize's Iris model to analyze documents and extract structured information based on your defined schemas. There are two types of metadata that can be extracted:

Document Metadata

Document metadata is extracted by analyzing the entire document. It's best suited for high-level information that applies to the document as a whole, such as:

  • Title
  • Author
  • Document type
  • Publication date
  • Summary or conclusions
  • Overall document classification

Document metadata is attached to each chunk of the document, ensuring this high-level context is available regardless of which chunk is retrieved.
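
As a rough illustration of what "attached to each chunk" means, the sketch below copies one document-level dictionary onto every chunk record. The document_metadata field name is the one used in the stored record; the function name, the record layout, and the sample values are assumptions made for this example, not Vectorize's implementation.

```python
from copy import deepcopy


def attach_document_metadata(chunks: list, document_metadata: dict) -> list:
    """Return new chunk records that each carry their own copy of the document metadata."""
    return [
        {**chunk, "document_metadata": deepcopy(document_metadata)}
        for chunk in chunks
    ]


# Hypothetical sample data for illustration only.
chunks = [{"text": "Introduction to the report."}, {"text": "Pricing details."}]
doc_meta = {"title": "Q3 Report", "author": "A. Writer", "document_type": "report"}

records = attach_document_metadata(chunks, doc_meta)
for record in records:
    print(record["text"], "|", record["document_metadata"]["title"])
```

Whichever chunk is retrieved, the same title, author, and document type come back with it.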

Note: Document classification occurs when document metadata is applied. The Iris model uses the name of the metadata schema to decide whether that schema applies to the document.

Section Metadata

Section metadata is applied at the chunk level. For each chunk, the model determines whether the chunk matches one of your defined section metadata schemas. If it does, the model extracts the metadata defined by that schema and adds it to the chunk.

Section metadata is ideal for more specific and detailed information that varies throughout the document, such as:

  • Part numbers
  • Items purchased
  • Values in dollars
  • Technical specifications
  • Status information

Note: If a chunk does not match a section schema, no section metadata is extracted for that chunk.
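
The per-chunk flow can be sketched as follows. In Vectorize the matching decision is made by the model; here a simple keyword rule (keyword_matcher, a hypothetical stand-in) takes its place so the example runs on its own. The chunk_metadata field name matches the stored record; everything else is an assumption for illustration.

```python
from typing import Callable, List, Optional


def keyword_matcher(text: str) -> Optional[dict]:
    """Toy stand-in for the model: match chunks that state a part number."""
    marker = "Part number:"
    if marker in text:
        part_number = text.split(marker, 1)[1].split()[0]
        return {"part_number": part_number}
    return None  # the chunk matches no section schema


def extract_section_metadata(
    chunks: List[str], matcher: Callable[[str], Optional[dict]]
) -> List[dict]:
    """Build one record per chunk; only matching chunks get chunk_metadata."""
    records = []
    for text in chunks:
        record = {"text": text}
        values = matcher(text)
        if values is not None:
            record["chunk_metadata"] = values
        records.append(record)
    return records


records = extract_section_metadata(
    ["Part number: AX-100 ships in June.", "General background on the company."],
    keyword_matcher,
)
for record in records:
    print(record)
```

The second chunk matches nothing, so its record carries no chunk_metadata at all rather than an empty placeholder.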

How Metadata is Stored and Used

Like other types of metadata, automatic metadata is attached to the vector database record and can be used for filtering:

  • Document metadata is stored under the document_metadata field in the database record
  • Section metadata is stored under the chunk_metadata field

This metadata can be used for filtering on the retrieval endpoint and is also visible on the document chunk view in the pipeline details documents table.
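
To make the two field names concrete, here is a minimal sketch that filters a few in-memory records by document_metadata and chunk_metadata. It only illustrates the record shape and is not the filter syntax of the retrieval endpoint; the matches helper and the sample records are assumptions.

```python
def matches(record: dict, field: str, key: str, expected) -> bool:
    """True when record[field][key] exists and equals the expected value."""
    return record.get(field, {}).get(key) == expected


# Hypothetical records shaped like the description above.
records = [
    {
        "text": "Chunk A",
        "document_metadata": {"document_type": "report"},
        "chunk_metadata": {"part_number": "AX-100"},
    },
    {
        "text": "Chunk B",
        "document_metadata": {"document_type": "report"},
    },
    {
        "text": "Chunk C",
        "document_metadata": {"document_type": "article"},
        "chunk_metadata": {"part_number": "AX-100"},
    },
]

# Filter on a document-level field, then narrow by a chunk-level field.
reports = [r for r in records if matches(r, "document_metadata", "document_type", "report")]
ax100_in_reports = [r for r in reports if matches(r, "chunk_metadata", "part_number", "AX-100")]

print([r["text"] for r in reports])
print([r["text"] for r in ax100_in_reports])
```

Chunk B passes the document-level filter but not the chunk-level one, because it has no section metadata.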

Billing for Metadata Extraction

When using automatic metadata extraction, usage is billed as follows:

  • Document metadata: Charged as one additional page for each page in the document, regardless of which extractor you use (Fast or Iris). For example, if a document is 10 pages, you'll be charged for an additional 10 pages for document metadata extraction.

  • Section metadata: Charged as one Iris page per chunk. For example, if your document is split into 15 chunks, you'll be charged for 15 additional Iris pages.

This billing structure reflects the processing required for each type of metadata extraction. Metadata extraction works with both Fast and Iris extractors, though Iris is recommended for best results.
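
The arithmetic above can be written as a small helper. This is a sketch of the stated rules only; the function and key names are assumptions, and actual invoices may include other charges not described here.

```python
def extra_pages_billed(
    document_pages: int,
    chunk_count: int,
    document_metadata: bool = True,
    section_metadata: bool = True,
) -> dict:
    """Additional pages charged for metadata extraction on one document."""
    if document_pages < 0 or chunk_count < 0:
        raise ValueError("page and chunk counts must be non-negative")
    return {
        # One additional page per document page.
        "document_metadata_pages": document_pages if document_metadata else 0,
        # One Iris page per chunk.
        "section_metadata_iris_pages": chunk_count if section_metadata else 0,
    }


# The two examples from the text: a 10-page document split into 15 chunks.
usage = extra_pages_billed(document_pages=10, chunk_count=15)
print(usage)
```

The two counts are kept separate because section metadata is always charged in Iris pages, while document metadata is charged per document page whichever extractor is used.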

Creating and Managing Metadata Schemas

Accessing the Metadata Schema Editor

To create or manage metadata schemas:

  1. Navigate to your organization dashboard
  2. Select "Metadata Schemas" from the navigation menu
  3. Click "Create New Schema" to create a new schema

Creating a New Schema

When creating a new schema, you have three options:

  1. Start from Blank: Create a schema from scratch
  2. Use a Template: Start with a pre-defined schema for common document types
  3. Generate from Document: Upload a sample document and have the model automatically generate a possible schema

Option 1: Start from Blank

Starting from blank gives you a basic schema structure with empty document and sections objects:

{
  "document": {
    "type": "object",
    "properties": {}
  },
  "sections": {}
}
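
If you prefer to draft a schema outside the editor, the blank structure can be filled in programmatically and pasted in as JSON. The sketch below assumes only the structure shown above; the helper names are hypothetical and nothing here calls a Vectorize API.

```python
import json
from typing import List, Optional


def blank_schema() -> dict:
    """The empty structure that 'Start from Blank' provides."""
    return {"document": {"type": "object", "properties": {}}, "sections": {}}


def add_document_property(
    schema: dict, name: str, type_: str, description: str,
    enum: Optional[List[str]] = None,
) -> None:
    """Add one property to the document-level object."""
    prop = {"type": type_, "description": description}
    if enum is not None:
        prop["enum"] = enum
    schema["document"]["properties"][name] = prop


def add_section_property(
    schema: dict, section: str, name: str, type_: str, description: str
) -> None:
    """Add one property to a section type, creating the section if needed."""
    section_schema = schema["sections"].setdefault(
        section, {"type": "object", "properties": {}}
    )
    section_schema["properties"][name] = {"type": type_, "description": description}


schema = blank_schema()
add_document_property(schema, "title", "string", "The title of the document")
add_section_property(
    schema, "product_specification", "part_number", "string", "Product part number"
)
print(json.dumps(schema, indent=2))
```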

Option 2: Use a Template

Vectorize provides built-in templates for common document types like receipts, invoices, and technical documentation. These templates include pre-defined properties that are commonly found in these document types.

Option 3: Generate from Document

This option allows you to upload a sample document and have the Iris model analyze it to generate a suggested schema:

  1. Click "Upload Document for Analysis"
  2. Upload your document (PDF, DOCX, TXT, etc.)
  3. Wait for the analysis to complete
  4. View the extracted results in the Document Metadata tab
  5. Review and edit the generated schema

Editing a Schema

Vectorize provides a visual schema editor that makes it easy to define and manage your metadata schemas without needing to write JSON directly. The visual editor is the recommended way to create and edit schemas.

The schema consists of two main sections:

  1. document: Defines the properties to extract at the document level
  2. sections: Defines different section types and their properties

With the visual editor, you can:

  • Add, edit, and remove properties
  • Set property types (string, number, boolean, etc.)
  • Add descriptions for properties
  • Define enumeration values for properties
  • Preview how the schema will be applied

For advanced users, a raw JSON mode is also available.

Example schema structure (shown here as JSON for reference):

{
  "document": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "The title of the document"
      },
      "author": {
        "type": "string",
        "description": "The author of the document"
      },
      "document_type": {
        "type": "string",
        "enum": ["report", "article", "specification"],
        "description": "The type of document"
      }
    }
  },
  "sections": {
    "product_specification": {
      "type": "object",
      "properties": {
        "product_name": {
          "type": "string",
          "description": "Name of the product"
        },
        "part_number": {
          "type": "string",
          "description": "Product part number"
        },
        "price": {
          "type": "number",
          "description": "Product price in dollars"
        }
      }
    },
    "technical_details": {
      "type": "object",
      "properties": {
        "dimensions": {
          "type": "string",
          "description": "Physical dimensions of the product"
        },
        "weight": {
          "type": "string",
          "description": "Weight of the product"
        }
      }
    }
  }
}
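
To sanity-check extracted values against a schema like this one, a small checker can compare each value with the declared type and enum. This is a minimal sketch that handles only the type and enum keywords used in the example; it assumes nothing about how Vectorize validates output, and check_values is a hypothetical helper.

```python
# Excerpt of the "document" object from the example schema.
DOCUMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "document_type": {
            "type": "string",
            "enum": ["report", "article", "specification"],
        },
    },
}

PYTHON_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def check_values(values: dict, object_schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    properties = object_schema.get("properties", {})
    for key, value in values.items():
        spec = properties.get(key)
        if spec is None:
            problems.append(f"{key}: not defined in the schema")
            continue
        declared = spec.get("type")
        expected = PYTHON_TYPES.get(declared)
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        stray_bool = isinstance(value, bool) and declared != "boolean"
        if expected is not None and (stray_bool or not isinstance(value, expected)):
            problems.append(f"{key}: expected {declared}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key}: {value!r} is not one of {spec['enum']}")
    return problems


good = {"title": "Pump Manual", "document_type": "specification"}
bad = {"title": 42, "document_type": "memo"}
print(check_values(good, DOCUMENT_SCHEMA))
print(check_values(bad, DOCUMENT_SCHEMA))
```

The same function works for a section type by passing, for example, the product_specification object in place of the document object.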

The schema editor also provides tabs to:

  • Edit the schema using the visual editor (recommended)
  • View the raw JSON (for advanced users)
  • View an example of document metadata
  • View examples of section metadata

Note: Schemas are saved per organization. Modifying or deleting a saved schema does not change the schema used by pipelines that were created with it. For the changes to take effect, update the pipeline with the new schema.

Testing Metadata Extraction

The extraction tester allows you to test your metadata schemas against specific documents before using them in a pipeline.

To use the extraction tester:

  1. Navigate to "Extraction Tester" in your organization dashboard
  2. Upload a document
  3. Select the metadata schemas you want to test, or choose to generate a schema automatically
  4. Click "Test Extraction"

The results will show both the text of the document and the extracted metadata, allowing you to verify that your schemas are working as expected.

Generating Schemas During Extraction Testing

The extraction tester also provides an option to automatically generate a metadata schema based on the uploaded document:

  1. Upload your document to the extraction tester
  2. Select the "Generate Schema Automatically" option
  3. Run the extraction test
  4. Review the generated schema and extracted metadata
  5. If satisfied with the results, you can save the generated schema for future use in pipelines

Using Metadata Extraction in Pipelines

To use automatic metadata extraction in your RAG pipeline:

  1. Create a new pipeline or edit an existing one
  2. In the pipeline editor, enable the metadata extraction node
  3. Select the metadata schemas you want to apply
  4. Configure metadata settings (see below)

Metadata Settings

The pipeline editor provides two important settings for metadata:

  1. Add document metadata to chunks: When enabled, document metadata will be added to the text of each chunk. This can improve retrieval, especially for documents that span many chunks.

  2. Add section metadata to chunks: When enabled, section metadata will be added to the text of the chunk. This can improve retrieval for documents with very specific value strings.

When these options are enabled, the metadata is appended to the text chunks before vectorization. Document metadata is added first, followed by section metadata (if configured and available). This process ensures that the metadata is considered during semantic similarity search on the retrieval endpoint.
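
The ordering described above can be sketched as follows. The exact text format Vectorize appends is not specified here, so the "key: value" rendering, the function names, and the parameter names are all assumptions for illustration.

```python
def render(metadata: dict) -> str:
    """Render metadata as 'key: value' lines (an assumed format)."""
    return "\n".join(f"{key}: {value}" for key, value in metadata.items())


def text_for_embedding(
    record: dict,
    add_document_metadata: bool = True,
    add_section_metadata: bool = True,
) -> str:
    """Append document metadata, then section metadata, to the chunk text."""
    parts = [record["text"]]
    if add_document_metadata and record.get("document_metadata"):
        parts.append(render(record["document_metadata"]))
    # Section metadata is appended only when it is enabled and present.
    if add_section_metadata and record.get("chunk_metadata"):
        parts.append(render(record["chunk_metadata"]))
    return "\n\n".join(parts)


# Hypothetical record for illustration only.
record = {
    "text": "The pump weighs 4 kg.",
    "document_metadata": {"title": "Pump Manual"},
    "chunk_metadata": {"weight": "4 kg"},
}
print(text_for_embedding(record))
```

Because the combined string is what gets embedded, a query that mentions the title can match this chunk even though the title never appears in the chunk's original text.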

Adding metadata to chunks can improve retrieval quality by:

  • Making high-level document context available in every chunk
  • Emphasizing important information for the LLM to use when generating answers
  • Providing additional context that might not be explicit in the text
  • Ensuring consistent information is available across all chunks from the same document

Viewing Extracted Metadata

You can view the extracted metadata for your documents in the RAG pipeline document tab. This provides a convenient way to verify that your metadata schemas are working as expected and to see the actual metadata that has been extracted from your documents.

The document tab displays both document metadata and section metadata for each chunk, allowing you to see how the metadata is associated with specific parts of your documents.

Best Practices

  • Document Metadata: Use for high-level information that applies to the entire document
  • Section Metadata: Use for specific details that vary throughout the document
  • Schema Design: Start with templates or document analysis and refine as needed
  • Testing: Always test your schemas with the extraction tester before using them in production
  • Extractor Choice: While automatic metadata extraction works with both the Fast and Iris extractors, the Iris extractor is recommended for best results
  • Metadata to Chunks: Consider enabling the "add metadata to chunks" options for improved retrieval, especially for documents with many chunks or specific value strings
