
Using the Extraction Tester

The Extraction Tester lets you test how different extraction methods process your documents before creating a RAG pipeline. This helps you choose the best extraction method for your specific documents and use case.

Getting Started

  1. Upload your document using the file selector or drag-and-drop interface.


  2. Select an extraction type from the dropdown menu:
    • Fast: A lightweight extractor optimized for speed and basic document processing
    • Vectorize Iris: Our advanced document processing solution that intelligently preserves document structure and semantics


  3. Configure extraction settings, then click Extract Text.
    • Chunk Size: Maximum number of characters in each chunk
    • Start Page: Beginning page for extraction (optional)
    • Page Limit: Maximum number of pages to process (optional)
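
To make the settings above concrete, here is a minimal Python sketch of how a character-based chunk size, a start page, and a page limit might interact. It is an illustration only, not the Extraction Tester's actual implementation: the function names `select_pages` and `chunk_text`, the 1-based page numbering, and the choice to break at whitespace are all assumptions.

```python
from typing import List, Optional


def select_pages(pages: List[str], start_page: int = 1,
                 page_limit: Optional[int] = None) -> List[str]:
    """Return the pages to process (hypothetical helper).

    Assumes 1-based page numbers; with no page_limit, every page from
    start_page onward is kept.
    """
    first = max(start_page, 1) - 1
    last = len(pages) if page_limit is None else first + page_limit
    return pages[first:last]


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters (hypothetical helper).

    Prefers to break at the last space inside the window and falls back
    to a hard cut when the window contains no space.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Look for a space after the first character of the window,
            # so the chunk always advances by at least one character.
            split = text.rfind(" ", start + 1, end + 1)
            if split != -1:
                end = split
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


# Example: process two pages starting at page 2, then chunk the result.
pages = ["page one text", "page two text", "page three text", "page four text"]
selected = select_pages(pages, start_page=2, page_limit=2)
chunks = chunk_text(" ".join(selected), chunk_size=10)
```

A smaller chunk size produces more, shorter chunks, so it is worth trying a few values in the tester and reviewing the chunk views in the results.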


Extraction Results

After processing, you can view your document in the following formats:

  • Text: Raw extracted text

  • Text Chunks: Text split into chunks based on your specified chunk size

  • Markdown: Extracted content with preserved formatting

  • Markdown Chunks: Formatted content split into chunks

  • Metadata: Extracted metadata when using metadata schemas


Vectorize Iris

Vectorize Iris is a model-based extraction solution built for how RAG systems handle PDFs. It combines extraction and chunking into a single process, which makes it easier to get clean, usable text from complex documents.

When using Vectorize Iris, you'll notice enhanced processing features including:

  • Preserved document structure and formatting
  • Intelligent handling of tables and images
  • Advanced metadata extraction capabilities
  • Document classification and section identification

Metadata Extraction with Iris

Vectorize Iris can automatically extract structured metadata from your documents based on defined schemas. This capability allows you to:

  • Extract document-level metadata like title, author, and document type
  • Identify and extract section-specific metadata like part numbers, prices, or technical specifications
  • Test metadata extraction schemas before using them in a pipeline
  • Generate suggested schemas based on document analysis

To test metadata extraction:

  1. Upload your document
  2. Select "Vectorize Iris" as the extraction type
  3. Choose existing metadata schemas or select "Generate Schema Automatically"
  4. View the extracted metadata in the "Metadata" tab of the results
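
As a rough illustration of what checking extracted metadata against a schema can look like, the Python sketch below validates a metadata record against a simple field-to-type mapping. This page does not show the real schema format used by Vectorize Iris, so everything here is an assumption: the schema layout, the field names (borrowed from the examples above), the sample values, and the `check_metadata` helper are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical schemas: field name -> expected Python type.
DOCUMENT_SCHEMA: Dict[str, type] = {
    "title": str,
    "author": str,
    "document_type": str,
}
SECTION_SCHEMA: Dict[str, type] = {
    "part_number": str,
    "price": float,
}


def check_metadata(record: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems: List[str] = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


# Made-up sample records, as they might be copied from the "Metadata" tab.
document_record = {"title": "Pump Manual", "author": "ACME", "document_type": "manual"}
section_record = {"part_number": "A-100", "price": "19.99"}  # price is a string here

document_problems = check_metadata(document_record, DOCUMENT_SCHEMA)
section_problems = check_metadata(section_record, SECTION_SCHEMA)
```

A check like this can help flag fields that a schema consistently misses or mistypes before the schema is used in a pipeline.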

For more details about Iris' capabilities, see Understanding Vectorize Iris.

Common Use Cases

Test your document extraction when you want to:

  • Compare different extraction methods
  • Verify how complex documents like PDFs will be processed
  • Check if tables and formatting are preserved as needed
  • Ensure your documents are processed correctly before creating a pipeline
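
When comparing extraction methods, it can help to diff the text each one produces for the same document. The sketch below does this with Python's standard `difflib` module. It assumes you have copied the output of two extraction runs into strings; the `compare_extractions` helper, the labels, and the sample strings are hypothetical and are not part of the Extraction Tester.

```python
import difflib
from typing import List, Tuple


def compare_extractions(text_a: str, text_b: str,
                        label_a: str = "fast",
                        label_b: str = "iris") -> Tuple[float, List[str]]:
    """Return a similarity ratio (0.0 to 1.0) and a unified line diff."""
    ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()
    diff = list(difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile=label_a,
        tofile=label_b,
        lineterm="",
    ))
    return ratio, diff


# Made-up sample outputs from two extraction runs of the same document.
fast_output = "Part Price\nA-100 19.99\nB-200 24.50"
iris_output = "| Part | Price |\n| A-100 | 19.99 |\n| B-200 | 24.50 |"

ratio, diff = compare_extractions(fast_output, iris_output)
for line in diff:
    print(line)
```

A low ratio or a long diff does not say which output is better; it only shows where the two methods disagree, so you know which parts of the document to inspect by eye.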
