Using the Extraction Tester

The Extraction Tester lets you preview how different extraction methods process your documents before you create a RAG pipeline, so you can choose the method that best fits your documents and use case.

Getting Started

Upload your document using the file selector or drag-and-drop interface.

[Image: Upload Document and Choose Settings]

Select an extraction type from the dropdown menu:
- Fast: A lightweight extractor optimized for speed and basic document processing
- Vectorize Iris: A model-based extractor that preserves document structure and semantics

[Image: Select Extraction Type]

Configure the following extraction settings, then click Extract Text:
- Chunk Size: Maximum number of characters in each chunk
- Start Page: Beginning page for extraction (optional)
- Page Limit: Maximum number of pages to process (optional)

[Image: Configure Extraction Settings]
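
To make the effect of Start Page and Page Limit concrete, here is a minimal Python sketch. It is not the Extraction Tester's implementation: the `ExtractionSettings` class, the `select_pages` helper, and the assumption that Start Page is 1-based are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractionSettings:
    """Hypothetical container mirroring the three settings in the UI."""
    chunk_size: int                    # maximum characters per chunk
    start_page: Optional[int] = None   # optional; assumed to be 1-based here
    page_limit: Optional[int] = None   # optional; maximum pages to process


def select_pages(pages: List[str], settings: ExtractionSettings) -> List[str]:
    """Return the pages that would be processed under the given settings."""
    if settings.chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    start_page = 1 if settings.start_page is None else settings.start_page
    if start_page < 1:
        raise ValueError("start_page must be 1 or greater")
    start = start_page - 1  # convert to a 0-based index
    if settings.page_limit is None:
        end = len(pages)    # no limit: process through the last page
    else:
        end = start + settings.page_limit
    return pages[start:end]


pages = ["page 1", "page 2", "page 3", "page 4", "page 5"]
settings = ExtractionSettings(chunk_size=500, start_page=2, page_limit=3)
print(select_pages(pages, settings))  # ['page 2', 'page 3', 'page 4']
```

Leaving both optional settings empty processes the whole document in this sketch, which matches their description as optional.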

Extraction Results

After processing, you can view your document in up to five formats:

- Text: Raw extracted text
- Text Chunks: Text split into chunks based on your specified chunk size
- Markdown: Extracted content with preserved formatting
- Markdown Chunks: Formatted content split into chunks
- Metadata: Metadata extracted when you use metadata schemas
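
As a rough illustration of what a character-based chunk size means for the chunked views, the sketch below splits text into chunks of at most `chunk_size` characters. The `chunk_text` function is a hypothetical, naive splitter written for this example. It collapses whitespace and breaks on word boundaries, and it is not a description of how Fast or Vectorize Iris actually chunk content.

```python
from typing import List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Words are kept whole where possible; a single word longer than
    chunk_size is hard-split.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # Hard-split any word that cannot fit in a chunk on its own.
        while len(word) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:chunk_size])
            word = word[chunk_size:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            # Adding the word would exceed the limit: close the current chunk.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("The quick brown fox jumps over the lazy dog", 15))
# ['The quick brown', 'fox jumps over', 'the lazy dog']
```

Running a splitter like this with a few candidate sizes is a quick way to build intuition before comparing against the chunks the tester actually produces.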

Vectorize Iris

Vectorize Iris is a model-based extraction solution designed for RAG systems that work with PDFs. It combines extraction and chunking into a single process, which makes it simpler to get clean, usable text from complex documents.

When using Vectorize Iris, you'll notice enhanced processing features including:

- Preserved document structure and formatting
- Intelligent handling of tables and images
- Advanced metadata extraction capabilities
- Document classification and section identification

Metadata Extraction with Iris

Vectorize Iris can automatically extract structured metadata from your documents based on defined schemas. This capability allows you to:

- Extract document-level metadata like title, author, and document type
- Identify and extract section-specific metadata like part numbers, prices, or technical specifications
- Test metadata extraction schemas before using them in a pipeline
- Generate suggested schemas based on document analysis
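
The idea of checking extracted metadata against a schema can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the schema layout (field name mapped to an expected type), the `find_schema_problems` helper, and the sample values are invented, and none of it reflects the schema format that Vectorize Iris actually uses.

```python
from typing import Any, Dict, List

# Hypothetical schemas: field name -> expected Python type.
DOCUMENT_SCHEMA: Dict[str, type] = {"title": str, "author": str, "document_type": str}
SECTION_SCHEMA: Dict[str, type] = {"part_number": str, "price": float}


def find_schema_problems(metadata: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """List fields that are missing or have an unexpected type."""
    problems = []
    for field, expected in schema.items():
        if field not in metadata:
            problems.append(f"missing: {field}")
        elif not isinstance(metadata[field], expected):
            problems.append(f"wrong type: {field}")
    return problems


# Invented sample values standing in for extracted metadata.
document_metadata = {"title": "Pump Catalog", "author": "Acme", "document_type": "catalog"}
section_metadata = {"part_number": "P-100", "price": "19.99"}  # price came back as text

print(find_schema_problems(document_metadata, DOCUMENT_SCHEMA))  # []
print(find_schema_problems(section_metadata, SECTION_SCHEMA))    # ['wrong type: price']
```

A check like this shows the kind of problem worth catching while testing a schema, such as a price extracted as text rather than as a number.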

To test metadata extraction:

1. Upload your document
2. Select "Vectorize Iris" as the extraction type
3. Choose existing metadata schemas or select "Generate Schema Automatically"
4. View the extracted metadata in the "Metadata" tab of the results

For more details about Iris' capabilities, see Understanding Vectorize Iris.

Common Use Cases

Test your document extraction when you want to:

- Compare different extraction methods
- Verify how complex documents like PDFs will be processed
- Check if tables and formatting are preserved as needed
- Ensure your documents are processed correctly before creating a pipeline
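
If you copy the Markdown output of two runs out of the tester, a small script can support a side-by-side comparison. The sketch below is one possible approach, not a feature of the tool: `count_table_rows` and `similarity` are hypothetical helpers, and the two sample strings are invented rather than real extractor output. It uses `difflib` from the Python standard library.

```python
import difflib


def count_table_rows(markdown: str) -> int:
    """Count lines that look like Markdown table rows (start and end with '|').

    The header separator row (| --- | --- |) is counted as well.
    """
    return sum(
        1
        for line in markdown.splitlines()
        if line.strip().startswith("|") and line.strip().endswith("|")
    )


def similarity(text_a: str, text_b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two extracted texts."""
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()


# Invented samples: one output without table markup, one with it.
output_a = "Part Price\nP-100 19.99\n"
output_b = "| Part | Price |\n| --- | --- |\n| P-100 | 19.99 |\n"

print(count_table_rows(output_a))  # 0
print(count_table_rows(output_b))  # 3
print(round(similarity(output_a, output_b), 2))
```

A row count of zero for a document that you know contains tables suggests that the table structure was not preserved in that output.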
