Using the Extraction Tester
The Extraction Tester lets you test how different extraction methods process your documents before creating a RAG pipeline. This helps you choose the best extraction method for your specific documents and use case.
Getting Started
- Upload your document using the file selector or drag-and-drop interface.
- Select an extraction type from the dropdown menu:
  - Fast: A lightweight extractor optimized for speed and basic document processing
  - Vectorize Iris: Our advanced document processing solution that intelligently preserves document structure and semantics
- Configure extraction settings, then click Extract Text:
  - Chunk Size: Maximum number of characters in each chunk
  - Start Page: Beginning page for extraction (optional)
  - Page Limit: Maximum number of pages to process (optional)
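To make the three settings concrete, here is a minimal Python sketch of how a page window and a character-based chunk size could interact. It is an illustration only, not how the Extraction Tester is implemented: the function names `select_pages` and `chunk_text` are hypothetical, 1-based page numbering is an assumption, and a real extractor may split on document structure rather than at fixed character offsets.

```python
from typing import List, Optional


def select_pages(pages: List[str], start_page: int = 1,
                 page_limit: Optional[int] = None) -> List[str]:
    """Return the pages to process (assumes 1-based page numbers)."""
    if start_page < 1:
        raise ValueError("start_page must be >= 1")
    begin = start_page - 1
    # With no page limit, process everything from start_page onward.
    end = None if page_limit is None else begin + page_limit
    return pages[begin:end]


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Made-up sample pages standing in for an uploaded document.
pages = ["Page one text.", "Page two text.", "Page three text."]
selected = select_pages(pages, start_page=2, page_limit=1)
chunks = chunk_text("\n".join(selected), chunk_size=8)
print(chunks)  # ['Page two', ' text.']
```

A smaller chunk size produces more, shorter chunks. Leaving Start Page and Page Limit unset corresponds here to processing every page.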
Extraction Results
After processing, you can view your document in five different formats:
- Text: Raw extracted text
- Text Chunks: Text split into chunks based on your specified chunk size
- Markdown: Extracted content with preserved formatting
- Markdown Chunks: Formatted content split into chunks
- Metadata: Extracted metadata when using metadata schemas
Vectorize Iris
Vectorize Iris is a model-based extraction solution for PDFs in RAG systems. It combines extraction and chunking into a single process, which makes it easier to get clean, usable text from complex documents.
When using Vectorize Iris, you'll notice enhanced processing features including:
- Preserved document structure and formatting
- Intelligent handling of tables and images
- Advanced metadata extraction capabilities
- Document classification and section identification
Metadata Extraction with Iris
Vectorize Iris can automatically extract structured metadata from your documents based on defined schemas. This capability allows you to:
- Extract document-level metadata like title, author, and document type
- Identify and extract section-specific metadata like part numbers, prices, or technical specifications
- Test metadata extraction schemas before using them in a pipeline
- Generate suggested schemas based on document analysis
To test metadata extraction:
- Upload your document
- Select "Vectorize Iris" as the extraction type
- Choose existing metadata schemas or select "Generate Schema Automatically"
- View the extracted metadata in the "Metadata" tab of the results
For more details about Iris' capabilities, see Understanding Vectorize Iris.
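As a rough illustration of the difference between document-level and section-level metadata, the Python sketch below checks extracted values against a schema. Everything in it is an assumption made for illustration: the schema layout, the field names, the helper `missing_or_mistyped`, and the sample values are hypothetical and do not reflect the actual Vectorize schema format or output.

```python
from typing import Any, Dict, List

# Hypothetical schema: the layout and field names are illustrative only
# and are not the Vectorize metadata schema format.
SCHEMA: Dict[str, Dict[str, type]] = {
    "document": {"title": str, "author": str, "document_type": str},
    "section": {"part_number": str, "price": float},
}


def missing_or_mistyped(record: Dict[str, Any],
                        fields: Dict[str, type]) -> List[str]:
    """List the fields that are absent from record or have the wrong type."""
    return [
        name for name, expected in fields.items()
        if not isinstance(record.get(name), expected)
    ]


# Made-up values standing in for what you might copy from the Metadata tab.
extracted_document = {
    "title": "Pump Catalog",
    "author": "Acme",
    "document_type": "catalog",
}
extracted_section = {"part_number": "P-100", "price": "19.99"}  # price is text

print(missing_or_mistyped(extracted_document, SCHEMA["document"]))  # []
print(missing_or_mistyped(extracted_section, SCHEMA["section"]))    # ['price']
```

A check like this can flag fields that a schema fails to capture, or captures with an unexpected type, before the schema is used in a pipeline.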
Common Use Cases
Test your document extraction when you want to:
- Compare different extraction methods
- Verify how complex documents like PDFs will be processed
- Check if tables and formatting are preserved as needed
- Ensure your documents are processed correctly before creating a pipeline
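If you copy the output of two extraction runs out of the tester, a short script can support the comparison. The following is a minimal sketch using Python's standard `difflib` module. The helper names and the sample strings are made up for illustration, and counting lines that start and end with `|` is only a rough heuristic for Markdown table rows.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two extracted texts."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def count_table_rows(markdown: str) -> int:
    """Count lines that look like Markdown table rows (heuristic)."""
    count = 0
    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            count += 1
    return count


# Made-up sample outputs standing in for text copied from two runs.
fast_output = "Part Price\nP-100 19.99\n"
iris_output = "| Part | Price |\n| --- | --- |\n| P-100 | 19.99 |\n"

print(f"similarity: {similarity(fast_output, iris_output):.2f}")
print("table rows (fast):", count_table_rows(fast_output))
print("table rows (iris):", count_table_rows(iris_output))
```

A low similarity ratio or a difference in table row counts is a cue to inspect the two outputs side by side. Neither number says which result is better for your use case.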