Vectorize Iris SDKs
Use the Vectorize Iris Python or Node.js SDKs to extract text from documents programmatically in your applications. They are ideal for production systems that need to process documents at scale.
Setup
Set your API credentials as environment variables:
export VECTORIZE_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"
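The SDK examples below assume these credentials are picked up from the environment. To fail fast when either variable is missing, a standard-library Python check like this can run at startup:
import os

# Fail fast if the credentials from the Setup step above are missing.
missing = [name for name in ('VECTORIZE_TOKEN', 'VECTORIZE_ORG_ID')
           if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")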
Python SDK
Installation
pip install vectorize-iris
Basic Text Extraction
from vectorize_iris import extract_text_from_file
result = extract_text_from_file('document.pdf')
print(result.text)
Custom Parsing Instructions
Guide the extraction with specific instructions:
from vectorize_iris import extract_text_from_file, ExtractionOptions
result = extract_text_from_file('document.pdf', options=ExtractionOptions(
    parsing_instructions='Focus on extracting tables and ignore headers/footers'
))
print(result.text)
Semantic Chunking
Split documents into semantic chunks:
from vectorize_iris import extract_text_from_file, ExtractionOptions
result = extract_text_from_file('document.pdf', options=ExtractionOptions(
    chunk_size=512
))
for chunk in result.chunks:
    print(chunk)
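If the chunks feed a downstream index (for example, a vector store), a common next step is to pair each chunk with an identifier and its position. A minimal sketch, assuming each item in result.chunks is a plain text string as the loop above suggests:
# Build simple records from the extracted chunks for downstream indexing.
records = [
    {'id': f'document.pdf-{i}', 'position': i, 'text': chunk}
    for i, chunk in enumerate(result.chunks)
]
print(f"Prepared {len(records)} chunk records")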
Metadata Extraction
Extract structured metadata using a schema:
from vectorize_iris import extract_text_from_file, ExtractionOptions
result = extract_text_from_file('invoice.pdf', options=ExtractionOptions(
    metadata_schemas=[{
        'id': 'invoice-data',
        'schema': {
            'invoice_number': 'string',
            'date': 'string',
            'total_amount': 'number',
            'vendor_name': 'string',
            'items': [{
                'description': 'string',
                'quantity': 'number',
                'price': 'number'
            }]
        }
    }]
))
print(result.metadata)
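The shape of result.metadata depends on the SDK version, so check its return type; assuming it behaves like a plain dict with the schema's fields, you can validate the extraction, for example by cross-checking line items against the total:
# Assumes result.metadata is dict-like with the invoice-data fields above.
invoice = result.metadata
line_total = sum(item['quantity'] * item['price'] for item in invoice.get('items', []))
if abs(line_total - invoice.get('total_amount', 0)) > 0.01:
    print(f"Line items sum to {line_total}, but total_amount is {invoice.get('total_amount')}")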
Document Classification
Classify documents using multiple schemas:
from vectorize_iris import extract_text_from_file, ExtractionOptions
result = extract_text_from_file('document.pdf', options=ExtractionOptions(
    metadata_schemas=[
        {'id': 'invoice', 'schema': {'invoice_number': 'string', 'total': 'number'}},
        {'id': 'receipt', 'schema': {'merchant': 'string', 'amount': 'number'}},
        {'id': 'contract', 'schema': {'parties': ['string'], 'effective_date': 'string'}},
    ]
))
# Iris will match the document to the most appropriate schema
print(result.metadata)
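A common follow-up is to route the document based on which schema matched. How the matched schema is reported depends on the SDK's return shape; the sketch below side-steps that by treating result.metadata as dict-like and checking which schema's fields it contains (an assumption to verify against your SDK version):
# Hypothetical routing step: assumes result.metadata is dict-like and holds
# the fields of whichever schema Iris matched.
fields = set(result.metadata.keys())
if {'invoice_number', 'total'} & fields:
    route = 'invoice'
elif {'merchant', 'amount'} & fields:
    route = 'receipt'
else:
    route = 'contract'
print(f"Routing document as: {route}")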
Complete Example
from vectorize_iris import extract_text_from_file, ExtractionOptions
# Extract with all options
result = extract_text_from_file(
    'financial-report.pdf',
    options=ExtractionOptions(
        chunk_size=256,
        parsing_instructions='Extract financial data, tables, and key metrics',
        metadata_schemas=[{
            'id': 'financial-report',
            'schema': {
                'company_name': 'string',
                'report_date': 'string',
                'revenue': 'number',
                'expenses': 'number',
                'net_income': 'number'
            }
        }]
    )
)
print(f"Extracted text: {result.text}")
print(f"Chunks: {len(result.chunks)}")
print(f"Metadata: {result.metadata}")
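To keep the full extraction output for later use, you can write it to disk; the batch-processing example further down uses result.to_json(), and the same call works here:
# Save the serialized result next to the source document.
with open('financial-report.json', 'w') as f:
    f.write(result.to_json())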
Node.js/TypeScript SDK
Installation
npm install @vectorize-io/iris
Basic Text Extraction
import { extractTextFromFile } from '@vectorize-io/iris';
const result = await extractTextFromFile('document.pdf');
console.log(result.text);
Custom Parsing Instructions
import { extractTextFromFile } from '@vectorize-io/iris';
const result = await extractTextFromFile('document.pdf', {
  parsingInstructions: 'Focus on extracting tables and ignore headers/footers'
});
console.log(result.text);
Semantic Chunking
import { extractTextFromFile } from '@vectorize-io/iris';
const result = await extractTextFromFile('document.pdf', {
  chunkSize: 512
});
result.chunks.forEach(chunk => {
  console.log(chunk);
});
Metadata Extraction
import { extractTextFromFile } from '@vectorize-io/iris';
const result = await extractTextFromFile('invoice.pdf', {
  metadataSchemas: [{
    id: 'invoice-data',
    schema: {
      invoice_number: 'string',
      date: 'string',
      total_amount: 'number',
      vendor_name: 'string',
      items: [{
        description: 'string',
        quantity: 'number',
        price: 'number'
      }]
    }
  }]
});
console.log(result.metadata);
Document Classification
import { extractTextFromFile } from '@vectorize-io/iris';
const result = await extractTextFromFile('document.pdf', {
  metadataSchemas: [
    { id: 'invoice', schema: { invoice_number: 'string', total: 'number' } },
    { id: 'receipt', schema: { merchant: 'string', amount: 'number' } },
    { id: 'contract', schema: { parties: ['string'], effective_date: 'string' } },
  ]
});
// Iris will match the document to the most appropriate schema
console.log(result.metadata);
Complete Example
import { extractTextFromFile } from '@vectorize-io/iris';
// Extract with all options
const result = await extractTextFromFile('financial-report.pdf', {
  chunkSize: 256,
  parsingInstructions: 'Extract financial data, tables, and key metrics',
  metadataSchemas: [{
    id: 'financial-report',
    schema: {
      company_name: 'string',
      report_date: 'string',
      revenue: 'number',
      expenses: 'number',
      net_income: 'number'
    }
  }]
});
console.log(`Extracted text: ${result.text}`);
console.log(`Chunks: ${result.chunks.length}`);
console.log(`Metadata:`, result.metadata);
Common Integration Patterns
Processing Uploaded Files
Python (Flask):
from flask import Flask, request
from vectorize_iris import extract_text_from_file

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload_file():
    file = request.files['file']
    file.save('temp.pdf')
    result = extract_text_from_file('temp.pdf')
    return {'text': result.text, 'metadata': result.metadata}
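To exercise the endpoint, a client can post a file with the requests library (assuming the Flask app runs locally on its default port 5000):
import requests

# Upload a local PDF to the /upload route defined above.
with open('document.pdf', 'rb') as f:
    response = requests.post('http://localhost:5000/upload', files={'file': f})
print(response.json()['text'])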
Node.js (Express):
import express from 'express';
import fileUpload from 'express-fileupload';
import { extractTextFromFile } from '@vectorize-io/iris';

const app = express();
// req.files is populated by the express-fileupload middleware used below.
app.use(fileUpload());

app.post('/upload', async (req, res) => {
  const file = req.files.file;
  await file.mv('temp.pdf');
  const result = await extractTextFromFile('temp.pdf');
  res.json({ text: result.text, metadata: result.metadata });
});
Batch Processing with Error Handling
Python:
import os
from vectorize_iris import extract_text_from_file, ExtractionOptions

def process_directory(input_dir, output_dir):
    for filename in os.listdir(input_dir):
        if filename.endswith('.pdf'):
            try:
                result = extract_text_from_file(
                    os.path.join(input_dir, filename),
                    options=ExtractionOptions(chunk_size=512)
                )
                output_file = os.path.join(output_dir, f"{filename}.json")
                with open(output_file, 'w') as f:
                    f.write(result.to_json())
                print(f"✓ Processed {filename}")
            except Exception as e:
                print(f"✗ Failed {filename}: {e}")
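Usage is a single call with example directory names; create the output directory first so the writes succeed:
import os

os.makedirs('extracted', exist_ok=True)
process_directory('documents', 'extracted')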
Node.js:
import { readdir, writeFile } from 'fs/promises';
import { extractTextFromFile } from '@vectorize-io/iris';

async function processDirectory(inputDir: string, outputDir: string) {
  const files = await readdir(inputDir);
  for (const filename of files) {
    if (filename.endsWith('.pdf')) {
      try {
        const result = await extractTextFromFile(
          `${inputDir}/${filename}`,
          { chunkSize: 512 }
        );
        await writeFile(
          `${outputDir}/${filename}.json`,
          JSON.stringify(result)
        );
        console.log(`✓ Processed ${filename}`);
      } catch (e) {
        console.error(`✗ Failed ${filename}:`, e);
      }
    }
  }
}
Next Steps
- Try the CLI tool for quick testing
- Learn more about Vectorize Iris
- Use Iris in RAG Pipelines
- Test extraction in the Extraction Tester
- Explore the vectorize-iris GitHub repository