Extract Data from PDFs using Vectorize Iris

Note: The API is currently in Beta.

Learn how to use the Vectorize API to extract text from unstructured data (PDFs, documents, images, and more) using Vectorize Iris.

Prerequisites

Before you begin, you'll need:

A Vectorize account
An API access token (how to create one)
Your organization ID (see below)
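The listings later in this guide hard-code the token and organization ID as placeholder strings. As a minimal sketch, one way to keep them out of source code is to read them from environment variables. The variable names `VECTORIZE_TOKEN` and `VECTORIZE_ORG_ID` and the helper `load_settings` are hypothetical and not part of the Vectorize client.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the API token and organization ID from the environment.

    VECTORIZE_TOKEN and VECTORIZE_ORG_ID are hypothetical names chosen
    for this sketch; use whatever names suit your deployment.
    """
    required = ("VECTORIZE_TOKEN", "VECTORIZE_ORG_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "access_token": env["VECTORIZE_TOKEN"],
        "organization_id": env["VECTORIZE_ORG_ID"],
    }
```

The returned values can then replace the "your-api-token" and "your-organization-id" placeholders in the examples.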

Finding your Organization ID

Your organization ID is in the Vectorize platform URL:

https://platform.vectorize.io/organization/[YOUR-ORG-ID]

For example, if your URL is:

https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb

Your organization ID is: ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
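If you would rather not copy the ID by hand, the rule above (the ID is the path segment after `organization`) can be sketched with the Python standard library. The helper name `organization_id_from_url` is made up for this illustration.

```python
from urllib.parse import urlparse


def organization_id_from_url(url: str) -> str:
    """Return the path segment that follows 'organization' in a platform URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    try:
        return segments[segments.index("organization") + 1]
    except (ValueError, IndexError):
        # 'organization' is absent, or nothing follows it
        raise ValueError(f"No organization ID found in URL: {url!r}") from None


print(organization_id_from_url(
    "https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb"
))
# prints ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
```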

Start the extraction

First, upload the file you want to extract text from. Then start an extraction for the uploaded file and keep the extraction ID that is returned.

Python

import os

import urllib3
import vectorize_client as v  # Vectorize Python client, aliased to match the calls below

# Initialize the client
api = v.ApiClient(v.Configuration(access_token="your-api-token"))

# Create API instances
files_api = v.FilesApi(api)
extraction_api = v.ExtractionApi(api)

# File details
content_type = "application/pdf"
file_path = "path/to/file.pdf"

# Start file upload (returns a file ID and a URL to upload the bytes to)
start_file_upload_response = files_api.start_file_upload(
    "your-organization-id",
    start_file_upload_request=v.StartFileUploadRequest(
        content_type=content_type,
        name="My file.pdf",
    )
)

# Upload the file
http = urllib3.PoolManager()

with open(file_path, "rb") as f:
    response = http.request(
        "PUT",
        start_file_upload_response.upload_url,
        body=f,
        headers={
            "Content-Type": content_type,
            "Content-Length": str(os.path.getsize(file_path))
        }
    )
    if response.status != 200:
        print("Upload failed:", response.data)
    else:
        print("Upload successful")

# Start extraction
response = extraction_api.start_extraction(
    "your-organization-id",
    start_extraction_request=v.StartExtractionRequest(
        file_id=start_file_upload_response.file_id
    )
)
extraction_id = response.extraction_id

Node.js

const fs = require("fs");
const vectorize = require("@vectorize-io/vectorize-client");
const { FilesApi, ExtractionApi } = vectorize;

// Initialize the client
const apiConfig = {
    basePath: 'https://api.vectorize.io',
    accessToken: 'your-api-token'
};

// Create API instances
const filesApi = new FilesApi(apiConfig);
const extractionApi = new ExtractionApi(apiConfig);

// File details
const contentType = "application/pdf";
const filePath = "path/to/file.pdf";

// Start file upload
const startFileUploadResponse = await filesApi.startFileUpload({
    organizationId: "your-organization-id",
    startFileUploadRequest: {
        contentType: contentType,
        name: "My file.pdf"
    }
});

// Upload the file
const fileBuffer = fs.readFileSync(filePath);
const fileStats = fs.statSync(filePath);

const uploadResponse = await fetch(startFileUploadResponse.uploadUrl, {
    method: 'PUT',
    body: fileBuffer,
    headers: {
        'Content-Type': contentType,
        'Content-Length': fileStats.size.toString()
    }
});

if (uploadResponse.status !== 200) {
    const errorText = await uploadResponse.text();
    console.log("Upload failed:", errorText);
} else {
    console.log("Upload successful");
}

// Start extraction
const response = await extractionApi.startExtraction({
    organizationId: "your-organization-id",
    startExtractionRequest: {
        fileId: startFileUploadResponse.fileId
    }
});
const extractionId = response.extractionId;
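Both listings hard-code `application/pdf` as the content type. For other inputs such as images, a minimal sketch is to derive the type and upload name from the file path with Python's standard `mimetypes` module. The helper name `upload_details` is hypothetical, and which content types the API accepts is not covered here.

```python
import mimetypes
from pathlib import Path


def upload_details(file_path: str) -> dict:
    """Guess the upload name and MIME type from a file path.

    Only the file name is inspected; the file does not need to exist.
    """
    path = Path(file_path)
    content_type, _encoding = mimetypes.guess_type(path.name)
    if content_type is None:
        raise ValueError(f"Cannot guess a content type for {path.name!r}")
    return {"name": path.name, "content_type": content_type}


print(upload_details("path/to/file.pdf"))
# prints {'name': 'file.pdf', 'content_type': 'application/pdf'}
```

The resulting `name` and `content_type` values can be passed to the start-file-upload request and reused in the `Content-Type` header of the `PUT` request.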

Get the extraction result

Extraction runs asynchronously. Use the extraction ID to check the status and retrieve your results.

Python

import time

while True:
    response = extraction_api.get_extraction_result("your-organization-id", "your-extraction-id")
    if response.ready:
        if response.data.success:
            print(response.data.text)
        else:
            print("Extraction failed:", response.data.error)
        break
    print("Extraction in progress...")
    time.sleep(2)  # Wait 2 seconds before checking again

Node.js

while (true) {
    const response = await extractionApi.getExtractionResult({
        organizationId: "your-organization-id",
        extractionId: "your-extraction-id"
    });

    if (response.ready) {
        if (response.data.success) {
            console.log(response.data.text);
        } else {
            console.log("Extraction failed:", response.data.error);
        }
        break;
    }
    console.log("Extraction in progress...");
    await new Promise(resolve => setTimeout(resolve, 2000)); // Wait 2 seconds
}
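The loops above poll indefinitely. As a hedged sketch, the same pattern can be wrapped in a helper that gives up after a deadline. The helper `wait_for_result` and its parameters are hypothetical; it only assumes that the object returned by `fetch` has a `ready` attribute, as the responses above do.

```python
import time
from typing import Any, Callable


def wait_for_result(
    fetch: Callable[[], Any],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch() until its result reports ready, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        response = fetch()
        if response.ready:
            return response
        # Stop if the next wait would run past the deadline
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"Result not ready after {timeout_s} seconds")
        sleep(interval_s)
```

With the client from the earlier listing, this could be called as `wait_for_result(lambda: extraction_api.get_extraction_result("your-organization-id", extraction_id))`, after which `response.data.success` is checked as before.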
