Extract Data from PDFs using Vectorize Iris
Note: The API is currently in Beta.
Learn how to use the Vectorize API with Vectorize Iris to extract text from unstructured data such as PDFs, documents, and images.
Prerequisites
Before you begin, you'll need:
- A Vectorize account
- An API access token (create one here)
- Your organization ID (see below)
Finding your Organization ID
Your organization ID is in the Vectorize platform URL:
https://platform.vectorize.io/organization/[YOUR-ORG-ID]
For example, if your URL is:
https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
Your organization ID is: ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
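If you want to pull the ID out of a copied URL programmatically, the rule above can be sketched as follows. The helper name `org_id_from_url` is hypothetical and not part of the Vectorize client; it only assumes the URL shape shown above, where the ID is the path segment right after `organization`.

```python
from urllib.parse import urlparse


def org_id_from_url(url: str) -> str:
    """Return the path segment that follows 'organization' in a platform URL."""
    # Split the URL path into its non-empty segments.
    segments = [s for s in urlparse(url).path.split("/") if s]
    try:
        # The organization ID is the segment directly after "organization".
        return segments[segments.index("organization") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"No organization ID found in URL: {url!r}") from None


print(org_id_from_url(
    "https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb"
))
# prints: ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
```

Because the helper looks only at the segment after `organization`, it also works when the copied URL has further path segments after the ID.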
Start the extraction
First, upload the file you want to extract text from, then start the extraction using the ID of the uploaded file.
Python

# Assumes `v` is the imported Vectorize client module, `api` is an
# authenticated API client, and `org_id` holds your organization ID.
import os
import urllib3

# Create API instances
files_api = v.FilesApi(api)
extraction_api = v.ExtractionApi(api)

# File details
content_type = "application/pdf"
file_path = "path/to/file.pdf"

# Start the file upload to obtain an upload URL and a file ID
start_file_upload_response = files_api.start_file_upload(
    org_id,
    start_file_upload_request=v.StartFileUploadRequest(
        content_type=content_type,
        name="My file.pdf",
    )
)

# Upload the file contents to the returned URL
http = urllib3.PoolManager()
with open(file_path, "rb") as f:
    response = http.request(
        "PUT",
        start_file_upload_response.upload_url,
        body=f,
        headers={
            "Content-Type": content_type,
            "Content-Length": str(os.path.getsize(file_path))
        }
    )

# Stop here if the upload failed, so no extraction is started for a missing file
if response.status != 200:
    raise RuntimeError(f"Upload failed: {response.data}")
print("Upload successful")

# Start extraction
response = extraction_api.start_extraction(
    org_id,
    start_extraction_request=v.StartExtractionRequest(
        file_id=start_file_upload_response.file_id
    )
)
extraction_id = response.extraction_id
Node.js

// Assumes `v` is the imported Vectorize client module, `api` is an
// authenticated client configuration, and `orgId` holds your organization ID.
// Run this inside an async function, because it uses `await`.
const fs = require('fs');

// Create API instances
const filesApi = new v.FilesApi(api);
const extractionApi = new v.ExtractionApi(api);

// File details
const contentType = "application/pdf";
const filePath = "path/to/file.pdf";

// Start the file upload to obtain an upload URL and a file ID
const startResponse = await filesApi.startFileUpload({
  organization: orgId,
  startFileUploadRequest: {
    name: "My file.pdf",
    contentType: contentType
  }
});

// Upload the file contents to the returned URL
const fileBuffer = fs.readFileSync(filePath);
const fetchResponse = await fetch(startResponse.uploadUrl, {
  method: 'PUT',
  body: fileBuffer,
  headers: {
    'Content-Type': contentType
  },
});
if (!fetchResponse.ok) {
  throw new Error(`Failed to upload file: ${fetchResponse.statusText}`);
}
console.log("Upload successful");

// Start extraction
const response = await extractionApi.startExtraction({
  organization: orgId,
  startExtractionRequest: {
    fileId: startResponse.fileId,
    // Also chunk the extracted text, as a RAG pipeline would
    chunkSize: 512,
  }
});
const extractionId = response.extractionId;
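The snippets above hard-code the content type as `application/pdf`. When uploading other file types, one option is to derive the content type from the file name. The following is a minimal sketch using Python's standard `mimetypes` module. The helper name `guess_content_type` and the `application/octet-stream` fallback are assumptions for illustration; check which content types the API accepts before relying on the fallback.

```python
import mimetypes


def guess_content_type(file_path: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from the file extension, falling back to `default`."""
    # guess_type returns a (type, encoding) tuple; type is None if unknown.
    content_type, _encoding = mimetypes.guess_type(file_path)
    return content_type or default


print(guess_content_type("path/to/file.pdf"))  # prints: application/pdf
print(guess_content_type("scan.png"))          # prints: image/png
```

The returned value could then be passed as `content_type` both when starting the upload and in the `Content-Type` header of the `PUT` request, so the two stay consistent.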
Get the extraction result
Extraction runs asynchronously. Use the extraction ID to check the status and retrieve your results.
Python

import time

while True:
    response = extraction_api.get_extraction_result(org_id, extraction_id)
    if response.ready:
        if response.data.success:
            print(response.data.text)
        else:
            print("Extraction failed:", response.data.error)
        break
    print("Extraction in progress...")
    time.sleep(2)  # Wait 2 seconds before checking again
Node.js

// Helper function to sleep
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

while (true) {
  const result = await extractionApi.getExtractionResult({
    organization: orgId,
    extractionId: extractionId,
  });
  if (result.ready) {
    if (result.data.success) {
      console.log(result.data.text);
    } else {
      console.log("Extraction failed:", result.data.error);
    }
    break;
  }
  console.log("Extraction in progress...");
  await sleep(2000); // Wait 2 seconds before checking again
}
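The polling loops above run until the result is ready, which means they never stop if an extraction stalls. A minimal sketch of the same loop with a timeout is shown below. The helper name `poll_until_ready` and the default 300-second timeout are assumptions chosen for illustration, not values from the Vectorize API. The helper only relies on the result object exposing a `ready` attribute, as in the examples above.

```python
import time
from typing import Any, Callable


def poll_until_ready(
    fetch: Callable[[], Any],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,  # arbitrary default, adjust to your documents
) -> Any:
    """Call fetch() until its result has a truthy `ready`, or raise on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result.ready:
            return result
        # Give up if the next wait would take us past the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"Result not ready after {timeout_s} seconds")
        time.sleep(interval_s)


# Usage with the objects from the examples above (not executed here):
# result = poll_until_ready(
#     lambda: extraction_api.get_extraction_result(org_id, extraction_id)
# )
# if result.data.success:
#     print(result.data.text)
```

Passing the API call in as a function keeps the helper independent of the client, so it can be tested with a stand-in object before being pointed at a real extraction.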