Extract Data from PDFs using Vectorize Iris
Note: The API is currently in Beta.
Learn how to use the Vectorize API to extract text from unstructured data (PDFs, documents, images, and more) using Vectorize Iris.
Prerequisites
Before you begin, you'll need:
- A Vectorize account
- An API access token (how to create one)
- Your organization ID (see below)
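The examples on this page hard-code the token and organization ID as placeholder strings. One way to keep real values out of source files is to read them from environment variables. The following is a minimal sketch; the variable names `VECTORIZE_TOKEN` and `VECTORIZE_ORG_ID` and the helper `load_credentials` are hypothetical names chosen for illustration, not something the Vectorize API defines.

```python
import os
from typing import Mapping, Tuple


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (access_token, organization_id) read from an environment mapping.

    VECTORIZE_TOKEN and VECTORIZE_ORG_ID are hypothetical names used only
    in this sketch; pick whatever names fit your deployment.
    """
    token = env.get("VECTORIZE_TOKEN", "").strip()
    org_id = env.get("VECTORIZE_ORG_ID", "").strip()

    # Collect the names of any variables that are unset or blank
    missing = [
        name
        for name, value in (("VECTORIZE_TOKEN", token), ("VECTORIZE_ORG_ID", org_id))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return token, org_id
```

The returned values can then replace the `"your-api-token"` and `"your-organization-id"` placeholders used in the examples.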
Finding your Organization ID
Your organization ID is in the Vectorize platform URL:
https://platform.vectorize.io/organization/[YOUR-ORG-ID]
For example, if your URL is:
https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
Your organization ID is: ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
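If you prefer to copy the whole URL and pull the ID out programmatically, the rule above can be sketched as follows. The helper name `organization_id_from_url` is hypothetical, and the sketch assumes only what is described here: the ID is the path segment that follows `organization`.

```python
from urllib.parse import urlparse


def organization_id_from_url(url: str) -> str:
    """Return the path segment that follows 'organization' in a platform URL."""
    # Split the URL path into its non-empty segments
    segments = [part for part in urlparse(url).path.split("/") if part]
    if "organization" not in segments:
        raise ValueError("URL has no 'organization' path segment: " + url)

    index = segments.index("organization")
    if index + 1 >= len(segments):
        raise ValueError("URL has no ID after 'organization': " + url)
    return segments[index + 1]


org_id = organization_id_from_url(
    "https://platform.vectorize.io/organization/ecf3fa1d-30d0-4df1-8af6-f4852bc851cb"
)
print(org_id)  # ecf3fa1d-30d0-4df1-8af6-f4852bc851cb
```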
Start the extraction
First, upload the file you want to extract text from, then start an extraction using the file ID returned by the upload.
- Python
import os

import urllib3
import vectorize_client as v  # Vectorize Python client, referenced as "v" below

# Initialize the client
api = v.ApiClient(v.Configuration(access_token="your-api-token"))

# Create API instances
files_api = v.FilesApi(api)
extraction_api = v.ExtractionApi(api)

# File details
content_type = "application/pdf"
file_path = "path/to/file.pdf"

# Start file upload: the response carries a file ID and an upload URL
start_file_upload_response = files_api.start_file_upload(
    "your-organization-id",
    start_file_upload_request=v.StartFileUploadRequest(
        content_type=content_type,
        name="My file.pdf",
    )
)

# Upload the file contents to the returned URL
http = urllib3.PoolManager()
with open(file_path, "rb") as f:
    response = http.request(
        "PUT",
        start_file_upload_response.upload_url,
        body=f,
        headers={
            "Content-Type": content_type,
            "Content-Length": str(os.path.getsize(file_path))
        }
    )

if response.status != 200:
    # Stop here: extraction cannot start without an uploaded file
    raise RuntimeError(f"Upload failed: {response.data}")
print("Upload successful")

# Start extraction on the uploaded file
response = extraction_api.start_extraction(
    "your-organization-id",
    start_extraction_request=v.StartExtractionRequest(
        file_id=start_file_upload_response.file_id
    )
)
extraction_id = response.extraction_id
- Node.js
// Run as an ES module (top-level await); the global fetch requires Node.js 18+
import fs from "node:fs";
import * as vectorize from "@vectorize-io/vectorize-client"; // Vectorize Node.js client

const { FilesApi, ExtractionApi } = vectorize;

// Initialize the client
const apiConfig = {
  basePath: 'https://api.vectorize.io',
  accessToken: 'your-api-token'
};

// Create API instances
const filesApi = new FilesApi(apiConfig);
const extractionApi = new ExtractionApi(apiConfig);

// File details
const contentType = "application/pdf";
const filePath = "path/to/file.pdf";

// Start file upload: the response carries a file ID and an upload URL
const startFileUploadResponse = await filesApi.startFileUpload({
  organizationId: "your-organization-id",
  startFileUploadRequest: {
    contentType: contentType,
    name: "My file.pdf"
  }
});

// Upload the file contents to the returned URL
const fileBuffer = fs.readFileSync(filePath);
const fileStats = fs.statSync(filePath);
const uploadResponse = await fetch(startFileUploadResponse.uploadUrl, {
  method: 'PUT',
  body: fileBuffer,
  headers: {
    'Content-Type': contentType,
    'Content-Length': fileStats.size.toString()
  }
});

if (uploadResponse.status !== 200) {
  // Stop here: extraction cannot start without an uploaded file
  const errorText = await uploadResponse.text();
  throw new Error(`Upload failed: ${errorText}`);
}
console.log("Upload successful");

// Start extraction on the uploaded file
const response = await extractionApi.startExtraction({
  organizationId: "your-organization-id",
  startExtractionRequest: {
    fileId: startFileUploadResponse.fileId
  }
});
const extractionId = response.extractionId;
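The upload examples hard-code `application/pdf` as the content type. Since the introduction also mentions documents and images, one option is to derive the content type and upload name from the file path. The sketch below uses Python's standard `mimetypes` module; the helper name `describe_upload` is hypothetical, and whether Vectorize accepts a given content type is not covered here, so treat the result as a starting point.

```python
import mimetypes
import os


def describe_upload(file_path: str) -> dict:
    """Guess the content type from the file extension and derive the upload name."""
    content_type, _encoding = mimetypes.guess_type(file_path)
    if content_type is None:
        # Unknown extension: make the caller choose a type explicitly
        raise ValueError("Cannot guess a content type for: " + file_path)
    return {"name": os.path.basename(file_path), "content_type": content_type}


details = describe_upload("path/to/file.pdf")
print(details)  # {'name': 'file.pdf', 'content_type': 'application/pdf'}
```

The two values map onto the `name` and `content_type` fields of the upload request, and the same `content_type` goes into the `Content-Type` header of the `PUT` request.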
Get the extraction result
Extraction runs asynchronously. Use the extraction ID to check the status and retrieve your results.
- Python
import time

# extraction_api and extraction_id come from the previous step
while True:
    response = extraction_api.get_extraction_result("your-organization-id", extraction_id)
    if response.ready:
        if response.data.success:
            print(response.data.text)
        else:
            print("Extraction failed:", response.data.error)
        break
    print("Extraction in progress...")
    time.sleep(2)  # Wait 2 seconds before checking again
- Node.js
// extractionApi and extractionId come from the previous step
while (true) {
  const response = await extractionApi.getExtractionResult({
    organizationId: "your-organization-id",
    extractionId: extractionId
  });
  if (response.ready) {
    if (response.data.success) {
      console.log(response.data.text);
    } else {
      console.log("Extraction failed:", response.data.error);
    }
    break;
  }
  console.log("Extraction in progress...");
  await new Promise(resolve => setTimeout(resolve, 2000)); // Wait 2 seconds before checking again
}
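The polling loops above run until the result is ready, with no upper bound on how long they wait. A bounded variant can be sketched as below. The helper `wait_for_result` and its `interval` and `timeout` defaults are hypothetical choices for illustration; the only thing it assumes about the response is the `ready` attribute used in the examples above.

```python
import time
from typing import Any, Callable


def wait_for_result(
    fetch: Callable[[], Any],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], Any] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Call fetch() until its response has ready == True, or raise TimeoutError."""
    deadline = clock() + timeout
    while True:
        response = fetch()
        if response.ready:
            return response
        if clock() >= deadline:
            raise TimeoutError(f"Result not ready after {timeout} seconds")
        sleep(interval)  # Pause between checks
```

With the client from the previous steps, a call might look like `wait_for_result(lambda: extraction_api.get_extraction_result("your-organization-id", extraction_id))`, after which `response.data.success` can be checked as before. The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.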