Bulk Download API
The Bulk Download API allows you to retrieve approved records from the HITL queue programmatically. It supports three modes of operation based on your data size and requirements.
Prerequisites
Before using the Bulk Download API, ensure you have:
- An API key created for Human Quality Review (see Retrieving Approved Results)
- Your organization ID (found in your ETL endpoint URL)
- Your class ID (found in Download and Sync Manager)
API Endpoint
GET https://us-central.unstract.com/mr/api/{organization_id}/approved/result/{class_id}/
Authentication
Include your API key in the Authorization header:
Authorization: Bearer <api_key>
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| page | integer | 1 | Page number for pagination |
| page_size | integer | 50 | Number of records per page (1-500) |
| download_files | boolean | false | Whether to include file content in the response |
| email | string | null | Email address for async download notifications |
Download Modes
The API operates in three modes depending on your parameters and data size:
Mode 1: Metadata Only (download_files=false)
Returns metadata about approved records without file content. Use this to query what's available before downloading.
curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/result/<class_id>/?page=1&page_size=50&download_files=false' \
--header 'Authorization: Bearer <api_key>'
Response:
{
"results": [
{
"file": "invoice_001.pdf",
"status": "approved",
"workflow_id": "abc123-...",
"file_execution_id": "exec-456...",
"hitl_queue_name": "my_queue"
}
],
"pagination": {
"total_records": 150,
"page": 1,
"page_size": 50,
"total_pages": 3
}
}
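The pagination object tells you how many requests a full retrieval will take. A minimal sketch (plain Python, no network calls) that derives the page count from a metadata response like the one above:

```python
import math

def pages_needed(pagination: dict) -> int:
    """Derive the number of pages to fetch from a pagination block."""
    return math.ceil(pagination["total_records"] / pagination["page_size"])

# Sample pagination block from a metadata-only response.
pagination = {"total_records": 150, "page": 1, "page_size": 50, "total_pages": 3}

# The computed value matches the total_pages reported by the API.
print(pages_needed(pagination))  # → 3
```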
Mode 2: Synchronous Download (download_files=true, small files)
When the total file size is under the threshold (default: 200 MB), files are returned directly in the JSON response as base64-encoded content.
curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/result/<class_id>/?download_files=true&page_size=10' \
--header 'Authorization: Bearer <api_key>'
Response:
{
"results": [
{
"file": "invoice_001.pdf",
"file_content": "<base64_encoded_content>",
"status": "approved",
"result": {
"invoice_number": "INV-001",
"total_amount": "1500.00"
},
"workflow_id": "abc123-...",
"file_execution_id": "exec-456..."
}
],
"pagination": {
"total_records": 10,
"page": 1,
"page_size": 10,
"total_pages": 1
},
"total_size_bytes": 5242880
}
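In synchronous mode each record carries its file as a base64 string in file_content. A sketch of decoding one record and writing it to disk; the sample record below is hypothetical and only mirrors the response shape above:

```python
import base64
from pathlib import Path

def save_record_file(record: dict, out_dir: Path) -> Path:
    """Decode a record's base64 file_content and write it under out_dir."""
    out_path = out_dir / record["file"]
    out_path.write_bytes(base64.b64decode(record["file_content"]))
    return out_path

# Hypothetical record mirroring the synchronous response shape.
record = {
    "file": "invoice_001.pdf",
    "file_content": base64.b64encode(b"%PDF-1.4 sample").decode(),
}
saved = save_record_file(record, Path("."))
print(saved.name)  # → invoice_001.pdf
```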
Mode 3: Asynchronous Download (download_files=true, large files)
When the total file size exceeds the threshold, the API creates a background job to prepare a ZIP archive.
curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/result/<class_id>/?download_files=true&page_size=100&email=notify@example.com' \
--header 'Authorization: Bearer <api_key>'
Initial Response (HTTP 202 Accepted):
{
"job_id": "job-789...",
"status": "processing",
"total_files": 100,
"total_size_bytes": 524288000,
"message": "Files are large. Creating zip archive in background. Check status at the provided URL.",
"status_url": "/mr/api/<organization_id>/approved/download-status/job-789.../"
}
Checking Async Download Status
Poll the status endpoint to check if your download is ready:
curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/download-status/<job_id>/' \
--header 'Authorization: Bearer <api_key>'
Response (Processing):
{
"job_id": "job-789...",
"status": "processing",
"total_files": 100,
"processed_files": 45,
"total_size_mb": 500.00,
"progress_percentage": 45.0,
"message": "Download is being prepared. Please check back shortly."
}
Response (Completed):
{
"job_id": "job-789...",
"status": "completed",
"total_files": 100,
"processed_files": 100,
"total_size_mb": 500.00,
"download_url": "/mr/api/<organization_id>/approved/download/job-789.../",
"expires_at": "2024-01-02T12:00:00Z"
}
Response (Failed):
{
"job_id": "job-789...",
"status": "failed",
"total_files": 100,
"processed_files": 50,
"error_message": "Storage quota exceeded"
}
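The three status shapes above suggest a polling loop with explicit terminal states. The sketch below separates the loop from the HTTP call; fetch_status is a stand-in for the GET request shown above, so the loop can be exercised without a live API:

```python
import time

def wait_for_download(fetch_status, poll_interval: float = 15.0, max_polls: int = 240):
    """Poll a status fetcher until the job completes or fails.

    fetch_status: callable returning a status dict like the responses above.
    Returns the final status dict on completion; raises on failure or timeout.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(status.get("error_message", "download job failed"))
        time.sleep(poll_interval)
    raise TimeoutError("download job did not finish in time")

# Simulated fetcher: two 'processing' responses, then 'completed'.
responses = iter([
    {"status": "processing", "progress_percentage": 45.0},
    {"status": "processing", "progress_percentage": 90.0},
    {"status": "completed", "download_url": "/mr/api/org/approved/download/job-789/"},
])
final = wait_for_download(lambda: next(responses), poll_interval=0.0)
print(final["status"])  # → completed
```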
Downloading the ZIP Archive
Once the job is completed, download the ZIP file:
curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/download/<job_id>/' \
--header 'Authorization: Bearer <api_key>' \
--output approved_records.zip
The ZIP archive contains:
- metadata.json - Complete metadata for all records
- files/ - Directory containing all the source files
Each download can only be consumed once. After downloading, the ZIP file is removed from storage. If you need the files again, you'll need to create a new download request.
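Because each archive can be downloaded only once, it is worth validating it as soon as it lands on disk. A sketch that opens the ZIP and reads metadata.json; here the archive is built in memory purely to mirror the documented layout:

```python
import io
import json
import zipfile

def read_archive_metadata(zip_bytes: bytes) -> dict:
    """Open a downloaded archive and return the parsed metadata.json."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return json.loads(zf.read("metadata.json"))

# Build a tiny in-memory archive mirroring the documented layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("metadata.json", json.dumps({"total_records": 1}))
    zf.writestr("files/invoice_001.pdf", b"%PDF-1.4 sample")

meta = read_archive_metadata(buf.getvalue())
print(meta["total_records"])  # → 1
```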
Email Notifications
For async downloads, you can receive an email notification when the download is ready:
- Pass your email in the email parameter
- If no email is provided, notifications are sent to the API key owner's email
The notification email includes:
- Direct download link
- Number of files included
- Total file size
- Expiration time for the download
Best Practices
Pagination
For large queues, use pagination to retrieve records in manageable batches:
# Get first page
curl 'https://.../approved/result/<class_id>/?page=1&page_size=100&download_files=false'
# Get subsequent pages
curl 'https://.../approved/result/<class_id>/?page=2&page_size=100&download_files=false'
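The two requests above generalize to a loop that walks every page until total_pages is exhausted. A sketch with the HTTP call factored out; fetch_page stands in for the metadata-only GET request:

```python
def iter_all_records(fetch_page, page_size: int = 100):
    """Yield every record by walking pages until total_pages is reached.

    fetch_page(page, page_size) must return a response dict with
    'results' and 'pagination' keys, as in metadata-only mode.
    """
    page = 1
    while True:
        data = fetch_page(page, page_size)
        yield from data["results"]
        if page >= data["pagination"]["total_pages"]:
            return
        page += 1

# Simulated two-page queue of three records.
def fake_fetch(page, page_size):
    pages = {
        1: {"results": [{"file": "a.pdf"}, {"file": "b.pdf"}],
            "pagination": {"total_pages": 2}},
        2: {"results": [{"file": "c.pdf"}],
            "pagination": {"total_pages": 2}},
    }
    return pages[page]

names = [r["file"] for r in iter_all_records(fake_fetch, page_size=2)]
print(names)  # → ['a.pdf', 'b.pdf', 'c.pdf']
```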
Efficient Retrieval
- Check metadata first: Use download_files=false to see how many records are available
- Use appropriate page sizes: Larger page sizes reduce API calls but increase response time
- Monitor async jobs: Poll the status endpoint every 10-30 seconds for async downloads
Error Handling
| Status Code | Meaning |
|---|---|
| 200 | Success - records returned |
| 202 | Accepted - async job created |
| 400 | Bad request - invalid parameters |
| 401 | Unauthorized - invalid or missing API key |
| 404 | Not found - job or class ID doesn't exist |
| 410 | Gone - download already consumed |
| 500 | Server error |
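The table above maps naturally onto a small dispatch helper in client code. A sketch classifying responses into retryable versus terminal outcomes; the category names are an assumption for illustration, not part of the API:

```python
def classify_response(status_code: int) -> str:
    """Map an HTTP status from this API to a coarse client action.

    Categories ('ok', 'fix-credentials', etc.) are illustrative, not API-defined.
    """
    if status_code in (200, 202):
        return "ok"                 # records returned or async job created
    if status_code == 401:
        return "fix-credentials"    # invalid or missing API key
    if status_code in (400, 404, 410):
        return "do-not-retry"       # bad params, missing resource, or consumed download
    if status_code >= 500:
        return "retry"              # transient server error
    return "unexpected"

print(classify_response(202))  # → ok
print(classify_response(410))  # → do-not-retry
print(classify_response(503))  # → retry
```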
Configuration Limits
The following limits are configurable per environment:
| Setting | Default | Description |
|---|---|---|
| Max page size | 500 | Maximum records per page |
| Sync download threshold | 200 MB | Files under this size return synchronously |
| Max bulk download files | 1000 | Maximum files per download request |
| Download URL expiry | 24 hours | Time before download link expires |
| Job retention | 7 days | How long completed jobs are kept |
Example: Complete Workflow
import requests
import time
BASE_URL = "https://us-central.unstract.com"
ORG_ID = "your_org_id"
CLASS_ID = "your_class_id"
API_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}"}
# Step 1: Check available records
response = requests.get(
f"{BASE_URL}/mr/api/{ORG_ID}/approved/result/{CLASS_ID}/",
params={"download_files": "false", "page_size": 100},
headers=headers
)
data = response.json()
print(f"Total records available: {data['pagination']['total_records']}")
# Step 2: Request download with email notification
response = requests.get(
f"{BASE_URL}/mr/api/{ORG_ID}/approved/result/{CLASS_ID}/",
params={
"download_files": "true",
"page_size": 100,
"email": "your-email@example.com"
},
headers=headers
)
if response.status_code == 202:
# Async download initiated
job_data = response.json()
job_id = job_data["job_id"]
print(f"Async job created: {job_id}")
# Step 3: Poll for completion
while True:
status_response = requests.get(
f"{BASE_URL}/mr/api/{ORG_ID}/approved/download-status/{job_id}/",
headers=headers
)
status_data = status_response.json()
if status_data["status"] == "completed":
download_url = status_data["download_url"]
print(f"Download ready: {download_url}")
break
elif status_data["status"] == "failed":
print(f"Job failed: {status_data.get('error_message')}")
break
else:
progress = status_data.get("progress_percentage", 0)
print(f"Processing: {progress}%")
time.sleep(15)
# Step 4: Download the ZIP file
if status_data["status"] == "completed":
download_response = requests.get(
f"{BASE_URL}{download_url}",
headers=headers
)
with open("approved_records.zip", "wb") as f:
f.write(download_response.content)
print("Download complete!")
elif response.status_code == 200:
# Sync download - files included in response
data = response.json()
print(f"Downloaded {len(data['results'])} records synchronously")