
Bulk Download API

The Bulk Download API allows you to retrieve approved records from the HITL queue programmatically. It supports three modes of operation based on your data size and requirements.

Prerequisites

Before using the Bulk Download API, ensure you have:

  1. An API key created for Human Quality Review (see Retrieving Approved Results)
  2. Your organization ID (found in your ETL endpoint URL)
  3. Your class ID (found in Download and Sync Manager)

API Endpoint

GET https://us-central.unstract.com/mr/api/{organization_id}/approved/result/{class_id}/

Authentication

Include your API key in the Authorization header:

Authorization: Bearer <api_key>

Query Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| page | integer | 1 | Page number for pagination |
| page_size | integer | 50 | Number of records per page (1-500) |
| download_files | boolean | false | Whether to include file content in response |
| email | string | null | Email address for async download notifications |
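As a minimal sketch of how the endpoint, the Authorization header, and the query parameters fit together, the helper below assembles a request URL and headers. The function name build_request and its range check are illustrative assumptions, not part of the API; the check only mirrors the documented 1-500 range for page_size.

```python
from urllib.parse import urlencode

BASE_URL = "https://us-central.unstract.com"


def build_request(org_id, class_id, api_key, page=1, page_size=50,
                  download_files=False, email=None):
    """Return (url, headers) for a GET against the approved-results endpoint.

    Hypothetical helper: only the URL layout, header, and parameter names
    come from the documentation.
    """
    if not 1 <= page_size <= 500:
        raise ValueError("page_size must be between 1 and 500")
    params = {
        "page": page,
        "page_size": page_size,
        # Booleans are sent as lowercase strings, matching the curl examples.
        "download_files": "true" if download_files else "false",
    }
    if email is not None:
        params["email"] = email  # urlencode percent-encodes the "@"
    url = (
        f"{BASE_URL}/mr/api/{org_id}/approved/result/{class_id}/"
        f"?{urlencode(params)}"
    )
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers
```

The returned pair can be passed to any HTTP client, for example requests.get(url, headers=headers).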

Download Modes

The API operates in three modes depending on your parameters and data size:

Mode 1: Metadata Only (download_files=false)

Returns metadata about approved records without file content. Use this to query what's available before downloading.

curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/result/<class_id>/?page=1&page_size=50&download_files=false' \
--header 'Authorization: Bearer <api_key>'

Response:

{
  "results": [
    {
      "file": "invoice_001.pdf",
      "status": "approved",
      "workflow_id": "abc123-...",
      "file_execution_id": "exec-456...",
      "hitl_queue_name": "my_queue"
    }
  ],
  "pagination": {
    "total_records": 150,
    "page": 1,
    "page_size": 50,
    "total_pages": 3
  }
}

Mode 2: Synchronous Download (download_files=true, small files)

When the total size of the requested files is under the sync download threshold (default: 200 MB), the files are returned directly in the JSON response, base64-encoded in the file_content field of each record.

curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/result/<class_id>/?download_files=true&page_size=10' \
--header 'Authorization: Bearer <api_key>'

Response:

{
  "results": [
    {
      "file": "invoice_001.pdf",
      "file_content": "<base64_encoded_content>",
      "status": "approved",
      "result": {
        "invoice_number": "INV-001",
        "total_amount": "1500.00"
      },
      "workflow_id": "abc123-...",
      "file_execution_id": "exec-456..."
    }
  ],
  "pagination": {
    "total_records": 10,
    "page": 1,
    "page_size": 10,
    "total_pages": 1
  },
  "total_size_bytes": 5242880
}
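To turn a synchronous response back into files on disk, each file_content value has to be decoded. The sketch below assumes standard base64 encoding (as the placeholder in the sample response suggests); save_sync_results is a hypothetical helper name.

```python
import base64
from pathlib import Path


def save_sync_results(payload, out_dir):
    """Decode each record's file_content into out_dir; return written paths.

    Assumes file_content is standard base64. Records without file_content
    (for example metadata-only responses) are skipped.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for record in payload.get("results", []):
        content = record.get("file_content")
        if content is None:
            continue
        # Keep only the final path component so a record cannot escape out_dir.
        target = out / Path(record["file"]).name
        target.write_bytes(base64.b64decode(content))
        written.append(target)
    return written
```

Called as save_sync_results(response.json(), "downloads"), this would write invoice_001.pdf into a downloads directory for the sample response above.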

Mode 3: Asynchronous Download (download_files=true, large files)

When the total file size exceeds the threshold, the API creates a background job to prepare a ZIP archive.

curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/result/<class_id>/?download_files=true&page_size=100&email=notify@example.com' \
--header 'Authorization: Bearer <api_key>'

Initial Response (HTTP 202 Accepted):

{
  "job_id": "job-789...",
  "status": "processing",
  "total_files": 100,
  "total_size_bytes": 524288000,
  "message": "Files are large. Creating zip archive in background. Check status at the provided URL.",
  "status_url": "/mr/api/<organization_id>/approved/download-status/job-789.../"
}
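Because the same request can come back as either a 200 with file content or a 202 with a job reference, client code usually needs to branch on the status code. The sketch below shows one way to do that and to resolve the relative status_url against the host; interpret_download_response is an illustrative name, not part of the API.

```python
from urllib.parse import urljoin

BASE_URL = "https://us-central.unstract.com"


def interpret_download_response(status_code, payload):
    """Classify a download_files=true response as sync or async.

    Hypothetical helper: returns a small dict describing what to do next.
    """
    if status_code == 200:
        # Mode 2: files are already in the payload.
        return {"mode": "sync", "records": payload.get("results", [])}
    if status_code == 202:
        # Mode 3: status_url is relative, so resolve it against the host.
        return {
            "mode": "async",
            "job_id": payload["job_id"],
            "status_url": urljoin(BASE_URL, payload["status_url"]),
        }
    raise ValueError(f"unexpected status code: {status_code}")
```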

Checking Async Download Status

Poll the status endpoint to check if your download is ready:

curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/download-status/<job_id>/' \
--header 'Authorization: Bearer <api_key>'

Response (Processing):

{
  "job_id": "job-789...",
  "status": "processing",
  "total_files": 100,
  "processed_files": 45,
  "total_size_mb": 500.00,
  "progress_percentage": 45.0,
  "message": "Download is being prepared. Please check back shortly."
}

Response (Completed):

{
  "job_id": "job-789...",
  "status": "completed",
  "total_files": 100,
  "processed_files": 100,
  "total_size_mb": 500.00,
  "download_url": "/mr/api/<organization_id>/approved/download/job-789.../",
  "expires_at": "2024-01-02T12:00:00Z"
}

Response (Failed):

{
  "job_id": "job-789...",
  "status": "failed",
  "total_files": 100,
  "processed_files": 50,
  "error_message": "Storage quota exceeded"
}
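The three status payloads above map naturally onto a polling loop that returns on completed, raises on failed, and otherwise waits. The sketch below is one possible shape for that loop; wait_for_job, the 15-second interval, and the 30-minute timeout are assumptions chosen for illustration. The HTTP call is passed in as a callable so the loop itself stays independent of any HTTP library.

```python
import time


def wait_for_job(fetch_status, interval=15.0, timeout=1800.0, sleep=time.sleep):
    """Poll until the job completes or fails, or the timeout elapses.

    fetch_status is any zero-argument callable that returns the parsed JSON
    from the download-status endpoint. interval and timeout are in seconds
    and are illustrative defaults, not documented values.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status  # contains download_url and expires_at
        if state == "failed":
            raise RuntimeError(status.get("error_message", "download job failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError("download job did not finish in time")
        sleep(interval)
```

With the requests library, fetch_status could be as small as lambda: requests.get(status_url, headers=headers).json().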

Downloading the ZIP Archive

Once the job is completed, download the ZIP file:

curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/download/<job_id>/' \
--header 'Authorization: Bearer <api_key>' \
--output approved_records.zip

The ZIP archive contains:

  • metadata.json - Complete metadata for all records
  • files/ - Directory containing all the source files

One-Time Download

Each download can be consumed only once. After a successful download, the ZIP file is removed from storage, and a repeated request for the same job returns 410 Gone. If you need the files again, create a new download request.
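Since the archive cannot be fetched a second time, it is worth saving it to disk first and inspecting the local copy. The sketch below reads metadata.json and lists the entries under files/ using the standard library; read_archive is a hypothetical helper, and it assumes metadata.json sits at the archive root. The structure of the metadata itself is not documented here, so it is returned as parsed JSON without interpretation.

```python
import json
import zipfile


def read_archive(zip_path):
    """Return (metadata, file_names) from a downloaded archive.

    Assumes metadata.json is at the archive root and source files are
    stored under the files/ prefix, as described above.
    """
    with zipfile.ZipFile(zip_path) as archive:
        metadata = json.loads(archive.read("metadata.json"))
        file_names = [
            name for name in archive.namelist()
            # Skip directory entries, which end with a slash.
            if name.startswith("files/") and not name.endswith("/")
        ]
    return metadata, file_names
```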

Email Notifications

For async downloads, you can receive an email notification when the download is ready:

  1. Pass your email in the email parameter
  2. If no email is provided, notifications are sent to the API key owner's email

The notification email includes:

  • Direct download link
  • Number of files included
  • Total file size
  • Expiration time for the download

Best Practices

Pagination

For large queues, use pagination to retrieve records in manageable batches:

# Get first page
curl 'https://.../approved/result/<class_id>/?page=1&page_size=100&download_files=false' \
--header 'Authorization: Bearer <api_key>'

# Get subsequent pages
curl 'https://.../approved/result/<class_id>/?page=2&page_size=100&download_files=false' \
--header 'Authorization: Bearer <api_key>'
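The same page-by-page retrieval can be automated by following the total_pages value in the pagination object. The generator below is a sketch under that assumption; iter_records and the fetch_page callable are illustrative names, and fetch_page stands in for whatever function performs the authenticated GET and returns the parsed JSON.

```python
def iter_records(fetch_page, page_size=100):
    """Yield every record, requesting pages until total_pages is reached.

    fetch_page(page, page_size) must return the parsed JSON for that page,
    including the "pagination" object shown in the metadata response.
    """
    page = 1
    while True:
        payload = fetch_page(page, page_size)
        yield from payload.get("results", [])
        if page >= payload["pagination"]["total_pages"]:
            break
        page += 1
```

A generator keeps memory use flat for large queues, since only one page of records is held at a time.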

Efficient Retrieval

  1. Check metadata first: Use download_files=false to see how many records are available
  2. Use appropriate page sizes: Larger page sizes reduce API calls but increase response time
  3. Monitor async jobs: Poll the status endpoint every 10-30 seconds for async downloads

Error Handling

| Status Code | Meaning |
|---|---|
| 200 | Success - records returned |
| 202 | Accepted - async job created |
| 400 | Bad request - invalid parameters |
| 401 | Unauthorized - invalid or missing API key |
| 404 | Not found - job or class ID doesn't exist |
| 410 | Gone - download already consumed |
| 500 | Server error |
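One way to apply the status codes above in client code is to let the two success codes through and raise a descriptive error for everything else. The sketch below does that; BulkDownloadError and check_status are hypothetical names introduced for illustration, and the messages are copied from the table.

```python
class BulkDownloadError(Exception):
    """Raised for any non-success status code (hypothetical helper class)."""


# Meanings taken from the status-code table.
ERROR_MEANINGS = {
    400: "Bad request: invalid parameters",
    401: "Unauthorized: invalid or missing API key",
    404: "Not found: job or class ID doesn't exist",
    410: "Gone: download already consumed",
    500: "Server error",
}


def check_status(status_code):
    """Return the code for 200/202; raise BulkDownloadError otherwise."""
    if status_code in (200, 202):
        return status_code
    meaning = ERROR_MEANINGS.get(status_code, "Unexpected response")
    raise BulkDownloadError(f"HTTP {status_code}: {meaning}")
```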

Configuration Limits

The following limits are configurable per environment:

| Setting | Default | Description |
|---|---|---|
| Max page size | 500 | Maximum records per page |
| Sync download threshold | 200 MB | Files under this size return synchronously |
| Max bulk download files | 1000 | Maximum files per download request |
| Download URL expiry | 24 hours | Time before download link expires |
| Job retention | 7 days | How long completed jobs are kept |
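The server decides which mode to use, but a client can roughly anticipate it from the default limits. The sketch below is only an estimate under two assumptions: that "MB" means 2**20 bytes (consistent with the sample responses, where 524288000 bytes is reported as 500.00 MB), and that a total exactly equal to the threshold is handled asynchronously. expected_mode is a hypothetical helper, and the defaults may differ in your environment.

```python
MIB = 1024 * 1024  # assumption: "MB" in the limits table means 2**20 bytes


def expected_mode(total_size_bytes, file_count, threshold_mb=200, max_files=1000):
    """Estimate whether a download_files=true request is served sync or async.

    Uses the documented defaults; both limits are configurable per
    environment, so treat the result as a hint rather than a guarantee.
    """
    if file_count > max_files:
        raise ValueError(f"{file_count} files exceeds the limit of {max_files}")
    return "sync" if total_size_bytes < threshold_mb * MIB else "async"
```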

Example: Complete Workflow

import requests
import time

BASE_URL = "https://us-central.unstract.com"
ORG_ID = "your_org_id"
CLASS_ID = "your_class_id"
API_KEY = "your_api_key"

headers = {"Authorization": f"Bearer {API_KEY}"}

# Step 1: Check available records
response = requests.get(
    f"{BASE_URL}/mr/api/{ORG_ID}/approved/result/{CLASS_ID}/",
    params={"download_files": "false", "page_size": 100},
    headers=headers,
)
response.raise_for_status()  # stop early on 4xx/5xx
data = response.json()
print(f"Total records available: {data['pagination']['total_records']}")

# Step 2: Request download with email notification
response = requests.get(
    f"{BASE_URL}/mr/api/{ORG_ID}/approved/result/{CLASS_ID}/",
    params={
        "download_files": "true",
        "page_size": 100,
        "email": "your-email@example.com",
    },
    headers=headers,
)

if response.status_code == 202:
    # Async download initiated
    job_data = response.json()
    job_id = job_data["job_id"]
    print(f"Async job created: {job_id}")

    # Step 3: Poll for completion
    while True:
        status_response = requests.get(
            f"{BASE_URL}/mr/api/{ORG_ID}/approved/download-status/{job_id}/",
            headers=headers,
        )
        status_response.raise_for_status()
        status_data = status_response.json()

        if status_data["status"] == "completed":
            download_url = status_data["download_url"]
            print(f"Download ready: {download_url}")
            break
        elif status_data["status"] == "failed":
            print(f"Job failed: {status_data.get('error_message')}")
            break
        else:
            progress = status_data.get("progress_percentage", 0)
            print(f"Processing: {progress}%")
            time.sleep(15)

    # Step 4: Download the ZIP file (streamed, since archives can be large)
    if status_data["status"] == "completed":
        with requests.get(
            f"{BASE_URL}{download_url}", headers=headers, stream=True
        ) as download_response:
            download_response.raise_for_status()
            with open("approved_records.zip", "wb") as f:
                for chunk in download_response.iter_content(chunk_size=8192):
                    f.write(chunk)
        print("Download complete!")

elif response.status_code == 200:
    # Sync download - files included in response
    data = response.json()
    print(f"Downloaded {len(data['results'])} records synchronously")

else:
    # Any other status (400, 401, 404, 500, ...) is an error
    response.raise_for_status()