Bulk Download API
The Bulk Download API allows you to retrieve approved records from the HITL queue programmatically. It supports three modes of operation based on your data size and requirements.
Prerequisites
Before using the Bulk Download API, ensure you have:
- An API key created for Human Quality Review (see Retrieving Approved Results)
- Your organization ID (found in your ETL endpoint URL)
- Your class ID (found in Download and Sync Manager)
API Endpoint
GET https://us-central.unstract.com/mr/api/{organization_id}/approved/result/{class_id}/
Authentication
Include your API key in the Authorization header:
Authorization: Bearer <api_key>
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| page | integer | 1 | Page number for pagination |
| page_size | integer | 50 | Number of records per page (1-500) |
| download_files | boolean | false | Whether to include file content in the response |
| email | string | null | Email address for async download notifications |
Download Modes
The API operates in three modes depending on your parameters and data size:
Mode 1: Metadata Only (download_files=false)
Returns metadata about approved records without file content. Use this to query what's available before downloading.
curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/result/<class_id>/?page=1&page_size=50&download_files=false' \
--header 'Authorization: Bearer <api_key>'
Response:
{
"results": [
{
"file": "invoice_001.pdf",
"status": "approved",
"workflow_id": "abc123-...",
"file_execution_id": "exec-456...",
"hitl_queue_name": "my_queue"
}
],
"pagination": {
"total_records": 150,
"page": 1,
"page_size": 50,
"total_pages": 3
}
}
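The pagination object tells you how many requests a full retrieval will take. A minimal sketch (plain Python, no network calls) that derives the page count from a metadata response like the one above:

```python
import math

def pages_needed(pagination: dict) -> int:
    """Derive the number of pages to fetch from a pagination block."""
    return math.ceil(pagination["total_records"] / pagination["page_size"])

# Sample pagination block from a metadata-only response.
pagination = {"total_records": 150, "page": 1, "page_size": 50, "total_pages": 3}

# The computed value matches the total_pages reported by the API.
print(pages_needed(pagination))  # → 3
```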
Mode 2: Synchronous Download (download_files=true, small files)
When the total file size is under the threshold (default: 200 MB), files are returned directly in the JSON response as base64-encoded content.
curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/result/<class_id>/?download_files=true&page_size=10' \
--header 'Authorization: Bearer <api_key>'
Response:
{
"results": [
{
"file": "invoice_001.pdf",
"file_content": "<base64_encoded_content>",
"status": "approved",
"result": {
"invoice_number": "INV-001",
"total_amount": "1500.00"
},
"workflow_id": "abc123-...",
"file_execution_id": "exec-456..."
}
],
"pagination": {
"total_records": 10,
"page": 1,
"page_size": 10,
"total_pages": 1
},
"total_size_bytes": 5242880
}
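In synchronous mode each record carries its file as a base64 string in file_content. A sketch of decoding one record and writing it to disk; the sample record below is hypothetical and only mirrors the response shape above:

```python
import base64
from pathlib import Path

def save_record_file(record: dict, out_dir: Path) -> Path:
    """Decode a record's base64 file_content and write it under out_dir."""
    out_path = out_dir / record["file"]
    out_path.write_bytes(base64.b64decode(record["file_content"]))
    return out_path

# Hypothetical record mirroring the synchronous response shape.
record = {
    "file": "invoice_001.pdf",
    "file_content": base64.b64encode(b"%PDF-1.4 sample").decode(),
}
saved = save_record_file(record, Path("."))
print(saved.name)  # → invoice_001.pdf
```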
Mode 3: Asynchronous Download (download_files=true, large files)
When the total file size exceeds the threshold, the API creates a background job to prepare a ZIP archive.
curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/result/<class_id>/?download_files=true&page_size=100&email=notify@example.com' \
--header 'Authorization: Bearer <api_key>'
Initial Response (HTTP 202 Accepted):
{
"job_id": "job-789...",
"status": "processing",
"total_files": 100,
"total_size_bytes": 524288000,
"message": "Files are large. Creating zip archive in background. Check status at the provided URL.",
"status_url": "/mr/api/<organization_id>/approved/download-status/job-789.../"
}
Checking Async Download Status
Poll the status endpoint to check if your download is ready:
curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/download-status/<job_id>/' \
--header 'Authorization: Bearer <api_key>'
Response (Processing):
{
"job_id": "job-789...",
"status": "processing",
"total_files": 100,
"processed_files": 45,
"total_size_mb": 500.00,
"progress_percentage": 45.0,
"message": "Download is being prepared. Please check back shortly."
}
Response (Completed):
{
"job_id": "job-789...",
"status": "completed",
"total_files": 100,
"processed_files": 100,
"total_size_mb": 500.00,
"download_url": "/mr/api/<organization_id>/approved/download/job-789.../",
"expires_at": "2024-01-02T12:00:00Z"
}
Response (Failed):
{
"job_id": "job-789...",
"status": "failed",
"total_files": 100,
"processed_files": 50,
"error_message": "Storage quota exceeded"
}
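The three status shapes above suggest a polling loop with explicit terminal states. The sketch below separates the loop from the HTTP call; fetch_status is a stand-in for the GET request shown above, so the loop can be exercised without a live API:

```python
import time

def wait_for_download(fetch_status, poll_interval: float = 15.0, max_polls: int = 240):
    """Poll a status fetcher until the job completes or fails.

    fetch_status: callable returning a status dict like the responses above.
    Returns the final status dict on completion; raises on failure or timeout.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(status.get("error_message", "download job failed"))
        time.sleep(poll_interval)
    raise TimeoutError("download job did not finish in time")

# Simulated fetcher: two 'processing' responses, then 'completed'.
responses = iter([
    {"status": "processing", "progress_percentage": 45.0},
    {"status": "processing", "progress_percentage": 90.0},
    {"status": "completed", "download_url": "/mr/api/org/approved/download/job-789/"},
])
final = wait_for_download(lambda: next(responses), poll_interval=0.0)
print(final["status"])  # → completed
```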
Downloading the ZIP Archive
Once the job is completed, download the ZIP file:
curl --location 'https://us-central.unstract.com/mr/api/<organization_id>/approved/download/<job_id>/' \
--header 'Authorization: Bearer <api_key>' \
--output approved_records.zip
The ZIP archive contains:
- metadata.json - Complete metadata for all records
- files/ - Directory containing all the source files
Each download can only be consumed once. After downloading, the ZIP file is removed from storage. If you need the files again, you'll need to create a new download request.
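Because each archive can be downloaded only once, it is worth validating it as soon as it lands on disk. A sketch that opens the ZIP and reads metadata.json; here the archive is built in memory purely to mirror the documented layout:

```python
import io
import json
import zipfile

def read_archive_metadata(zip_bytes: bytes) -> dict:
    """Open a downloaded archive and return the parsed metadata.json."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return json.loads(zf.read("metadata.json"))

# Build a tiny in-memory archive mirroring the documented layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("metadata.json", json.dumps({"total_records": 1}))
    zf.writestr("files/invoice_001.pdf", b"%PDF-1.4 sample")

meta = read_archive_metadata(buf.getvalue())
print(meta["total_records"])  # → 1
```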
Email Notifications
For async downloads, you can receive an email notification when the download is ready:
- Pass your email in the email parameter
- If no email is provided, notifications are sent to the API key owner's email
The notification email includes:
- Direct download link
- Number of files included
- Total file size
- Expiration time for the download
Best Practices
Pagination
For large queues, use pagination to retrieve records in manageable batches:
# Get first page
curl 'https://.../approved/result/<class_id>/?page=1&page_size=100&download_files=false'
# Get subsequent pages
curl 'https://.../approved/result/<class_id>/?page=2&page_size=100&download_files=false'
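The two requests above generalize to a loop that walks every page until total_pages is exhausted. A sketch with the HTTP call factored out; fetch_page stands in for the metadata-only GET request:

```python
def iter_all_records(fetch_page, page_size: int = 100):
    """Yield every record by walking pages until total_pages is reached.

    fetch_page(page, page_size) must return a response dict with
    'results' and 'pagination' keys, as in metadata-only mode.
    """
    page = 1
    while True:
        data = fetch_page(page, page_size)
        yield from data["results"]
        if page >= data["pagination"]["total_pages"]:
            return
        page += 1

# Simulated two-page queue of three records.
def fake_fetch(page, page_size):
    pages = {
        1: {"results": [{"file": "a.pdf"}, {"file": "b.pdf"}],
            "pagination": {"total_pages": 2}},
        2: {"results": [{"file": "c.pdf"}],
            "pagination": {"total_pages": 2}},
    }
    return pages[page]

names = [r["file"] for r in iter_all_records(fake_fetch, page_size=2)]
print(names)  # → ['a.pdf', 'b.pdf', 'c.pdf']
```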
Efficient Retrieval
- Check metadata first: Use download_files=false to see how many records are available
- Use appropriate page sizes: Larger page sizes reduce API calls but increase response time
- Monitor async jobs: Poll the status endpoint every 10-30 seconds for async downloads
Error Handling
| Status Code | Meaning |
|---|---|
| 200 | Success - records returned |
| 202 | Accepted - async job created |
| 400 | Bad request - invalid parameters |
| 401 | Unauthorized - invalid or missing API key |
| 404 | Not found - job or class ID doesn't exist |
| 410 | Gone - download already consumed |
| 500 | Server error |
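The table above maps naturally onto a small dispatch helper in client code. A sketch classifying responses into retryable versus terminal outcomes; the category names are an assumption for illustration, not part of the API:

```python
def classify_response(status_code: int) -> str:
    """Map an HTTP status from this API to a coarse client action.

    Categories ('ok', 'fix-credentials', etc.) are illustrative, not API-defined.
    """
    if status_code in (200, 202):
        return "ok"                 # records returned or async job created
    if status_code == 401:
        return "fix-credentials"    # invalid or missing API key
    if status_code in (400, 404, 410):
        return "do-not-retry"       # bad params, missing resource, or consumed download
    if status_code >= 500:
        return "retry"              # transient server error
    return "unexpected"

print(classify_response(202))  # → ok
print(classify_response(410))  # → do-not-retry
print(classify_response(503))  # → retry
```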
Configuration Limits
The following limits are configurable per environment:
| Setting | Default | Description |
|---|---|---|
| Max page size | 500 | Maximum records per page |
| Sync download threshold | 200 MB | Files under this size return synchronously |
| Max bulk download files | 1000 | Maximum files per download request |
| Download URL expiry | 24 hours | Time before download link expires |
| Job retention | 7 days | How long completed jobs are kept |
Example: Complete Workflow
import requests
import time
BASE_URL = "https://us-central.unstract.com"
ORG_ID = "your_org_id"
CLASS_ID = "your_class_id"
API_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}"}
# Step 1: Check available records
response = requests.get(
f"{BASE_URL}/mr/api/{ORG_ID}/approved/result/{CLASS_ID}/",
params={"download_files": "false", "page_size": 100},
headers=headers
)
data = response.json()
print(f"Total records available: {data['pagination']['total_records']}")
# Step 2: Request download with email notification
response = requests.get(
f"{BASE_URL}/mr/api/{ORG_ID}/approved/result/{CLASS_ID}/",
params={
"download_files": "true",
"page_size": 100,
"email": "your-email@example.com"
},
headers=headers
)
if response.status_code == 202:
# Async download initiated
job_data = response.json()
job_id = job_data["job_id"]
print(f"Async job created: {job_id}")
# Step 3: Poll for completion
while True:
status_response = requests.get(
f"{BASE_URL}/mr/api/{ORG_ID}/approved/download-status/{job_id}/",
headers=headers
)
status_data = status_response.json()
if status_data["status"] == "completed":
download_url = status_data["download_url"]
print(f"Download ready: {download_url}")
break
elif status_data["status"] == "failed":
print(f"Job failed: {status_data.get('error_message')}")
break
else:
progress = status_data.get("progress_percentage", 0)
print(f"Processing: {progress}%")
time.sleep(15)
# Step 4: Download the ZIP file
if status_data["status"] == "completed":
download_response = requests.get(
f"{BASE_URL}{download_url}",
headers=headers
)
with open("approved_records.zip", "wb") as f:
f.write(download_response.content)
print("Download complete!")
elif response.status_code == 200:
# Sync download - files included in response
data = response.json()
print(f"Downloaded {len(data['results'])} records synchronously")