On-Prem Disaster Recovery SOP

This document provides disaster recovery procedures for Unstract and LLMWhisperer on-premise deployments. It covers recovery from availability zone (AZ) or region-wide failures using a cold standby approach. Cold standby is recommended because Unstract is usually business-critical but not mission-critical.

| Objective | Target |
|---|---|
| Recovery Time Objective (RTO) | 6 hours |
| Recovery Point Objective (RPO) | 24 hours |

1. Scope

  • Unstract platform recovery
  • LLMWhisperer service recovery
  • Configuration and data restoration
  • Application-specific validation

Customer Responsibilities

  • Infrastructure provisioning in DR region
  • Database backup/restore operations
  • Network, SSL and DNS configuration
  • Cloud Object Storage replication setup (required for Prompt Studio sample files to be restored)

Data Loss Considerations

| Component | Impact of 24-hour RPO |
|---|---|
| Processed documents | May need re-processing |
| Workflows in progress | Will remain in limbo state, can be retriggered |
| Configuration changes | Reverted to last backup |
| LLMWhisperer APIs | In-flight requests lost, must be retried |
| Usage data | Potential mismatch with vendor records |

2. Prerequisites

Documentation Requirements

  • Current values.yaml and secret.yaml files (version controlled)
  • artifact-key.json for image registry access
  • Deployment documentation reference
  • DR region infrastructure specifications

Infrastructure Prerequisites

The customer must be able to complete the following within 4 hours:

  • Recreate the same infrastructure required as per the deployment docs
  • Restore the backed up database to a new instance, or use a cross-region replica if available
  • For cloud object storage, use the replicated storage

Backup Requirements

  • Database automated backups (daily minimum)
  • Cloud Object Storage cross-region replication active
  • Configuration files in version control or another secure location

3. Backup Procedures

Automated Backups (Customer Managed)

  • Databases: Daily automated snapshots with 7-day retention
  • Cloud Object Storage: Real-time cross-region replication
  • Monitoring: Backup success/failure alerts
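
As a rough illustration of the monitoring item above, the sketch below checks whether the most recent backup is still inside the 24-hour RPO window. The function name `backup_within_rpo` is hypothetical, and the timestamp of the latest snapshot is assumed to come from your own backup tooling as Unix epoch seconds.

```shell
# Return 0 if a backup taken at BACKUP_EPOCH is still within the RPO window.
# Arguments: backup time and current time as Unix epoch seconds, then the RPO
# in hours (defaults to the 24-hour RPO stated in this document).
# backup_within_rpo is a hypothetical helper, not part of any Unstract tooling.
backup_within_rpo() {
    backup_epoch=$1
    now_epoch=$2
    rpo_hours=${3:-24}
    age_seconds=$((now_epoch - backup_epoch))
    # Reject timestamps in the future as well as backups older than the RPO.
    [ "$age_seconds" -ge 0 ] && [ "$age_seconds" -le $((rpo_hours * 3600)) ]
}
```

A call such as `backup_within_rpo "$LAST_SNAPSHOT_EPOCH" "$(date +%s)" || echo "Backup older than RPO"` could feed an alert, where `LAST_SNAPSHOT_EPOCH` is a placeholder for the value reported by your database service.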

Kubernetes Resource Backup (Optional)

Kubernetes resources do not need to be backed up as they can be restored using Helm. However, if you have directly made resource changes in Kubernetes or added additional resources, you may choose to back them up.
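
If you do choose to back up manually changed resources, one possible approach is to export them to YAML with `kubectl get`. The sketch below is only an assumption about what is worth exporting: the helper name `backup_namespace` and the list of resource kinds are illustrative, and the namespaces `unstract` and `llmwhisperer` are taken from the quick reference commands in the appendix.

```shell
# Export selected resource kinds from one namespace into a single YAML file.
# Usage: backup_namespace NAMESPACE OUTPUT_DIR
# backup_namespace is a hypothetical helper; the resource kinds below are an
# illustrative selection, so adjust them to whatever you changed by hand.
backup_namespace() {
    ns=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    # The export includes Secret objects, so store the file securely.
    kubectl get configmap,secret,deployment,service,ingress \
        -n "$ns" -o yaml > "$out_dir/$ns-resources.yaml"
}
```

For example, `backup_namespace unstract ./k8s-backup` followed by `backup_namespace llmwhisperer ./k8s-backup` would produce one file per namespace.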

4. Recovery Procedures

Phase 1: Environment Preparation (2 hours)

Infrastructure Provisioning

Customer Action Required — Provision the following in the DR region:

  • Kubernetes cluster matching production specs
  • PostgreSQL instances (2 databases)
  • Cloud Storage buckets with replication
  • Load balancers and networking
  • DNS entries (can be updated later)

Database Restoration

Customer Action Required:

  • Restore both PostgreSQL databases from latest backup
  • Verify connectivity and update endpoints
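
The connectivity check above can be sketched with `pg_isready`, which ships with the PostgreSQL client tools. This assumes those tools are installed on the machine running the check; the helper name `check_db` and the host names in the usage note are hypothetical.

```shell
# Report whether a PostgreSQL server is accepting connections.
# Usage: check_db HOST [PORT]   (PORT defaults to 5432)
# check_db is a hypothetical helper built on the pg_isready client utility.
check_db() {
    host=$1
    port=${2:-5432}
    # -t 5 limits the connection attempt to five seconds.
    if pg_isready -h "$host" -p "$port" -t 5 >/dev/null 2>&1; then
        echo "OK   $host:$port"
    else
        echo "FAIL $host:$port"
        return 1
    fi
}
```

Running `check_db unstract-db.dr.example.internal` and `check_db llmwhisperer-db.dr.example.internal` (placeholder endpoints) before Phase 2 confirms that both restored instances are reachable. `pg_isready` only tests that the server accepts connections; it does not validate credentials.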

Phase 2: Configuration Update (30 minutes)

Update your values.yaml and secret.yaml files with:

  • New database endpoints
  • New storage bucket names/endpoints
  • New ingress domains
  • Any region-specific configurations

Danger: Ensure ENCRYPTION_KEY in secret.yaml remains unchanged to decrypt existing data.
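
One way to guard against an accidental change is to compare the key line in the version-controlled production file with the one in the edited DR file before deploying. The sketch below assumes that each file contains a single line mentioning `ENCRYPTION_KEY` with the value on that same line; the actual layout of your secret.yaml may differ, and the helper names are hypothetical.

```shell
# Print the first line mentioning ENCRYPTION_KEY, trimmed of surrounding
# whitespace. Assumes key and value sit on the same line of the file.
encryption_key_line() {
    grep 'ENCRYPTION_KEY' "$1" | head -n 1 | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}

# Return 0 only if both files contain an identical, non-empty key line.
# The key itself is never printed.
same_encryption_key() {
    prod_line=$(encryption_key_line "$1")
    dr_line=$(encryption_key_line "$2")
    [ -n "$prod_line" ] && [ "$prod_line" = "$dr_line" ]
}
```

For example, `same_encryption_key prod/secret.yaml dr/secret.yaml || echo "ENCRYPTION_KEY differs"`, where both paths are placeholders for your own file locations.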

Phase 3: Application Deployment (1 hour)

Deploy applications following the standard deployment procedures:

  1. Deploy Unstract: Follow the Deployment Guide — Installation section
  2. Deploy LLMWhisperer: Follow the LLMWhisperer deployment guide

Use the same Helm commands and values files as documented in your original deployment, with the updated DR configuration files.

Phase 4: Network Configuration (Customer Managed)

  • Update DNS to point to DR region
  • Configure SSL certificates
  • Verify ingress timeout settings (900 seconds)

5. Post-Recovery Validation

System Health Checks

Ensure all health checks at pod level (and at ingress level, if configured) are passing.

kubectl get pods -n unstract

All pods should be in Running state with zero restarts.
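
To avoid scanning the pod list by eye, the check above can be scripted. The sketch below filters the default `kubectl get pods --no-headers` output, assuming the usual column order (NAME, READY, STATUS, RESTARTS, AGE); the helper name `unhealthy_pods` is hypothetical. Pods that have finished normally, for example from one-off jobs, would also be listed and need to be judged separately.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the names of
# pods that are not Running or that have restarted at least once.
# Exits non-zero if any such pod is found.
# Columns assumed: NAME READY STATUS RESTARTS AGE (default kubectl layout).
unhealthy_pods() {
    awk 'BEGIN { bad = 0 }
         $3 != "Running" || $4 != "0" { print $1; bad = 1 }
         END { exit bad }'
}
```

Usage: `kubectl get pods -n unstract --no-headers | unhealthy_pods`, and likewise for the `llmwhisperer` namespace. No output and a zero exit status mean every pod passed.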

Functional Validation

  1. Unstract Platform:

    • Access web interface
    • Login
    • Create test workflow / View existing workflows
    • Upload and process test document
    • Verify API deployments
  2. LLMWhisperer:

    • Access dashboard
    • Test extract endpoint
    • Verify usage tracking

Known Issues Post-Recovery

  • Workflow executions that were in flight at the time of the last snapshot may remain marked as in-progress in the database. Executions started after the snapshot are lost.
  • LLMWhisperer in-flight requests are lost
  • Usage data may be inconsistent with vendor records
  • Configuration changes made in the last 24 hours (last backup point) to adapters, Prompt Studio, workflows, etc. may be lost

6. Testing Guidelines

Testing Schedule

  • Frequency: Semi-annually
  • Type: Table-top exercise or partial test

Test Procedure

  1. Document review and update
  2. Verify backup availability
  3. Test Kubernetes resource restoration
  4. Validate configuration files
  5. Deploy to test namespace (optional)

Success Criteria

  • All pods running without crashes
  • API endpoints responding
  • Test document processed successfully
  • No critical errors in logs

Appendix

A. Quick Reference Commands

# Check deployment status
helm list -n unstract
helm list -n llmwhisperer

# View current configuration
helm get values unstract-platform -n unstract > current-unstract-values.yaml
helm get values whisperer -n llmwhisperer > current-llmwhisperer-values.yaml

# Debug pod issues
kubectl describe pod [POD_NAME] -n [NAMESPACE]
kubectl logs [POD_NAME] -n [NAMESPACE] --previous

B. Troubleshooting

  1. Image Pull Errors

    • Verify gcr-artifact-secret exists in namespace
    • Check artifact-key.json validity
    • Ensure Helm registry login succeeded
  2. Database Connection Failed

    • Verify database endpoints in secret.yaml
    • Check network connectivity
    • Validate credentials
  3. Storage Access Denied

    • Verify IAM policies match production
    • Check bucket names in values.yaml
    • Validate credentials in secret.yaml
  4. Pods in CrashLoopBackOff

    • Check pod logs for specific errors
    • Verify all secrets are present
    • Ensure resource limits are adequate
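
For the image pull errors in item 1, the first check can be sketched as below. The secret name `gcr-artifact-secret` comes from this document; the helper name `check_pull_secret` is hypothetical, and whether both namespaces use the same secret name is an assumption to confirm against your deployment.

```shell
# Report whether the image pull secret exists in the given namespace.
# Usage: check_pull_secret NAMESPACE
# check_pull_secret is a hypothetical helper around `kubectl get secret`.
check_pull_secret() {
    ns=$1
    if kubectl get secret gcr-artifact-secret -n "$ns" >/dev/null 2>&1; then
        echo "present in $ns"
    else
        echo "missing in $ns"
        return 1
    fi
}
```

Running `check_pull_secret unstract` and `check_pull_secret llmwhisperer` shows quickly whether the secret needs to be recreated from artifact-key.json.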

C. Contact Information

| Support Type | Contact | When to Use |
|---|---|---|
| Unstract Support | support@unstract.com | Application-specific issues |
| Unstract Docs | On-prem Edition Docs | Unstract setup reference |
| LLMWhisperer Docs | On-prem Edition Docs | LLMWhisperer setup reference |