On-Prem Disaster Recovery SOP

This document provides disaster recovery procedures for Unstract and LLMWhisperer on-premise deployments. It covers recovery from an availability zone (AZ) or region-wide failure using a cold standby approach. Cold standby is recommended because Unstract is usually business critical but not mission critical.

Objective	Target
Recovery Time Objective (RTO)	6 hours
Recovery Point Objective (RPO)	24 hours
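
As a rough check on these targets, the phase estimates given in section 4 (2 hours, 30 minutes, and 1 hour) can be summed against the 6-hour RTO. The sketch below only does that arithmetic. The function name is illustrative, and Phase 4 is left out because this document states no estimate for it.

```shell
# Print the minutes left in the RTO after the listed phase estimates.
# First argument: RTO in minutes. Remaining arguments: phase estimates in minutes.
remaining_rto_minutes() {
  rto=$1
  shift
  total=0
  for m in "$@"; do
    total=$((total + m))
  done
  echo $((rto - total))
}

# 6-hour RTO; Phase 1 = 120, Phase 2 = 30, Phase 3 = 60 minutes.
remaining_rto_minutes 360 120 30 60   # prints 150
```

Under these figures, about 150 minutes remain for network configuration and post-recovery validation.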

1. Scope

Unstract platform recovery
LLMWhisperer service recovery
Configuration and data restoration
Application-specific validation

Customer Responsibilities

Infrastructure provisioning in DR region
Database backup/restore operations
Network, SSL and DNS configuration
Cloud Object Storage replication setup (required for Prompt Studio sample files to be restored)

Data Loss Considerations

Component	Impact of 24-hour RPO
Processed documents	May need re-processing
Workflows in progress	Remain marked as in progress; can be retriggered
Configuration changes	Reverted to last backup
LLMWhisperer APIs	In-flight requests lost, must be retried
Usage data	Potential mismatch with vendor records

2. Prerequisites

Documentation Requirements

Current values.yaml and secret.yaml files (version controlled)
artifact-key.json for image registry access
Deployment documentation reference
DR region infrastructure specifications

Infrastructure Prerequisites

The customer must be able to complete the following within 4 hours:

Recreate the infrastructure described in the deployment docs
Restore the backed-up database to a new instance, or use a cross-region replica if available
Use the replicated cloud object storage

Backup Requirements

Database automated backups (daily minimum)
Cloud Object Storage cross-region replication active
Configuration files in version control or another secure location

3. Backup Procedures

Automated Backups (Customer Managed)

Databases: Daily automated snapshots with 7-day retention
Cloud Object Storage: Real-time cross-region replication
Monitoring: Backup success/failure alerts
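
One way to turn the backup alerting above into a concrete check is to compare the age of the latest snapshot with the 24-hour RPO. This is a minimal sketch: the function name is hypothetical, and how the snapshot timestamp is obtained depends on the database provider and is not covered here.

```shell
# Succeed when a backup taken at BACKUP_EPOCH is within the RPO window.
# Arguments: backup time and current time as Unix timestamps (seconds),
# then an optional RPO in hours (defaults to the 24 hours stated in this SOP).
backup_within_rpo() {
  backup_epoch=$1
  now_epoch=$2
  rpo_hours=${3:-24}
  age_seconds=$((now_epoch - backup_epoch))
  [ "$age_seconds" -ge 0 ] && [ "$age_seconds" -le $((rpo_hours * 3600)) ]
}

# Usage (the snapshot timestamp comes from your backup tooling):
#   backup_within_rpo "$SNAPSHOT_EPOCH" "$(date +%s)" || echo "Latest backup is older than the RPO"
```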

Kubernetes Resource Backup (Optional)

Kubernetes resources do not need to be backed up because Helm can recreate them. However, if you have changed resources directly in Kubernetes or added resources outside Helm, you may choose to back them up.
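
If you do choose to back up manually changed resources, a plain export with kubectl is one option. The sketch below is illustrative only: the function name, the list of resource kinds, and the output file name are assumptions, so adjust them to whatever was actually changed by hand.

```shell
# Export selected resource kinds from a namespace into a single YAML file.
# The resource kinds and default file name are illustrative choices.
backup_k8s_resources() {
  namespace=$1
  outfile=${2:-"${namespace}-resources.yaml"}
  kubectl get configmap,secret,ingress,service,deployment \
    -n "$namespace" -o yaml > "$outfile"
}

# Usage:
#   backup_k8s_resources unstract
#   backup_k8s_resources llmwhisperer llmwhisperer-backup.yaml
```

The exported file includes Secret data in base64 form, so store it with the same care as secret.yaml.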

4. Recovery Procedures

Phase 1: Environment Preparation (2 hours)

Infrastructure Provisioning

Customer Action Required — Provision the following in the DR region:

Kubernetes cluster matching production specs
PostgreSQL instances (2 databases)
Cloud Storage buckets with replication
Load balancers and networking
DNS entries (can be updated later)

Database Restoration

Customer Action Required:

Restore both PostgreSQL databases from latest backup
Verify connectivity and update endpoints
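
A quick connectivity check for the restored instances can be scripted with pg_isready, which ships with the PostgreSQL client tools. This is a sketch under assumptions: the function name is hypothetical and the endpoints in the usage comment are placeholders. pg_isready only confirms that the server accepts connections; it does not validate credentials or data.

```shell
# Check that each restored PostgreSQL endpoint accepts connections.
# Each argument must be a host:port pair.
check_pg_endpoints() {
  failed=0
  for endpoint in "$@"; do
    host=${endpoint%%:*}
    port=${endpoint##*:}
    if pg_isready -h "$host" -p "$port" -t 5 >/dev/null; then
      echo "OK   $endpoint"
    else
      echo "FAIL $endpoint"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (placeholder endpoints for the two restored databases):
#   check_pg_endpoints unstract-db.dr.example.internal:5432 whisperer-db.dr.example.internal:5432
```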

Phase 2: Configuration Update (30 minutes)

Update your values.yaml and secret.yaml files with:

New database endpoints
New storage bucket names/endpoints
New ingress domains
Any region-specific configurations

Danger: Ensure ENCRYPTION_KEY in secret.yaml remains unchanged. It is needed to decrypt existing data.
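
Before deploying, it can help to confirm that the DR copy of secret.yaml carries the same ENCRYPTION_KEY as the production copy. The sketch below assumes the key sits on a single line containing the text ENCRYPTION_KEY in both files. The actual layout of secret.yaml is not shown in this document, and the function name is hypothetical.

```shell
# Compare the ENCRYPTION_KEY line of two secret files without printing the key.
# Assumes one line per file contains "ENCRYPTION_KEY"; leading whitespace is ignored.
same_encryption_key() {
  prod_line=$(grep 'ENCRYPTION_KEY' "$1" | sed 's/^[[:space:]]*//')
  dr_line=$(grep 'ENCRYPTION_KEY' "$2" | sed 's/^[[:space:]]*//')
  [ -n "$prod_line" ] && [ "$prod_line" = "$dr_line" ]
}

# Usage (file paths are placeholders):
#   same_encryption_key prod/secret.yaml dr/secret.yaml || echo "ENCRYPTION_KEY differs or is missing"
```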

Phase 3: Application Deployment (1 hour)

Deploy applications following the standard deployment procedures:

Deploy Unstract: Follow the Deployment Guide — Installation section
Deploy LLMWhisperer: Follow the LLMWhisperer deployment guide

Use the same Helm commands and values files as documented in your original deployment, with the updated DR configuration files.

Phase 4: Network Configuration (Customer Managed)

Update DNS to point to DR region
Configure SSL certificates
Verify ingress timeout settings (900 seconds)
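
The DNS and certificate steps above can be spot-checked from a workstation. The following is a sketch, assuming dig and openssl are installed. The function names are hypothetical, and the hostname and IP address in the usage comments are placeholders.

```shell
# Succeed when HOSTNAME resolves to EXPECTED_IP (for example, the DR load balancer).
dns_points_to() {
  hostname=$1
  expected_ip=$2
  dig +short "$hostname" A | grep -Fxq "$expected_ip"
}

# Print the expiry date of the certificate served on port 443.
tls_cert_end_date() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null |
    openssl x509 -noout -enddate
}

# Usage:
#   dns_points_to unstract.example.com 203.0.113.10 || echo "DNS not updated yet"
#   tls_cert_end_date unstract.example.com
```

DNS changes take effect only after the old record's TTL expires, so a failed check shortly after the update does not necessarily indicate a misconfiguration.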

5. Post-Recovery Validation

System Health Checks

Ensure all health checks at pod level (and at ingress level, if configured) are passing.

kubectl get pods -n unstract

All pods should be in Running state with zero restarts.
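
To avoid scanning the pod list by eye, the output can be filtered for pods that break either condition. This is a minimal sketch that assumes the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE); the function name is illustrative.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print pods that
# are not Running or have a non-zero restart count. Exits 1 if any are found.
unhealthy_pods() {
  awk '$3 != "Running" || $4 + 0 != 0 { print $1, $3, $4; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}

# Usage:
#   kubectl get pods -n unstract --no-headers | unhealthy_pods
#   kubectl get pods -n llmwhisperer --no-headers | unhealthy_pods
```

Pods belonging to finished Jobs show a Completed status and are listed by this filter, so review those entries rather than treating them as failures.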

Functional Validation

Unstract Platform:
- Access web interface
- Login
- Create test workflow / View existing workflows
- Upload and process test document
- Verify API deployments
LLMWhisperer:
- Access dashboard
- Test extract endpoint
- Verify usage tracking
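
Part of this checklist, such as confirming that the web interface and dashboard respond, can be scripted as a basic HTTP check. The sketch below assumes curl is available. The function name and URLs are placeholders, since this document does not list specific health endpoints.

```shell
# Print the HTTP status code for URL and succeed only on a 2xx or 3xx response.
http_ok() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1") || code=000
  echo "$code"
  case "$code" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (placeholder URLs):
#   http_ok https://unstract.example.com/
#   http_ok https://llmwhisperer.example.com/
```

A passing status code shows only that the ingress and frontend respond; login, document processing, and API deployments still need the manual steps listed above.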

Known Issues Post-Recovery

Workflow executions that were in flight when the last snapshot was taken may still show as in progress in the database; retrigger them. Executions started after the snapshot are lost.
LLMWhisperer in-flight requests are lost and must be retried
Usage data may be inconsistent with vendor records
Configuration changes to adapters, Prompt Studio, workflows, etc. made since the last backup (up to 24 hours) may be lost

6. Testing Guidelines

Testing Schedule

Frequency: Semi-annually
Type: Table-top exercise or partial test

Test Procedure

Document review and update
Verify backup availability
Test Kubernetes resource restoration
Validate configuration files
Deploy to test namespace (optional)

Success Criteria

All pods running without crashes
API endpoints responding
Test document processed successfully
No critical errors in logs
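
The log criterion can be given a first-pass check by counting suspicious lines in recent pod logs. This is a sketch only: the search pattern is an assumption about what counts as a critical error, the function name is hypothetical, and neither is defined by the applications.

```shell
# Count log lines on stdin that contain "critical" or "fatal" (case-insensitive).
# The pattern is an assumption; adjust it to the log format in use.
count_critical_lines() {
  grep -Eci 'critical|fatal' || true
}

# Usage:
#   kubectl logs [POD_NAME] -n [NAMESPACE] --since=1h | count_critical_lines
```

A count of 0 is a useful signal, but it does not replace reading the logs of any pod that restarted during the test.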

Appendix

A. Quick Reference Commands

# Check deployment status
helm list -n unstract
helm list -n llmwhisperer

# View current configuration
helm get values unstract-platform -n unstract > current-unstract-values.yaml
helm get values whisperer -n llmwhisperer > current-llmwhisperer-values.yaml

# Debug pod issues
kubectl describe pod [POD_NAME] -n [NAMESPACE]
kubectl logs [POD_NAME] -n [NAMESPACE] --previous

B. Troubleshooting

Image Pull Errors
- Verify gcr-artifact-secret exists in namespace
- Check artifact-key.json validity
- Ensure Helm registry login succeeded
Database Connection Failed
- Verify database endpoints in secret.yaml
- Check network connectivity
- Validate credentials
Storage Access Denied
- Verify IAM policies match production
- Check bucket names in values.yaml
- Validate credentials in secret.yaml
Pods in CrashLoopBackOff
- Check pod logs for specific errors
- Verify all secrets are present
- Ensure resource limits are adequate
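
For the image pull case, the first check can be scripted as follows. This is a small sketch: the secret name comes from the list above, while the function name is illustrative.

```shell
# Report whether the image pull secret exists in the given namespace.
check_pull_secret() {
  namespace=$1
  if kubectl get secret gcr-artifact-secret -n "$namespace" >/dev/null 2>&1; then
    echo "gcr-artifact-secret present in $namespace"
  else
    echo "gcr-artifact-secret missing in $namespace"
    return 1
  fi
}

# Usage:
#   check_pull_secret unstract
#   check_pull_secret llmwhisperer
```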

C. Contact Information

Support Type	Contact	When to Use
Unstract Support	support@unstract.com	Application-specific issues
Unstract Docs	On-prem Edition Docs	Unstract setup reference
LLMWhisperer Docs	On-prem Edition Docs	LLMWhisperer setup reference
