High Availability
Deploy DB Audit with enterprise-grade availability. Redundant collectors, automatic failover, and disaster recovery ensure you never miss a security event.
Active-Active Collectors
Deploy multiple collectors for the same databases. Events are deduplicated automatically.
Automatic Failover
If a collector fails, others continue capturing events with no data loss.
Local Event Cache
Events are cached locally during network outages and synced when connectivity resumes.
Health Monitoring
Built-in health endpoints and metrics for integration with monitoring systems.
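One way to picture the deduplication step: each audit event carries a stable identifier, and downstream processing keeps only the first copy it sees, regardless of which collector delivered it. A minimal Python sketch of that idea (the `event_id` field and merge logic are illustrative assumptions, not the collector's actual mechanism):

```python
# Illustrative dedup across active-active collectors: overlapping
# streams are merged, keeping one copy per event ID.
from typing import Iterable, List


def deduplicate(streams: Iterable[Iterable[dict]]) -> List[dict]:
    """Merge event streams from multiple collectors, one copy per ID."""
    seen = set()
    merged = []
    for stream in streams:
        for event in stream:
            if event["event_id"] not in seen:
                seen.add(event["event_id"])
                merged.append(event)
    return merged


# Two collectors observed overlapping events for the same database.
a = [{"event_id": "e1", "sql": "SELECT 1"}, {"event_id": "e2", "sql": "DELETE"}]
b = [{"event_id": "e2", "sql": "DELETE"}, {"event_id": "e3", "sql": "UPDATE"}]
print([e["event_id"] for e in deduplicate([a, b])])  # → ['e1', 'e2', 'e3']
```

Because dedup is keyed on identity rather than arrival order, any number of redundant collectors can report the same event without double-counting it.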
Kubernetes HA Deployment
Deploy multiple collector replicas with anti-affinity rules to spread across availability zones. This ensures continued operation even if an entire zone fails.
# High-availability collector deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dbaudit-collector
  labels:
    app: dbaudit-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dbaudit-collector
  template:
    metadata:
      labels:
        app: dbaudit-collector
    spec:
      affinity:
        # Spread across availability zones
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: dbaudit-collector
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: collector
          image: dbaudit/collector:latest
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 2000m
              memory: 2Gi
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          volumeMounts:
            - name: cache
              mountPath: /var/lib/dbaudit/cache
            - name: config
              mountPath: /etc/dbaudit
      volumes:
        - name: cache
          persistentVolumeClaim:
            claimName: dbaudit-cache
        - name: config
          secret:
            secretName: dbaudit-config
Local Event Cache
The collector maintains a local cache of events to ensure zero data loss during network outages or API unavailability. Events are automatically synced when connectivity resumes.
Network Resilience
Events cached locally during API outages are synced when connectivity returns.
Configurable Size
Set maximum cache size and retention to match your disk capacity.
Encrypted Storage
Cached events are encrypted at rest using AES-256-GCM.
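The size and retention limits amount to a bounded, oldest-first eviction queue: when the size cap or age window is exceeded, the oldest cached events are dropped first. An illustrative Python sketch of that policy (a simplification for intuition, not the collector's implementation):

```python
# Simplified oldest-first cache eviction, mirroring the max_size /
# max_age semantics. Class and field names are assumptions.
import time
from collections import deque


class EventCache:
    def __init__(self, max_bytes, max_age_seconds):
        self.max_bytes = max_bytes
        self.max_age = max_age_seconds
        self.events = deque()  # entries: (timestamp, size, payload)
        self.size = 0

    def add(self, payload, now=None):
        now = time.time() if now is None else now
        self.events.append((now, len(payload), payload))
        self.size += len(payload)
        self._evict(now)

    def _evict(self, now):
        # Drop oldest events until within the size cap and age window.
        while self.events and (
            self.size > self.max_bytes
            or now - self.events[0][0] > self.max_age
        ):
            _, sz, _ = self.events.popleft()
            self.size -= sz


cache = EventCache(max_bytes=100, max_age_seconds=3600)
cache.add(b"x" * 60, now=0)
cache.add(b"y" * 60, now=1)  # exceeds 100 bytes, so the oldest entry is dropped
print(len(cache.events), cache.size)  # → 1 60
```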
# config.yaml - Local cache for resilience
collector:
  cache:
    enabled: true
    path: /var/lib/dbaudit/cache
    # Maximum cache size before oldest events are dropped
    max_size: 10GB
    # How long to retain cached events
    max_age: 24h
    # Sync behavior during network recovery
    sync:
      # Events per second to replay after reconnection
      rate_limit: 10000
      # Prioritize recent events during sync
      order: newest_first
    # Retry configuration for API failures
    retry:
      max_attempts: 10
      initial_delay: 1s
      max_delay: 5m
      backoff_multiplier: 2
Health Monitoring
The collector exposes health endpoints for integration with Kubernetes, load balancers, and monitoring systems.
# Health check endpoints
GET /health/live
# Returns 200 if the collector process is running
# Use for Kubernetes liveness probe
GET /health/ready
# Returns 200 if the collector can accept traffic
# Checks: database connectivity, API connectivity, cache availability
# Use for Kubernetes readiness probe
GET /health/detailed
# Returns detailed health status
{
  "status": "healthy",
  "version": "2.4.1",
  "uptime_seconds": 86400,
  "checks": {
    "api_connection": {
      "status": "healthy",
      "latency_ms": 45
    },
    "database_connections": {
      "status": "healthy",
      "connected": 5,
      "total": 5
    },
    "cache": {
      "status": "healthy",
      "size_bytes": 1073741824,
      "events_cached": 50000
    },
    "event_buffer": {
      "status": "healthy",
      "utilization_percent": 23
    }
  }
}
Prometheus Metrics
Export metrics in Prometheus format for integration with Grafana, Datadog, or other monitoring platforms.
# Prometheus metrics endpoint
GET /metrics
# Key metrics for monitoring
dbaudit_events_processed_total{database="prod",status="success"} 1542890
dbaudit_events_processed_total{database="prod",status="error"} 12
dbaudit_event_latency_seconds_bucket{le="0.001"} 1500000
dbaudit_event_latency_seconds_bucket{le="0.01"} 1542000
dbaudit_event_latency_seconds_bucket{le="0.1"} 1542890
dbaudit_cache_size_bytes 1073741824
dbaudit_cache_events_count 50000
dbaudit_cache_sync_pending 0
dbaudit_api_requests_total{endpoint="events",status="200"} 98234
dbaudit_api_requests_total{endpoint="events",status="429"} 12
dbaudit_api_latency_seconds_sum 4521.23
dbaudit_database_connections_active 5
dbaudit_database_connections_errors_total 3
Disaster Recovery
For mission-critical deployments, configure multi-region disaster recovery to ensure availability even during regional outages.
Active-Passive
RTO: < 15 minutes
RPO: < 1 minute
Primary region handles all traffic. DR region receives replicated data and can be activated on demand.
Advantages
- Lower cost
- Simple architecture
- Clear failover process
Considerations
- Manual failover required
- DR region resources may be stale
Active-Active
RTO: < 1 minute
RPO: < 10 seconds
Both regions actively process events. Global load balancing routes to nearest region.
Advantages
- Automatic failover
- Better latency globally
- No idle resources
Considerations
- Higher cost
- Complex data consistency
- Requires multi-region setup
Multi-Region Architecture
# Multi-region deployment architecture
┌─────────────────────────────────────────────────────────────┐
│ US-EAST-1 (Primary) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Collector 1 │ │ Collector 2 │ │ Collector 3 │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ DB Audit Cloud US │ ◄─────┐ │
│ └───────────────────────┘ │ │
└──────────────────────────────────────────────┼──────────────┘
│ Cross-region
│ replication
┌──────────────────────────────────────────────┼──────────────┐
│ EU-WEST-1 (DR) │ │
│ ┌───────────────────────┐ │ │
│ │ DB Audit Cloud EU │ ◄─────┘ │
│ └───────────────────────┘ │
│ ▲ │
│ ┌────────────────┼────────────────┐ │
│ ┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐ │
│ │ Collector 4 │ │ Collector 5 │ │ Collector 6 │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
HA Monitoring Alerts
Configure alerts to detect issues with your HA deployment before they impact security monitoring.
# Alerting for HA monitoring
alerts:
  - name: collector-down
    condition: absent(up{job="dbaudit-collector"})
    for: 5m
    severity: critical
    annotations:
      summary: "DB Audit collector is down"
      description: "Collector {{ $labels.instance }} has been down for 5 minutes"
  - name: high-cache-utilization
    condition: dbaudit_cache_size_bytes / dbaudit_cache_max_bytes > 0.9
    for: 10m
    severity: warning
    annotations:
      summary: "DB Audit cache nearly full"
      description: "Cache utilization is {{ $value | humanizePercentage }}"
  - name: event-lag
    condition: dbaudit_event_lag_seconds > 60
    for: 5m
    severity: warning
    annotations:
      summary: "DB Audit event processing lag"
      description: "Events are {{ $value }}s behind real-time"
  - name: api-errors
    condition: rate(dbaudit_api_requests_total{status=~"5.."}[5m]) > 0.1
    for: 5m
    severity: warning
    annotations:
      summary: "DB Audit API errors"
      description: "API error rate is {{ $value | humanizePercentage }}"
HA Best Practices
- Run three replicas: this allows for maintenance and unexpected failures while maintaining quorum.
- Use pod anti-affinity to ensure collectors run in different availability zones.
- Use PersistentVolumeClaims so cached events survive pod restarts.
- Set alerts for high cache usage to detect connectivity issues early.
- Perform chaos engineering exercises to validate that the HA configuration works as expected.
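One practice worth pairing with the replica and anti-affinity guidance, assuming the `app: dbaudit-collector` labels used in the Deployment earlier: a PodDisruptionBudget keeps a minimum number of collectors running during voluntary disruptions such as node drains and cluster upgrades. A sketch (the resource name is illustrative):

```yaml
# Keep at least two collectors running during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dbaudit-collector-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: dbaudit-collector
```

With `minAvailable: 2` against three replicas, Kubernetes will evict at most one collector at a time, so event capture continues throughout rolling maintenance.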