High Availability

Deploy DB Audit for high availability. Redundant collectors, automatic failover, and disaster recovery keep security events flowing through collector, network, and regional failures.

Active-Active Collectors

Deploy multiple collectors for the same databases. Events are deduplicated automatically.

Automatic Failover

If a collector fails, others continue capturing events with no data loss.

Local Event Cache

Events are cached locally during network outages and synced when connectivity resumes.

Health Monitoring

Built-in health endpoints and metrics for integration with monitoring systems.
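
The page does not say how deduplication is implemented. As an illustration only, one common approach is to fingerprint the fields that identify an event and keep the first copy seen. The field names below (`database`, `session_id`, `sequence`, `timestamp`) are hypothetical and not part of the DB Audit event schema.

```python
import hashlib
import json


def event_fingerprint(event: dict) -> str:
    """Stable hash of the fields that identify an event.

    The identity fields are hypothetical; substitute whatever uniquely
    identifies an event in your deployment.
    """
    identity = {k: event[k] for k in ("database", "session_id", "sequence", "timestamp")}
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(events):
    """Yield each distinct event once, keeping the first copy seen."""
    seen = set()
    for event in events:
        fingerprint = event_fingerprint(event)
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield event
```

Two collectors reporting the same statement produce the same fingerprint, so only one copy survives, regardless of which collector delivered it first.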

Kubernetes HA Deployment

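Deploy multiple collector replicas with anti-affinity rules to spread them across availability zones. The rule below is a soft preference (preferredDuringSchedulingIgnoredDuringExecution), so the scheduler spreads replicas across zones whenever it can. With replicas in separate zones, collection continues even if an entire zone fails.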

# High-availability collector deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dbaudit-collector
  labels:
    app: dbaudit-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dbaudit-collector
  template:
    metadata:
      labels:
        app: dbaudit-collector
    spec:
      affinity:
        # Spread across availability zones
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: dbaudit-collector
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: collector
          image: dbaudit/collector:latest  # pin a specific version tag in production
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 2000m
              memory: 2Gi
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          volumeMounts:
            - name: cache
              mountPath: /var/lib/dbaudit/cache
            - name: config
              mountPath: /etc/dbaudit
      volumes:
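        - name: cache
          # All replicas mount this one claim, so its access mode must allow
          # multi-node access (ReadWriteMany). For a separate volume per
          # replica, use a StatefulSet with volumeClaimTemplates instead.
          persistentVolumeClaim:
            claimName: dbaudit-cache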
        - name: config
          secret:
            secretName: dbaudit-config
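
Because the anti-affinity rule is a preference rather than a requirement, it can be worth checking where the replicas actually landed. The following is a minimal sketch, assuming you have already collected the zone of each running collector pod (for example from node labels); the function name and the zone names in the usage note are hypothetical.

```python
from collections import Counter


def survives_zone_loss(pod_zones: list[str], min_running: int = 1) -> bool:
    """Return True if at least `min_running` pods remain after losing any one zone.

    `pod_zones` holds one entry per running collector pod, naming its zone.
    """
    if not pod_zones:
        return False
    per_zone = Counter(pod_zones)
    total = len(pod_zones)
    # Losing a zone removes every pod scheduled there.
    return all(total - count >= min_running for count in per_zone.values())
```

For three replicas in zones `a`, `b`, and `c` the check passes; if the scheduler placed all three in zone `a`, it fails.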

Local Event Cache

The collector keeps a local cache of events so that nothing is lost during network outages or API unavailability, as long as the outage fits within the configured cache size and retention. Cached events are synced automatically when connectivity resumes.

Network Resilience

Events cached locally during API outages are synced when connectivity returns.

Configurable Size

Set maximum cache size and retention to match your disk capacity.

Encrypted Storage

Cached events are encrypted at rest using AES-256-GCM.
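
To choose a cache size, it helps to estimate how long an outage the cache can absorb before the oldest events are dropped. This is back-of-the-envelope arithmetic, not a DB Audit tool: the event rate and average event size are assumptions you would measure in your own environment, and the function name is hypothetical.

```python
def outage_coverage_seconds(
    max_size_bytes: int,
    max_age_seconds: int,
    events_per_second: float,
    avg_event_bytes: float,
) -> float:
    """Estimate how long an outage the cache can cover without dropping events.

    Coverage is limited by whichever is reached first: the cache filling up
    or the retention window expiring.
    """
    if events_per_second <= 0 or avg_event_bytes <= 0:
        raise ValueError("event rate and event size must be positive")
    seconds_until_full = max_size_bytes / (events_per_second * avg_event_bytes)
    return min(seconds_until_full, max_age_seconds)
```

For example, a 10 GiB cache with 24h retention, at an assumed 2,000 events per second of 1 KiB each, fills in about 5,243 seconds (roughly 87 minutes), so size is the limiting factor well before the retention window.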

# config.yaml - Local cache for resilience
collector:
  cache:
    enabled: true
    path: /var/lib/dbaudit/cache

    # Maximum cache size before oldest events are dropped
    max_size: 10GB

    # How long to retain cached events
    max_age: 24h

    # Sync behavior during network recovery
    sync:
      # Events per second to replay after reconnection
      rate_limit: 10000

      # Prioritize recent events during sync
      order: newest_first

  # Retry configuration for API failures
  retry:
    max_attempts: 10
    initial_delay: 1s
    max_delay: 5m
    backoff_multiplier: 2
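
The retry settings above describe exponential backoff. As a sketch of the schedule they imply, assuming the wait before each retry is the initial delay multiplied by the backoff multiplier once per previous failure and capped at the maximum delay (the collector's exact behavior, including any jitter, is not documented here):

```python
def backoff_delays(max_attempts: int, initial_delay: int, max_delay: int, multiplier: int) -> list[int]:
    """Return the wait, in seconds, between consecutive attempts.

    There is one wait between each pair of attempts, so the list has
    max_attempts - 1 entries.
    """
    delays = []
    delay = initial_delay
    for _ in range(max_attempts - 1):
        delays.append(min(delay, max_delay))
        delay *= multiplier
    return delays
```

With `max_attempts: 10`, `initial_delay: 1s`, `max_delay: 5m`, and `backoff_multiplier: 2`, the waits are 1, 2, 4, 8, 16, 32, 64, 128, and 256 seconds, about 8.5 minutes in total. The 5-minute cap only comes into play from the tenth wait onward.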

Health Monitoring

The collector exposes health endpoints for integration with Kubernetes, load balancers, and monitoring systems.

# Health check endpoints
GET /health/live
# Returns 200 if the collector process is running
# Use for Kubernetes liveness probe

GET /health/ready
# Returns 200 if the collector can accept traffic
# Checks: database connectivity, API connectivity, cache availability
# Use for Kubernetes readiness probe

GET /health/detailed
# Returns detailed health status
{
  "status": "healthy",
  "version": "2.4.1",
  "uptime_seconds": 86400,
  "checks": {
    "api_connection": {
      "status": "healthy",
      "latency_ms": 45
    },
    "database_connections": {
      "status": "healthy",
      "connected": 5,
      "total": 5
    },
    "cache": {
      "status": "healthy",
      "size_bytes": 1073741824,
      "events_cached": 50000
    },
    "event_buffer": {
      "status": "healthy",
      "utilization_percent": 23
    }
  }
}
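
A monitoring script can consume the `/health/detailed` response directly. The sketch below parses a response body shaped like the example above and lists the checks that need attention. Treating any status other than `"healthy"` as failing, and flagging `connected < total`, are assumptions about how you might want to interpret the payload.

```python
import json


def unhealthy_checks(payload: str) -> list[str]:
    """Return the names of checks that are not healthy, sorted alphabetically."""
    report = json.loads(payload)
    failing = []
    for name, check in report.get("checks", {}).items():
        if check.get("status") != "healthy":
            failing.append(name)
        elif "connected" in check and check["connected"] < check.get("total", 0):
            # Some databases are unreachable even though the check reports healthy.
            failing.append(name)
    return sorted(failing)
```

An empty list means every check passed; otherwise the returned names can be fed into an alert message or a dashboard annotation.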

Prometheus Metrics

Export metrics in Prometheus format for integration with Grafana, Datadog, or other monitoring platforms.

# Prometheus metrics endpoint
GET /metrics

# Key metrics for monitoring
dbaudit_events_processed_total{database="prod",status="success"} 1542890
dbaudit_events_processed_total{database="prod",status="error"} 12

dbaudit_event_latency_seconds_bucket{le="0.001"} 1500000
dbaudit_event_latency_seconds_bucket{le="0.01"} 1542000
dbaudit_event_latency_seconds_bucket{le="0.1"} 1542890

dbaudit_cache_size_bytes 1073741824
dbaudit_cache_events_count 50000
dbaudit_cache_sync_pending 0

dbaudit_api_requests_total{endpoint="events",status="200"} 98234
dbaudit_api_requests_total{endpoint="events",status="429"} 12
dbaudit_api_latency_seconds_sum 4521.23

dbaudit_database_connections_active 5
dbaudit_database_connections_errors_total 3
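
For quick checks without a full Prometheus server, the `/metrics` text can be parsed directly. The following is a deliberately simplified parser, not a complete implementation of the exposition format: it skips comment lines and ignores samples with timestamps or escaped quotes in label values. It then computes the share of events that ended in an error.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str) -> list[tuple[str, dict, float]]:
    """Parse metric lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def error_ratio(samples, metric: str = "dbaudit_events_processed_total") -> float:
    """Fraction of samples for `metric` whose status label is "error"."""
    total = sum(value for name, _, value in samples if name == metric)
    errors = sum(
        value for name, labels, value in samples
        if name == metric and labels.get("status") == "error"
    )
    return errors / total if total else 0.0
```

Applied to the sample output above, the ratio is 12 errors out of 1,542,902 processed events, or roughly 0.0008%.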

Disaster Recovery

For mission-critical deployments, configure multi-region disaster recovery to ensure availability even during regional outages.

Active-Passive

RTO: < 15 minutes
RPO: < 1 minute

Primary region handles all traffic. DR region receives replicated data and can be activated on demand.

Advantages

  • Lower cost
  • Simple architecture
  • Clear failover process

Considerations

  • Manual failover required
  • DR region resources may be stale

Active-Active

RTO: < 1 minute
RPO: < 10 seconds

Both regions actively process events. Global load balancing routes traffic to the nearest region.

Advantages

  • Automatic failover
  • Better latency globally
  • No idle resources

Considerations

  • Higher cost
  • Complex data consistency
  • Requires multi-region setup
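
The choice between the two modes usually comes down to the recovery targets your organization has committed to. As a small illustration, the sketch below encodes the RTO and RPO bounds quoted above and returns the modes that satisfy a given requirement. The class and function names are hypothetical, and cost and operational complexity are left out of the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrMode:
    name: str
    rto_seconds: int  # upper bound on time to restore service
    rpo_seconds: int  # upper bound on data-loss window


# Bounds taken from the figures quoted for each mode.
MODES = [
    DrMode("active-passive", rto_seconds=15 * 60, rpo_seconds=60),
    DrMode("active-active", rto_seconds=60, rpo_seconds=10),
]


def modes_meeting(required_rto_seconds: int, required_rpo_seconds: int) -> list[str]:
    """Return the names of DR modes whose bounds fit within the requirements."""
    return [
        mode.name
        for mode in MODES
        if mode.rto_seconds <= required_rto_seconds
        and mode.rpo_seconds <= required_rpo_seconds
    ]
```

A requirement of a 5-minute RTO and a 30-second RPO rules out active-passive and leaves only active-active; a requirement tighter than either mode's bounds returns an empty list.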

Multi-Region Architecture

# Multi-region deployment architecture
┌─────────────────────────────────────────────────────────────┐
│                     US-EAST-1 (Primary)                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ Collector 1 │  │ Collector 2 │  │ Collector 3 │         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         └────────────────┼────────────────┘                 │
│                          ▼                                  │
│              ┌───────────────────────┐                      │
│              │   DB Audit Cloud US   │ ◄─────┐              │
│              └───────────────────────┘       │              │
└──────────────────────────────────────────────┼──────────────┘
                                               │ Cross-region
                                               │ replication
┌──────────────────────────────────────────────┼──────────────┐
│                     EU-WEST-1 (DR)           │              │
│              ┌───────────────────────┐       │              │
│              │   DB Audit Cloud EU   │ ◄─────┘              │
│              └───────────────────────┘                      │
│                          ▲                                  │
│         ┌────────────────┼────────────────┐                 │
│  ┌──────┴──────┐  ┌──────┴──────┐  ┌──────┴──────┐         │
│  │ Collector 4 │  │ Collector 5 │  │ Collector 6 │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
└─────────────────────────────────────────────────────────────┘

HA Monitoring Alerts

Configure alerts to detect issues with your HA deployment before they impact security monitoring.

# Alerting for HA monitoring
alerts:
  - name: collector-down
    condition: up{job="dbaudit-collector"} == 0
    for: 5m
    severity: critical
    annotations:
      summary: "DB Audit collector is down"
      description: "Collector {{ $labels.instance }} has been down for 5 minutes"

  - name: high-cache-utilization
    condition: dbaudit_cache_size_bytes / dbaudit_cache_max_bytes > 0.9
    for: 10m
    severity: warning
    annotations:
      summary: "DB Audit cache nearly full"
      description: "Cache utilization is {{ $value | humanizePercentage }}"

  - name: event-lag
    condition: dbaudit_event_lag_seconds > 60
    for: 5m
    severity: warning
    annotations:
      summary: "DB Audit event processing lag"
      description: "Events are {{ $value }}s behind real-time"

  - name: api-errors
    condition: sum(rate(dbaudit_api_requests_total{status=~"5.."}[5m])) / sum(rate(dbaudit_api_requests_total[5m])) > 0.1
    for: 5m
    severity: warning
    annotations:
      summary: "DB Audit API errors"
      description: "API error rate is {{ $value | humanizePercentage }}"
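
Each rule combines a condition with a `for` duration. Assuming `for` follows Prometheus-style semantics, where an alert fires only after its condition has held continuously for that long, the behavior can be sketched as follows. The function name and the sample format are hypothetical.

```python
def is_firing(samples: list[tuple[int, float]], threshold: float, for_seconds: int) -> bool:
    """Decide whether an alert fires at the time of the last sample.

    `samples` is a list of (unix_time, value) pairs in ascending time order.
    The alert fires when every sample since some point at least `for_seconds`
    ago has exceeded `threshold`.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any sample at or below the threshold resets the pending period.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds
```

For the `event-lag` rule, a lag above 60 seconds sustained across five minutes of samples fires the alert, while a single dip below the threshold in the middle restarts the clock.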

HA Best Practices

1. Deploy at least 3 collector replicas

Three replicas let you take one down for maintenance and still tolerate an unexpected failure without interrupting event capture.

2. Spread across availability zones

Use pod anti-affinity to ensure collectors run in different AZs.

3. Use persistent volumes for cache

Ensure cached events survive pod restarts by using PersistentVolumeClaims.

4. Monitor cache utilization

Set alerts for high cache usage to detect connectivity issues early.

5. Test failover regularly

Run chaos engineering exercises to validate that the HA configuration works as expected.