Automatic Data Classification

Discover and classify sensitive data across all your databases. Identify PII, financial data, health records, and credentials automatically with intelligent pattern matching and machine learning.

Know What Data You Have

You can't protect what you don't know about. DB Audit's classification engine automatically scans your databases to discover and categorize sensitive data, helping you understand your data landscape and meet compliance requirements.

Built-in data patterns: 200+
Classification accuracy: 99.2%
Time to full scan: Minutes

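What We Detect

The built-in classifiers cover four categories: PII, financial data, health records (PHI), and credentials.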

Classification Levels

Data is automatically assigned to sensitivity levels based on its content and context. Use these levels to enforce appropriate access controls and audit requirements.

Public

Data that can be freely shared without risk

Internal

Business data for internal use only

Confidential

Sensitive data requiring access controls

Restricted

Highly sensitive data with strict access limits
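
One way to work with these levels in code is as an ordered enumeration, so that a check like "at least Confidential" becomes a plain comparison. The sketch below is illustrative only: the `Sensitivity` name, the numeric ordering, and the `strictest` helper are assumptions, not part of DB Audit.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical ordering of the four levels, least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def strictest(levels):
    """Return the most sensitive level in a collection (PUBLIC if empty)."""
    return max(levels, default=Sensitivity.PUBLIC)


# A table is only as open as its most sensitive column.
table_level = strictest([Sensitivity.INTERNAL, Sensitivity.RESTRICTED])
print(table_level.name)  # RESTRICTED
```

Because the levels are ordered, a policy can be written against a threshold (for example, anything at or above `CONFIDENTIAL`) rather than a list of names.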

How Classification Works

1. Schema Discovery

DB Audit connects to your databases and catalogs all tables, columns, and their data types. This metadata is used to prioritize scanning.
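
To make the idea of a schema catalog concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not how DB Audit itself connects to databases; the `catalog_columns` helper and the in-memory example table are assumptions chosen so the snippet runs anywhere.

```python
import sqlite3


def catalog_columns(conn):
    """Return {table: [(column, declared_type), ...]} for a SQLite database."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        quoted = table.replace('"', '""')
        rows = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        catalog[table] = [(row[1], row[2]) for row in rows]
    return catalog


# Illustrative in-memory database with one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email VARCHAR, ssn VARCHAR)")
catalog = catalog_columns(conn)
print(catalog)
```

Metadata like this is cheap to collect, which is why it can be used to decide which columns to scan first (for example, text columns before numeric ones).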

2. Data Sampling

A representative sample of the data is analyzed according to a configurable sampling strategy. Full scans are available for smaller datasets or when compliance requirements call for them.
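
The available sampling strategies are not listed here, so the following is only one common option: reservoir sampling, which draws a uniform sample from a stream of rows without knowing the row count in advance. The `reservoir_sample` function and its parameters are hypothetical.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Uniformly sample up to k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Sample 10 "rows" from a stream of 1,000 without loading them all.
picked = reservoir_sample(range(1000), 10, seed=1)
print(len(picked))  # 10
```

When the input has fewer than `k` rows, the function returns all of them, which corresponds to a full scan of a small dataset.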

3. Pattern Matching

Over 200 built-in patterns detect common sensitive data formats. Custom patterns can be added for organization-specific data.
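
As a rough illustration of pattern matching over sampled values, the sketch below applies two of the regexes from the configuration example on this page and reports what fraction of a column's values each one matches. The `match_ratios` helper is an assumption; DB Audit's actual matching logic is not shown here.

```python
import re

# Regexes copied from the configuration example on this page.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
}


def match_ratios(values):
    """Return the fraction of non-empty values matched by each pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return {name: 0.0 for name in PATTERNS}
    return {
        name: sum(1 for v in non_empty if rx.search(v)) / len(non_empty)
        for name, rx in PATTERNS.items()
    }


# Two of the three non-empty values look like email addresses.
ratios = match_ratios(["a@example.com", "b@example.org", "n/a", ""])
print(ratios)
```

Reporting a ratio rather than a yes/no answer lets a later step decide how many matching values are enough to label the whole column.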

4. ML Enhancement

Machine learning models improve classification accuracy by understanding context and detecting patterns that regex alone would miss.

5. Classification Assignment

Each column is assigned a data category and sensitivity level. Results are stored and used to automatically apply protection policies.
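
A minimal sketch of the assignment step might look like the following. Everything in it is an assumption made for illustration: the `PATTERN_LABELS` table (which mirrors the example configuration), the 0.8 threshold, and the fallback to Public for columns with no confident match.

```python
from dataclasses import dataclass

# Hypothetical mapping from pattern name to (category, level), mirroring
# the example configuration: ssn -> restricted, email -> confidential.
PATTERN_LABELS = {
    "ssn": ("PII", "Restricted"),
    "email": ("PII", "Confidential"),
    "credit_card": ("Financial", "Restricted"),
}


@dataclass(frozen=True)
class ColumnClassification:
    table: str
    column: str
    category: str
    level: str


def assign(table, column, ratios, threshold=0.8):
    """Label a column by its best-matching pattern, if it clears the threshold."""
    best = max(ratios, key=ratios.get, default=None)
    if best is None or ratios[best] < threshold or best not in PATTERN_LABELS:
        # Assumption: columns with no confident match fall back to Public.
        return ColumnClassification(table, column, "Unclassified", "Public")
    category, level = PATTERN_LABELS[best]
    return ColumnClassification(table, column, category, level)


result = assign("customers", "ssn", {"ssn": 0.97, "email": 0.0})
print(result.category, result.level)  # PII Restricted
```

Storing the result as a small immutable record makes it straightforward to persist and to look up later when protection policies are applied.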

Configuration

Configure classification rules, custom patterns, and scanning schedules through YAML configuration or the dashboard.

# Data Classification Configuration
classification:
  enabled: true
  scan_schedule: "0 2 * * *"  # Daily at 2 AM

  # Built-in classifiers
  classifiers:
    - type: pii
      enabled: true
      sensitivity: high
      patterns:
        - name: ssn
          regex: '\b\d{3}-\d{2}-\d{4}\b'
          classification: restricted
        - name: email
          regex: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
          classification: confidential

    - type: financial
      enabled: true
      sensitivity: high
      patterns:
        - name: credit_card
          regex: '\b(?:\d{4}[- ]?){3}\d{4}\b'
          classification: restricted
          validate: luhn  # Validate using Luhn algorithm

    - type: phi
      enabled: true
      sensitivity: high
      hipaa_compliant: true

    - type: credentials
      enabled: true
      sensitivity: critical
      alert_on_detection: true

  # Custom classifiers
  custom_classifiers:
    - name: internal_ids
      description: "Internal employee and project IDs"
      patterns:
        - regex: '\bEMP-\d{6}\b'  # Word boundaries avoid matching inside longer IDs
          classification: internal
        - regex: '\bPROJ-[A-Z]{3}-\d{4}\b'
          classification: confidential

  # Exclusions
  exclusions:
    tables:
      - audit_log
      - system_metadata
    columns:
      - created_at
      - updated_at

Scan Results

Run classification scans on demand or on a schedule. Results are available in the dashboard, through the CLI, and as exportable reports.

# Classification Scan Results
dbaudit classify scan --database production

Scanning database: production
Tables scanned: 47
Columns analyzed: 312
Rows sampled: 1,000,000

Classification Results:
=======================

Table: customers
  - email (VARCHAR)        -> PII (Confidential)
  - phone (VARCHAR)        -> PII (Confidential)
  - ssn (VARCHAR)          -> PII (Restricted) [!]
  - credit_card (VARCHAR)  -> Financial (Restricted) [!]

Table: employees
  - full_name (VARCHAR)    -> PII (Confidential)
  - salary (DECIMAL)       -> Financial (Confidential)
  - employee_id (VARCHAR)  -> Internal

Table: medical_records
  - diagnosis (TEXT)       -> PHI (Restricted) [!]
  - treatment (TEXT)       -> PHI (Restricted) [!]
  - insurance_id (VARCHAR) -> PHI (Confidential)

[!] Restricted data detected - Review recommended

Summary:
  Public: 156 columns
  Internal: 89 columns
  Confidential: 52 columns
  Restricted: 15 columns

Report saved to: ./classification-report-2024-01-15.json
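
For quick checks, the text output above can be tallied with a short script. This is a sketch under assumptions: it parses only the line format shown in the example output, and the `tally_levels` helper is hypothetical. The JSON report is likely a sturdier input for automation, but its schema is not shown here.

```python
import re
from collections import Counter

# Matches result lines such as "  - ssn (VARCHAR)  -> PII (Restricted) [!]"
# and "  - employee_id (VARCHAR)  -> Internal".
LINE = re.compile(
    r"^\s*-\s+(?P<column>\w+)\s+\((?P<type>\w+)\)\s+->\s+"
    r"(?P<category>\w+)(?:\s+\((?P<level>\w+)\))?"
)


def tally_levels(output):
    """Count classified columns per sensitivity level in scan output text."""
    counts = Counter()
    for line in output.splitlines():
        m = LINE.match(line)
        if m:
            # A bare label like "Internal" is treated as the level itself.
            counts[m.group("level") or m.group("category")] += 1
    return counts


sample = """\
Table: customers
  - email (VARCHAR)        -> PII (Confidential)
  - ssn (VARCHAR)          -> PII (Restricted) [!]
Table: employees
  - employee_id (VARCHAR)  -> Internal
"""
counts = tally_levels(sample)
print(dict(counts))
```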

Automatic Policy Integration

Classification results feed directly into security policies. When sensitive data is discovered, masking, access controls, and audit logging can be applied automatically.

Policy-Based Actions

Data Masking

Automatically mask classified data in query results based on user roles.

Access Controls

Restrict access to classified columns based on sensitivity levels.

Audit Logging

Enhanced logging of all access to classified data, to support compliance.

Alerting

Real-time alerts when classified data is accessed or exported.
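
To illustrate how a classification-driven policy could work, the sketch below masks values in a result row when the caller's role is not cleared for a column's sensitivity level. The role names, the `ROLE_CLEARANCE` table, and the keep-last-four masking style are all assumptions, not DB Audit's actual policy model.

```python
LEVEL_RANK = {"Public": 0, "Internal": 1, "Confidential": 2, "Restricted": 3}

# Hypothetical role clearances; a real deployment would define its own.
ROLE_CLEARANCE = {
    "analyst": "Internal",
    "support": "Confidential",
    "dba": "Restricted",
}


def mask_value(value, keep_last=4):
    """Replace all but the last few characters with asterisks."""
    text = str(value)
    if len(text) <= keep_last:
        return "*" * len(text)
    return "*" * (len(text) - keep_last) + text[-keep_last:]


def apply_masking(row, column_levels, role):
    """Mask every column whose level exceeds the role's clearance."""
    clearance = LEVEL_RANK[ROLE_CLEARANCE[role]]
    return {
        col: val
        if LEVEL_RANK[column_levels.get(col, "Public")] <= clearance
        else mask_value(val)
        for col, val in row.items()
    }


row = {"full_name": "Ada", "ssn": "123-45-6789"}
levels = {"full_name": "Confidential", "ssn": "Restricted"}
print(apply_masking(row, levels, "support"))
# {'full_name': 'Ada', 'ssn': '*******6789'}
```

The same clearance comparison could drive the other actions as well, such as denying a query outright or raising an alert instead of masking.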

Discover Your Sensitive Data

Start classifying your data today. Know exactly what sensitive information exists in your databases and protect it automatically.