Automatic Data Classification
Discover and classify sensitive data across all your databases. Identify PII, financial data, health records, and credentials automatically with intelligent pattern matching and machine learning.
Know What Data You Have
You can't protect what you don't know about. DB Audit's classification engine automatically scans your databases to discover and categorize sensitive data, helping you understand your data landscape and meet compliance requirements.
What We Detect
Personally Identifiable Information (PII)
Automatically detect names, addresses, phone numbers, email addresses, and other personal data that can identify individuals.
- Full names and aliases
- Home and work addresses
- Phone numbers and email addresses
- Date of birth and age
Financial Data
Identify credit card numbers, bank accounts, transaction data, and other sensitive financial information.
- Credit card numbers (PCI DSS)
- Bank account and routing numbers
- Transaction amounts and history
- Tax identification numbers
Protected Health Information (PHI)
Detect medical records, diagnoses, treatment plans, and other HIPAA-protected health data.
- Medical record numbers
- Diagnoses and conditions
- Prescription information
- Insurance policy details
Authentication Credentials
Find passwords, API keys, tokens, and other authentication secrets that should never be stored in plain text.
- Passwords and password hashes
- API keys and tokens
- SSH keys and certificates
- OAuth secrets
Classification Levels
Data is automatically assigned to sensitivity levels based on its content and context. Use these levels to enforce appropriate access controls and audit requirements.
- Public: Data that can be freely shared without risk
- Internal: Business data for internal use only
- Confidential: Sensitive data requiring access controls
- Restricted: Highly sensitive data with strict access limits
How Classification Works
Schema Discovery
DB Audit connects to your databases and catalogs all tables, columns, and their data types. This metadata is used to prioritize scanning.
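As a rough illustration of the cataloging step, the sketch below lists tables, columns, and declared types from an in-memory SQLite database using Python's standard `sqlite3` module. The `catalog_schema` helper and the sample tables are hypothetical and are not DB Audit's implementation. Other database engines expose this metadata differently, for example through `information_schema`.

```python
import sqlite3


def catalog_schema(conn: sqlite3.Connection) -> dict[str, list[tuple[str, str]]]:
    """Return {table_name: [(column_name, declared_type), ...]} for user tables."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Quote the identifier so unusual table names do not break the PRAGMA.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog


# Hypothetical sample schema, used only to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email VARCHAR(255), ssn VARCHAR(11))")
conn.execute("CREATE TABLE audit_log (id INTEGER, created_at TEXT)")

schema = catalog_schema(conn)
for table, columns in schema.items():
    print(table, columns)
```

A catalog like this is enough to decide which columns to scan first, for instance text columns before numeric ones.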
Data Sampling
A representative sample of data is analyzed using configurable sampling strategies. Full scans are available for smaller datasets or where compliance requirements call for them.
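One common way to draw a representative sample from a large table in a single pass is reservoir sampling. The sketch below is an assumption about what such a strategy could look like, not a description of the strategies DB Audit ships. The `reservoir_sample` function and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Uniformly sample up to k items from a stream in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Stand-in for a column read row by row from a database cursor.
values = (f"user{n}@example.com" for n in range(10_000))
sample = reservoir_sample(values, k=100, seed=42)
print(len(sample))
```

Because the function never holds more than `k` rows in memory, the same code works whether the column has a thousand rows or a billion.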
Pattern Matching
Over 200 built-in patterns detect common sensitive data formats. Custom patterns can be added for organization-specific data.
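To make the idea concrete, here is a minimal sketch of pattern matching with a checksum step, using the SSN and credit card regular expressions from the Configuration example and a standard Luhn check. The function names are hypothetical, and DB Audit's own matcher may work differently.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_cards(text: str) -> list[str]:
    """Return card-like substrings that also pass the Luhn check."""
    return [m.group() for m in CARD.finditer(text) if luhn_valid(m.group())]


text = "paid with 4111 1111 1111 1111, ref 1234 5678 9012 3456"
print(find_cards(text))
print(bool(SSN.search("SSN on file: 123-45-6789")))
```

The checksum step matters because many 16-digit strings (order numbers, reference IDs) look like card numbers to a regex alone. In the example, the second number matches the pattern but fails the Luhn check and is dropped.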
ML Enhancement
Machine learning models improve classification accuracy by understanding context and detecting patterns that regex alone would miss.
Classification Assignment
Each column is assigned a data category and sensitivity level. Results are stored and used to automatically apply protection policies.
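The assignment step can be pictured as picking the most sensitive rule that matches enough of a column's sampled values. The sketch below assumes the four levels are ordered Public < Internal < Confidential < Restricted and uses a hypothetical `min_match_ratio` threshold. The regular expressions are taken from the Configuration example; everything else is illustrative.

```python
import re
from enum import IntEnum


class Level(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# (category, level, pattern) rules; a tiny hypothetical subset.
RULES = [
    ("PII", Level.RESTRICTED, re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PII", Level.CONFIDENTIAL,
     re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")),
    ("Internal", Level.INTERNAL, re.compile(r"EMP-\d{6}")),
]


def classify_column(samples, min_match_ratio=0.5):
    """Return (category, Level) for the most sensitive rule matching enough samples."""
    best = (None, Level.PUBLIC)
    if not samples:
        return best
    for category, level, pattern in RULES:
        hits = sum(1 for value in samples if pattern.search(value))
        # A rule applies only if enough sampled values match it.
        if hits / len(samples) >= min_match_ratio and level > best[1]:
            best = (category, level)
    return best


print(classify_column(["123-45-6789", "987-65-4321"]))
print(classify_column(["alice@example.com", "n/a", "bob@example.org"]))
print(classify_column(["hello", "world"]))
```

Requiring a minimum match ratio keeps a single stray value from pushing an otherwise harmless column into a restricted level. A stricter deployment might lower the threshold for the most sensitive patterns.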
Configuration
Configure classification rules, custom patterns, and scanning schedules through YAML configuration or the dashboard.
# Data Classification Configuration
classification:
  enabled: true
  scan_schedule: "0 2 * * *"  # Daily at 2 AM

  # Built-in classifiers
  classifiers:
    - type: pii
      enabled: true
      sensitivity: high
      patterns:
        - name: ssn
          regex: '\b\d{3}-\d{2}-\d{4}\b'
          classification: restricted
        - name: email
          regex: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
          classification: confidential
    - type: financial
      enabled: true
      sensitivity: high
      patterns:
        - name: credit_card
          regex: '\b(?:\d{4}[- ]?){3}\d{4}\b'
          classification: restricted
          validate: luhn  # Validate using Luhn algorithm
    - type: phi
      enabled: true
      sensitivity: high
      hipaa_compliant: true
    - type: credentials
      enabled: true
      sensitivity: critical
      alert_on_detection: true

  # Custom classifiers
  custom_classifiers:
    - name: internal_ids
      description: "Internal employee and project IDs"
      patterns:
        - regex: 'EMP-\d{6}'
          classification: internal
        - regex: 'PROJ-[A-Z]{3}-\d{4}'
          classification: confidential

  # Exclusions
  exclusions:
    tables:
      - audit_log
      - system_metadata
    columns:
      - created_at
      - updated_at

Scan Results
Run classification scans on-demand or on a schedule. Results are available in the dashboard, CLI, and as exportable reports.
# Classification Scan Results
dbaudit classify scan --database production

Scanning database: production
Tables scanned: 47
Columns analyzed: 312
Rows sampled: 1,000,000

Classification Results:
=======================
Table: customers
  - email (VARCHAR) -> PII (Confidential)
  - phone (VARCHAR) -> PII (Confidential)
  - ssn (VARCHAR) -> PII (Restricted) [!]
  - credit_card (VARCHAR) -> Financial (Restricted) [!]
Table: employees
  - full_name (VARCHAR) -> PII (Confidential)
  - salary (DECIMAL) -> Financial (Confidential)
  - employee_id (VARCHAR) -> Internal
Table: medical_records
  - diagnosis (TEXT) -> PHI (Restricted) [!]
  - treatment (TEXT) -> PHI (Restricted) [!]
  - insurance_id (VARCHAR) -> PHI (Confidential)

[!] Restricted data detected - Review recommended

Summary:
  Public: 156 columns
  Internal: 89 columns
  Confidential: 52 columns
  Restricted: 15 columns

Report saved to: ./classification-report-2024-01-15.json

Classification results automatically integrate with security policies. When sensitive data is discovered, you can automatically apply masking, access controls, and audit logging.
Policy-Based Actions
- Automatically mask classified data in query results based on user roles.
- Restrict access to classified columns based on sensitivity levels.
- Apply enhanced logging to all access to classified data for compliance.
- Send real-time alerts when classified data is accessed or exported.
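To show how these actions could hang off a classification level, here is a minimal sketch of a policy lookup that decides what a given role sees and which actions fire. The policy table, role names, and the `read_column` and `mask_value` helpers are all hypothetical; DB Audit's actual policy engine and configuration are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    mask: bool            # mask the value for roles without full access
    audit: bool           # write an audit log entry on every read
    alert: bool           # raise a real-time alert on every read
    allowed_roles: frozenset


# Hypothetical policy table keyed by sensitivity level.
ALL_ROLES = frozenset({"analyst", "support", "admin"})
POLICIES = {
    "public": Policy(False, False, False, ALL_ROLES),
    "internal": Policy(False, False, False, ALL_ROLES),
    "confidential": Policy(True, True, False, frozenset({"support", "admin"})),
    "restricted": Policy(True, True, True, frozenset({"admin"})),
}


def mask_value(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def read_column(value: str, level: str, role: str) -> tuple[str, list[str]]:
    """Return the value as the role should see it, plus the actions triggered."""
    policy = POLICIES[level]
    actions = []
    if policy.audit:
        actions.append("audit_log")
    if policy.alert:
        actions.append("alert")
    if role in policy.allowed_roles:
        return value, actions
    if policy.mask:
        actions.append("mask")
        return mask_value(value), actions
    # No masking rule and no access: return nothing.
    actions.append("deny")
    return "", actions


print(read_column("123-45-6789", "restricted", "analyst"))
print(read_column("123-45-6789", "restricted", "admin"))
```

Keying the table on the sensitivity level rather than on individual columns means a newly classified column picks up the right behavior without any per-column setup.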
Next Steps
Discover Your Sensitive Data
Start classifying your data today. Know exactly what sensitive information exists in your databases and protect it automatically.