Data Profiling

Data Profiling computes column-level statistics for your data products, giving you visibility into data quality, distribution, and completeness directly within the catalog.

What is Data Profiling?

When profiling runs against a product, Qarion connects to the underlying data source, samples rows, and computes statistics for each column. The results are displayed in the product's Profile tab, providing an at-a-glance view of your data's shape and health.

Viewing Profiles

  1. Open any data product in the catalog
  2. Navigate to the Profile tab in the product detail view
  3. Each column shows its computed statistics

Statistics Computed

| Metric | Description | Profiling Depth |
| --- | --- | --- |
| Null Ratio | Percentage of null values | Quick, Full |
| Distinct Count | Number of unique values (cardinality) | Quick, Full |
| Row Count | Number of rows sampled | Quick, Full |
| Min / Max | Minimum and maximum values (numeric and date columns) | Full |
| Mean | Average value (numeric columns only) | Full |
| Top Values | Most frequent values with their counts | Full |
| Histogram | Value distribution (numeric) or category frequencies | Full |

Profiling Depth

You can choose between two profiling levels:

  • Quick: Computes only null ratio, distinct count (cardinality), and row count. Fast and lightweight, suitable for routine monitoring.
  • Full: Computes all statistics, adding min/max, mean, top values, and histograms. Takes longer but gives a complete picture of each column.
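
The split between the two depths can be illustrated with a minimal sketch. This is not Qarion's implementation: the function name `profile_column`, the `depth` strings, and the `top_n` parameter are assumptions made for illustration. The null ratio is returned as a fraction between 0 and 1, and the histogram is left out to keep the example short.

```python
from collections import Counter
from datetime import date
from statistics import fmean


def profile_column(values, depth="quick", top_n=5):
    """Compute statistics for one column of sampled values (None means null).

    Hypothetical helper: names and return shape are illustrative only.
    """
    row_count = len(values)
    non_null = [v for v in values if v is not None]

    # Metrics available at both depths.
    stats = {
        "row_count": row_count,
        "null_ratio": (row_count - len(non_null)) / row_count if row_count else 0.0,
        "distinct_count": len(set(non_null)),
    }

    if depth == "full" and non_null:
        # Top values apply to any column type.
        stats["top_values"] = Counter(non_null).most_common(top_n)

        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
        )
        is_date = all(isinstance(v, date) for v in non_null)

        # Min/max for numeric and date columns, mean for numeric columns only.
        if is_numeric or is_date:
            stats["min"] = min(non_null)
            stats["max"] = max(non_null)
        if is_numeric:
            stats["mean"] = fmean(non_null)

    return stats
```

Calling `profile_column([1, 2, 2, None])` returns only the three quick metrics, while passing `depth="full"` adds `top_values`, `min`, `max`, and `mean` for the same sample.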

Connection Requirements

Profiling connects to the same database as the product's scraping connector. The connection configuration is built by merging:

  1. The connector's base connection_config (host, database, schema)
  2. The linked credential's connection_config overlay (port, SSL options)
  3. The credential's encrypted secret as the password

Important: A valid credential must be configured on the source system linked to the product's connector. If the credential is missing or expired, profiling will fail with a connection error.
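
The three-step merge described above can be sketched as a dictionary overlay. This is an illustration under stated assumptions, not the platform's code: the function name `build_connection_config`, the `password` key, and the rule that the credential overlay wins on conflicting keys are all hypothetical, and the secret is assumed to be already decrypted.

```python
def build_connection_config(connector_config, credential_overlay, secret):
    """Merge connector and credential settings into one connection config.

    Hypothetical helper; key names and precedence are assumptions.
    """
    if not secret:
        # Mirrors the documented behavior: no usable credential, no connection.
        raise ValueError("a valid credential secret is required for profiling")

    # Step 1 + 2: start from the connector's base config, then apply the
    # credential's overlay (assumed to take precedence on conflicts).
    merged = {**connector_config, **credential_overlay}

    # Step 3: the decrypted secret is used as the password.
    merged["password"] = secret
    return merged


config = build_connection_config(
    {"host": "db.example.internal", "database": "sales", "schema": "public"},
    {"port": 5432, "sslmode": "require"},
    "s3cret",
)
```

The inputs are left unmodified, so the same connector config can be combined with different credentials.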

Running Profiling

On-Demand

From the product's Profile tab, click Run Profile and select the desired depth. Profiling runs immediately against the live data source.

Automatic (Post-Sync)

Profiling can run automatically after a successful metadata sync. When a connector completes a sync, the platform triggers a quick profile for newly discovered or updated products.

Sample Size

Profiling samples a configurable number of rows (default: 10,000, maximum: 100,000) to balance accuracy with query cost and execution time.
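
As a rough sketch of how the documented bounds might be applied, the helper below resolves a requested sample size against the default and the maximum. The function name is hypothetical, and clamping oversized requests to the maximum (rather than rejecting them) is an assumption; the source only states the default and maximum values.

```python
DEFAULT_SAMPLE_SIZE = 10_000   # documented default
MAX_SAMPLE_SIZE = 100_000      # documented maximum


def resolve_sample_size(requested=None):
    """Return the number of rows to sample (hypothetical helper)."""
    if requested is None:
        return DEFAULT_SAMPLE_SIZE
    if requested < 1:
        raise ValueError("sample size must be a positive integer")
    # Assumption: requests above the maximum are capped, not rejected.
    return min(requested, MAX_SAMPLE_SIZE)
```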

PII-Safe Profiling

Columns flagged as is_pii = true in the catalog are automatically skipped during profiling. This ensures sensitive data is never sampled or stored in profile statistics.
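
The skip rule amounts to filtering the column list before any sampling happens. A minimal sketch, assuming column metadata is available as dictionaries with a `name` key and an optional `is_pii` flag (this shape and the function name are illustrative, not the catalog's actual schema):

```python
def profilable_columns(columns):
    """Return names of columns that may be profiled (hypothetical helper).

    Columns flagged is_pii are excluded before any data is read.
    """
    return [col["name"] for col in columns if not col.get("is_pii", False)]


columns = [
    {"name": "order_id", "is_pii": False},
    {"name": "email", "is_pii": True},
    {"name": "amount"},  # flag absent: treated as non-PII in this sketch
]
safe = profilable_columns(columns)
```

Treating a missing flag as non-PII is a choice made for this example; a stricter deployment might prefer to skip columns whose flag is unknown.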

Profile History

Each time profiling runs, a snapshot is archived. You can compare profiles over time to detect drift:

  • Growing null ratios may indicate upstream pipeline issues
  • Sudden cardinality changes may signal data quality problems
  • Distribution shifts may require investigation

Access historical profiles from the History link in the Profile tab for any specific column.
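
Comparing two snapshots for the first two drift signals listed above could look like the sketch below. The snapshot shape (`null_ratio` as a fraction, `distinct_count` as an integer), the function name, and both thresholds are assumptions chosen for illustration; suitable thresholds depend on the dataset.

```python
def detect_drift(previous, current, null_delta=0.05, cardinality_change=0.5):
    """Return the names of metrics that drifted between two snapshots.

    Hypothetical helper; thresholds are illustrative defaults.
    """
    drifted = []

    # Growing null ratio: absolute increase above the threshold.
    if current["null_ratio"] - previous["null_ratio"] > null_delta:
        drifted.append("null_ratio")

    # Sudden cardinality change: relative change in either direction.
    prev_card = previous["distinct_count"]
    if prev_card:
        relative = abs(current["distinct_count"] - prev_card) / prev_card
        if relative > cardinality_change:
            drifted.append("distinct_count")

    return drifted


last_week = {"null_ratio": 0.01, "distinct_count": 100}
today = {"null_ratio": 0.10, "distinct_count": 100}
flags = detect_drift(last_week, today)
```

Here `flags` contains only `null_ratio`, because the null ratio rose by nine percentage points while the cardinality stayed the same.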

Tips

  • Run full profiles periodically (e.g., weekly) and quick profiles after every sync
  • Use profiling results to inform quality check creation: columns with high null ratios are candidates for null-count checks
  • Review cardinality trends to detect schema drift early