Metadata Engine: connections, snapshots, tests, freshness¶
The Metadata Engine is Datahub's bridge to your physical data sources. Connect a Databricks workspace, an ADLS storage account, or a PostgreSQL database; the engine discovers schemas / tables / columns, snapshots them on a schedule, runs data tests and freshness checks, profiles tables for anomalies, and feeds the results into the Catalog so stewards see what's there now and what changed.
If the Catalog is what we know about our data, the Metadata Engine is how we know it.
When to choose this¶
Reach for the Metadata Engine when you want to:
- Connect a data source. Create a connection to ADLS Gen2 or PostgreSQL (Databricks SQL is configured separately under
/admin/integrations). Credentials live in Azure Key Vault, never in our database. - Discover a schema and import it into the Catalog. Run extraction, snapshot the result, promote selected schemas/tables to catalog assets in one click.
- Detect schema drift. Compare today's snapshot to last week's; the diff shows added / removed / renamed columns and type changes.
- Run data tests. Uniqueness, not-null, accepted values, custom SQL — schedule them, alert on failure.
- Monitor freshness. "The orders table should have new rows in the last 30 minutes." If not, raise an alert.
- Profile tables for anomalies. Row count, null %, distinct count, min/max — tracked across snapshots with anomaly detection.
- Schedule it all. A scheduler runs extraction, snapshots, tests, profiling, and freshness on cron schedules per connection.
You do not need the Metadata Engine for:
- Manually-registered catalog assets (you can register without a connection).
- Querying data (the Engine reads metadata, not row contents).
- Connections to BI tools or dashboards (use the Dashboards module).
What the Metadata Engine looks like¶
| Surface | Where | What you see |
|---|---|---|
| Overview | /metadata-engine |
KPIs (connections, snapshots, recent failures), trend chart of snapshots over the last 14 days, top-status connection cards. |
| Connections | /metadata-engine/connections |
Grid of connection cards — name, type, last test status, last snapshot, schedules. |
| Connection detail | /metadata-engine/connections/{id} |
Tabs: Overview, Extraction, Snapshots, Catalog, Schedules, Tests, Profiling, Freshness, Tasks. |
| Snapshots | /metadata-engine/snapshots |
List of all snapshots across connections; filter by connection, compare any two. |
| Snapshot detail | /metadata-engine/snapshots/{id} |
Full schema → table → column hierarchy with stats. |
| Data tests | /metadata-engine/data-tests |
All test definitions across connections; filter by status, latest run results. |
| Insights (profiling) | /metadata-engine/insights |
Profiling history and trend charts for any catalog table. |
| Freshness | /metadata-engine/freshness |
Per-table freshness rules and recent check results. |
| Scheduler | /metadata-engine/scheduler |
All schedules across connections; recent runs. |
| Task logs | /metadata-engine/tasks |
Background task history (extraction, snapshot, test, profiling, freshness runs). |
Concepts¶
| Concept | What it is |
|---|---|
| Connection | A registered data source (ADLS Gen2 or PostgreSQL) with type-specific config + a Key Vault secret reference. |
| Adapter | The connector for a connection type. ADLS uses a direct Delta reader (deltalake); PostgreSQL uses psycopg + dbt for tests/profiling. |
| Extraction | A live discovery of the source's schemas / tables / columns. Produces an in-memory snapshot. |
| Snapshot | A persisted point-in-time capture of metadata, linked to a connection. Used for diffs and history. |
| Catalog (per-connection) | A deduplicated view of every schema / table / column ever seen on the connection — soft-deletes columns when they disappear, retains history. |
| Data test | A test definition (uniqueness, not-null, accepted values, custom SQL) on a column or table. |
| Profiling run | A row-count + per-column null/distinct/min/max/mean snapshot, used for anomaly detection. |
| Freshness rule | "Column X must have a value within the last N minutes / hours / days." |
| Schedule | A cron expression + which tasks (extract, test, profile, freshness) to run, per connection. |
| JobRun | A single execution of a schedule. |
Setup — what an admin needs to do once¶
| Prereq | Where | Why |
|---|---|---|
| Azure Key Vault | Configured by the platform — credentials live there. | Connection secrets never touch the database. |
| At least one connection | /metadata-engine/connections → New connection |
The engine is empty until you connect a source. |
| Roles | /rolegroups |
metadata_engine.resources.read, metadata_engine.resources.write. Both default to Administrator. |
| Scheduler enabled (automatic) | The platform runs a periodic tick that picks up due schedules. | No admin action required. |
Creating a connection¶
ADLS Gen2:
/metadata-engine/connections→ New connection → ADLS.- Provide: name, storage account, tenant ID, client ID, client secret. The secret is written to Key Vault.
- Test connection — the platform calls Azure to list containers. If it works, you'll see a green check.
- Save.
PostgreSQL:
- New connection → PostgreSQL.
- Provide: host, port, database, username, password, SSL mode.
- Test connection — the platform issues
SELECT 1against the host. - Save.
The setup is transactional — if Key Vault write fails, the database row is rolled back so you don't end up with an orphan connection.
Discover, snapshot, and promote¶
The standard "what's in this source?" loop:
- Open the connection → Extraction tab → Setup session. For PostgreSQL this materialises a tiny dbt project under the covers; for ADLS it uses a lightweight Delta reader.
- Extract — the engine reads schemas / tables / columns from the source.
- Save snapshot — persists the result. Each snapshot has a name (auto-generated or custom).
- Browse — Catalog tab shows the deduplicated inventory; expand any schema → table → columns.
- Compare — Snapshots tab → Compare → pick two snapshots → see added / removed / renamed columns and type changes.
- Promote to Catalog — select schemas/tables → Promote → catalog assets are created (in draft) with column metadata pre-filled. From there the standard catalog approval flow applies (see Data Catalog).
Data tests¶
Data tests live per connection. Two flavours:
- Built-in tests — uniqueness, not-null, accepted-values, relationships. Pick a table + column; the engine generates the SQL.
- Custom SQL tests — write the test as a SELECT that returns failing rows. Empty result = pass.
Run them ad-hoc (Run now) or via a schedule. Failed tests show in the Tests tab and can trigger a Logic Engine alert (configure the rule with trigger source = data test).
Profiling¶
For every catalog table on a connection you can run profiling:
- Row count + per-column null %, distinct count, min / max / mean (numeric).
- Stored in
profilingrunsandprofilingcolumns. - Trend view shows deltas across runs with anomaly detection:
| Anomaly | Severity | Trigger |
|---|---|---|
| Null % increased > 10pp | warning | A column got noticeably nullier. |
| Distinct count dropped to 0 | critical | A column lost all its values. |
| Row count dropped > 50% | critical | Table shrank. |
Freshness¶
Per-table freshness rules:
- Pick a column (typically a timestamp like
updated_at). - Set a threshold (e.g. 30 minutes).
- The engine checks the max value of that column on the cadence you schedule.
- If the value is older than the threshold, the rule fails — visible on the Freshness tab and (optionally) wired to a Logic Engine alert.
Multiple rules per table are supported — use them when you have several timestamp columns that should each be fresh.
Scheduler¶
Per connection you can schedule any combination of: extract → snapshot → tests → profiling → freshness. A schedule has a cron expression, a timezone, and a list of tasks. The scheduler tick (every minute) finds due schedules, defers them to the background task queue, and recomputes next_run_at.
JobRuns capture per-schedule history — start time, duration, status, the per-task results, and any error message.
Limitations¶
| Limit | Why | Workaround |
|---|---|---|
| Only ADLS Gen2 and PostgreSQL connections are supported in this module. | Databricks lives under /admin/integrations (Databricks SQL); other sources are roadmapped. |
For now, materialise the source to ADLS or Postgres. |
| Credentials are write-once via the UI. | Key Vault rotation is not automated through the engine. | Rotate the secret in Key Vault directly; the engine reads on next connection. |
| Snapshots store metadata only — no row contents. | Privacy + storage. | Use Insights / a profiling run for sample stats; row data stays in the source. |
| Profiling samples up to a configurable cap per column. | Avoids OOM on huge tables. | Pre-aggregate in the source if you need full-population stats. |
| Schedules can't run sub-minute. | Cron + the tick interval. | Use Logic Engine for sub-minute alerts. |
| Each connection requires its own owner / approver setup downstream in the catalog. | Connection != asset; promotion is the explicit governance step. | Promote in batches once the snapshot looks right. |
Audit & compliance¶
| Question a CISO might ask | Where to look |
|---|---|
| "Where do connection secrets live?" | Azure Key Vault, never in our database. |
| "Did the engine read sensitive rows from production?" | No — only metadata (schema, columns, types) is read by extraction; profiling reads row counts and column stats only; tests read what your test SQL specifies. |
| "When did the customers table last change?" | Catalog tab → table → Column history; Snapshots → compare two snapshots. |
| "Who set up this connection?" | Connection detail → audit metadata. |
| "Show every freshness check that failed last month." | Freshness tab → filter by status + date range. |
| "Can someone bypass approval and write directly to the catalog?" | No — promotion creates draft assets; publishing requires reviewer approval. |
Troubleshooting¶
| Symptom | Likely cause | Fix |
|---|---|---|
| Test connection fails on ADLS | Wrong tenant / client ID / client secret, or RBAC on the storage account is missing. | Verify the App Registration credentials and the Storage Blob Data Reader role on the account. |
| Extraction returns 0 schemas on PostgreSQL | The user lacks USAGE on the schemas. |
Grant USAGE on the schemas + SELECT on the tables to the connection user. |
| Snapshot diff shows wrongly-renamed columns | Heuristic match. | Ignore — promote columns explicitly to override. |
| Freshness rule failing all the time | The column the rule watches isn't actually a freshness signal (e.g. it's nullable). | Pick a different column or change the threshold. |
Profiling run errors with Decimal cast |
Numeric column has unusual precision. | Expected behaviour after the platform's coercion patch — refresh; if persistent, file a ticket. |
| Schedule shows running but never completes | The Procrastinate worker died mid-run. | Cancel the JobRun; restart the worker; re-run the schedule. |
See also¶
- Data Catalog — promote snapshot tables into governed catalog assets.
- Logic Engine — alert on test failures, freshness misses, profiling anomalies.
- Databricks: per-user OAuth — for Databricks SQL connections (different module).
- HERC — ask "what schemas does the prod connection see?" to navigate without clicking.