dbt Integration: nodes, lineage, tests, SQL
The dbt integration reads artifact files generated by your dbt project and makes models, sources, seeds, and snapshots first-class objects in Datahub. Once connected, you can browse every dbt node, view its raw and compiled SQL, trace upstream and downstream lineage, and inspect test results — all without running dbt inside Datahub.
If your warehouse holds the transformed data and your dbt project describes how it was built, the dbt integration is what makes that description visible to the rest of your organisation.
When to choose this
Reach for the dbt integration when you want to:
- Explore your dbt project. Browse all models, sources, seeds, and snapshots with their schema, materialization, and tags.
- View SQL. See the raw SQL as written in the model file and the compiled SQL after Jinja rendering, per node.
- Trace lineage. Visualise which models depend on which sources or models, and which downstream models a source feeds.
- Monitor test results. See whether each dbt test passed, failed, warned, or errored — per sync run.
- Inspect column metadata. Column names, data types, and descriptions sourced from your manifest and catalog.
You do not need the dbt integration for:
- Running dbt (Datahub does not execute dbt commands).
- Managing dbt profiles or projects.
- Scheduling dbt jobs (trigger syncs from your own CI/CD pipeline).
- Importing dbt metadata into the Data Catalog. Promotion from dbt to the Catalog is not yet supported; the dbt integration is a read-only explorer.
What Datahub integrates from dbt
| Category | What you get |
|---|---|
| Nodes | All models, sources, seeds, and snapshots — name, schema, database, resource type, materialization strategy, tags, and FQN path. |
| SQL | Raw SQL (as written in the model file) and compiled SQL (from dbt compile / dbt docs generate) for each node. |
| Columns | Column names, data types, and descriptions. The manifest provides descriptions; catalog.json supplies the actual column types from the database. |
| Lineage | Directed dependency graph: upstream and downstream edges derived from parent_map in manifest.json. |
| Test results | Pass / fail / warn / error status per dbt test, with failure count, message, and execution time — one snapshot per sync run. |
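To make the Nodes and Columns rows concrete, here is a minimal sketch of how the same fields could be read from the artifacts with plain Python. It is not Datahub's parser: the function names `load_artifact` and `summarise_nodes` are hypothetical, and the key names (`nodes`, `sources`, `resource_type`, `config.materialized`, `columns`, `type`) follow the layout of dbt's manifest.json and catalog.json, which can vary between dbt versions.

```python
import json
from pathlib import Path


def load_artifact(path):
    """Read one dbt artifact file and return it as a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarise_nodes(manifest, catalog=None):
    """Return one summary dict per model, source, seed, and snapshot.

    Descriptions come from the manifest. When a catalog is supplied, its
    column types take precedence over the manifest's declared data_type.
    """
    wanted = {"model", "source", "seed", "snapshot"}
    catalog = catalog or {}
    # Both artifacts keep sources in a separate top-level mapping.
    cat_entries = {**catalog.get("nodes", {}), **catalog.get("sources", {})}
    entries = {**manifest.get("nodes", {}), **manifest.get("sources", {})}

    summaries = []
    for unique_id, node in entries.items():
        if node.get("resource_type") not in wanted:
            continue  # skip tests, analyses, and other resource types
        # Some warehouses report column names in upper case, so match
        # manifest and catalog columns case-insensitively.
        cat_cols = {
            name.lower(): col
            for name, col in cat_entries.get(unique_id, {}).get("columns", {}).items()
        }
        columns = []
        for name, col in node.get("columns", {}).items():
            cat_col = cat_cols.get(name.lower(), {})
            columns.append({
                "name": name,
                "description": col.get("description", ""),
                "type": cat_col.get("type") or col.get("data_type"),
            })
        summaries.append({
            "unique_id": unique_id,
            "name": node.get("name"),
            "resource_type": node.get("resource_type"),
            "database": node.get("database"),
            "schema": node.get("schema"),
            "materialized": node.get("config", {}).get("materialized"),
            "tags": node.get("tags", []),
            "fqn": node.get("fqn", []),
            "columns": columns,
        })
    return summaries
```

A typical call would be `summarise_nodes(load_artifact("target/manifest.json"), load_artifact("target/catalog.json"))`. For brevity the sketch lists only columns documented in the manifest; columns that appear only in catalog.json are left out.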
How it works
- Your dbt project generates artifact files (manifest.json, catalog.json, run_results.json) during a dbt run / dbt test / dbt docs generate cycle.
- You, or your CI/CD pipeline, upload those files to an Azure Blob Storage container.
- You trigger a sync in Datahub; the platform downloads the artifacts, parses them, and persists the results as a Sync Run.
Each sync run is a versioned snapshot. You can compare nodes and test results across runs by selecting a previous run from the version picker.
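The idea of comparing test results across runs can be sketched outside Datahub by diffing two run_results.json files. This is an illustration only, not how the version picker is implemented: the function names are hypothetical, and the sketch assumes each entry in `results` carries `unique_id` and `status`, and that test IDs start with `test.`, as in dbt's run_results.json.

```python
def index_test_status(run_results):
    """Map each test's unique_id to its status, ignoring non-test results."""
    return {
        result["unique_id"]: result["status"]
        for result in run_results.get("results", [])
        if result["unique_id"].startswith("test.")
    }


def diff_test_status(previous, current):
    """List (unique_id, old_status, new_status) for every test that changed.

    A status of None means the test was absent from that run.
    """
    before = index_test_status(previous)
    after = index_test_status(current)
    changes = []
    for unique_id in sorted(before.keys() | after.keys()):
        old, new = before.get(unique_id), after.get(unique_id)
        if old != new:
            changes.append((unique_id, old, new))
    return changes
```

Passing the parsed contents of an older and a newer run_results.json returns only the tests whose outcome differs, which is the kind of comparison the version picker supports in the UI.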
What the dbt integration looks like
| Surface | Route | What you see |
|---|---|---|
| dbt Connections | /metadata-engine/dbt-connections | Grid of configured dbt artifact connections with name, last sync timestamp, and node count. |
| Connection detail | /metadata-engine/dbt-connections/{id} | Node browser, lineage tab, SQL viewer, test results, and sync history for one connection. |
| Node list | Node tab on the connection detail | Filterable, searchable table of all dbt nodes for the selected sync run. |
| Node detail | Click a node | Schema tab (columns + types + descriptions), dbt tab (raw + compiled SQL), Lineage tab (dependency graph), Data Quality tab (test results for this node). |
| Sync history | Sync Runs section | List of past syncs with counts of nodes, columns, edges, and test results synced. |
Concepts
| Concept | What it is |
|---|---|
| dbt Connection | A registered link to an Azure Blob Storage container that holds your dbt artifacts. Credentials live in Azure Key Vault. |
| Artifact source | The storage backend (Azure Blob) from which artifacts are downloaded on sync. |
| Sync Run | A single execution of the sync operation. Produces a versioned snapshot of all nodes, columns, edges, and test results. |
| dbt Node | One model, source, seed, or snapshot from your dbt project, identified by its unique_id. |
| Lineage edge | A directed dependency between two nodes (upstream → downstream), derived from parent_map in manifest.json. |
| Test result | The outcome of one dbt test execution — status, failure count, message, and execution time — tied to a specific sync run. |
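As an illustration of the lineage edge concept, the sketch below turns parent_map into directed (upstream, downstream) pairs and walks the graph to collect every ancestor of a node. The function names are hypothetical, and the code assumes only that parent_map maps each unique_id to a list of its direct parents; Datahub's own edge derivation may filter or enrich these edges differently.

```python
from collections import deque


def lineage_edges(manifest):
    """Return (upstream, downstream) pairs from the manifest's parent_map."""
    return [
        (parent, child)
        for child, parents in manifest.get("parent_map", {}).items()
        for parent in parents
    ]


def all_upstream(manifest, unique_id):
    """Collect every direct and transitive ancestor of one node."""
    parent_map = manifest.get("parent_map", {})
    seen = set()
    queue = deque([unique_id])
    while queue:
        # Breadth-first walk: visit the parents of each queued node once.
        for parent in parent_map.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Note that parent_map also lists dbt tests as children of the nodes they check, so a consumer that wants model-to-model lineage only would filter those entries out first.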
Setup: what you need once
| Prereq | Where | Why |
|---|---|---|
| dbt project generating artifacts | Your own infrastructure | Datahub reads pre-built artifacts; it does not run dbt itself. |
| Azure Blob Storage | Your Azure subscription | The artifact source. One container per dbt project is a common pattern. |
| App Registration | Azure Active Directory | Datahub authenticates with a client ID + secret; the secret is stored in Key Vault. |
| Azure Key Vault | Configured on the Datahub platform | Connection secrets never touch the Datahub database. |
See dbt on Azure Blob Storage for a step-by-step setup guide.
See also
- dbt Artifact Files — what manifest.json, catalog.json, and run_results.json contain and how to generate them.
- dbt on Azure Blob Storage — step-by-step connection setup, form fields, sync, and troubleshooting.
- Metadata Engine — the broader metadata platform (ADLS, PostgreSQL connections, snapshots, profiling, freshness).
- Data Catalog — governed asset registry where promoted tables become first-class assets.