dbt on Azure Blob Storage¶
This guide walks through connecting Datahub to dbt artifact files stored in Azure Blob Storage. By the end you will have a working dbt connection, a first sync run, and nodes visible in the Datahub UI.
Azure Blob Storage is the only supported artifact source. If your artifacts currently live elsewhere (local filesystem, S3, GCS), copy them to a Blob container before proceeding.
Prerequisites¶
| Prereq | Detail |
|---|---|
| dbt project generating artifacts | At minimum manifest.json must be present. See dbt Artifact Files for how to generate all three. |
| Azure Storage Account | A storage account in your Azure subscription with a container dedicated to dbt artifacts. |
| Shared Service Principal | Configured once in Metadata Engine → Setup → Authentication. The same App Registration is reused by all ADLS and dbt connections. The identity needs Storage Blob Data Reader on the container (or the storage account). |
| Azure Key Vault | Configured on the Datahub platform — the client secret is stored here, never in the Datahub database. No admin action required from you; the platform handles it. |
Note: You no longer need to enter service principal credentials per connection. Configure the shared service principal once on the Setup page, then all dbt and ADLS connections use it automatically.
Step 1 — Upload artifacts to Azure Blob¶
After running your dbt pipeline, copy the output files from target/ to your Blob container:
```text
dbt-artifacts/              ← container
└── prod/                   ← optional prefix / subdirectory
    ├── manifest.json
    ├── catalog.json
    └── run_results.json
```
Note the container name (dbt-artifacts) and the blob prefix (prod/). You will need both when creating the connection. Leave the prefix blank if the files sit at the container root.
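Before uploading, it can help to confirm that target/ actually contains the files Datahub reads. The following is a minimal sketch, not part of Datahub or dbt: the function name check_artifacts is hypothetical, and it only tests for file presence, treating manifest.json as required and the other two files as optional.

```shell
# check_artifacts DIR
# Hypothetical helper: returns 0 if DIR contains manifest.json, 1 otherwise.
# catalog.json and run_results.json are optional, so a missing one only
# produces a note on stderr.
check_artifacts() {
  dir="${1:-./target}"
  if [ ! -f "$dir/manifest.json" ]; then
    echo "missing required file: $dir/manifest.json" >&2
    return 1
  fi
  for optional in catalog.json run_results.json; do
    [ -f "$dir/$optional" ] || echo "note: $dir/$optional not found" >&2
  done
  return 0
}
```

Calling `check_artifacts ./target` from the dbt project root before the upload step lets a pipeline stop early instead of uploading an incomplete set of artifacts.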
Uploading with Azure CLI¶
The recommended way to upload artifacts is with the Azure CLI. Install it with brew install azure-cli (macOS), apt install azure-cli (Debian/Ubuntu), or via the Azure CLI install page.
One-time login:
```shell
az login

# Or for service principal login (CI/CD):
az login --service-principal \
  --username "$AZURE_CLIENT_ID" \
  --password "$AZURE_CLIENT_SECRET" \
  --tenant "$AZURE_TENANT_ID"
```
Upload all artifact files:
```shell
# From your dbt project root
az storage blob upload-batch \
  --account-name "mystorageaccount" \
  --destination "dbt-artifacts" \
  --destination-path "prod" \
  --source "./target" \
  --pattern "*.json" \
  --overwrite \
  --auth-mode login
```
| Flag | What it does |
|---|---|
| --account-name | Azure Storage account name |
| --destination | Container name |
| --destination-path | Prefix/folder inside the container; omit to upload to the root |
| --source | Local folder to upload from; use your dbt target/ directory |
| --pattern "*.json" | Upload only .json files, excluding compiled SQL, logs, and other output |
| --overwrite | Replace existing blobs so Datahub always gets the latest run |
| --auth-mode login | Use the current az login identity; no need to pass account keys |
If you prefer a storage account key over az login:
```shell
az storage blob upload-batch \
  --account-name "mystorageaccount" \
  --account-key "$AZURE_STORAGE_KEY" \
  --destination "dbt-artifacts" \
  --destination-path "prod" \
  --source "./target" \
  --pattern "*.json" \
  --overwrite
```
Verify the upload:
```shell
az storage blob list \
  --account-name "mystorageaccount" \
  --container-name "dbt-artifacts" \
  --prefix "prod/" \
  --auth-mode login \
  --output table
```
CI/CD integration (Azure DevOps — recommended)¶
The Setup page provides a guided Azure DevOps integration:
- Create a variable group named ci_variables_dbt_artifacts in your Azure DevOps project with:
    - client_id_dbt_artifacts
    - client_secret_dbt_artifacts
    - tenant_id_dbt_artifacts
- Add the variable group at the top of your pipeline YAML, as a group entry under variables that references ci_variables_dbt_artifacts.
- Download the upload script: on the dbt connection card in the Setup page, click Download Script to get the generic upload_dbt_artifacts.sh script. Place it in your repository (e.g., devops_pipelines/scripts/azure/upload_dbt_artifacts.sh).
- Copy the pipeline YAML: click Copy Pipeline YAML on the connection card. This generates a pipeline job snippet with your storage account, container, and blob prefix pre-filled:
```yaml
- job: upload_dbt_artifacts
  dependsOn: UV_initialization
  displayName: "Upload dbt artifacts to Azure Blob"
  steps:
    - script: |
        devops_pipelines/scripts/azure/upload_dbt_artifacts.sh
      env:
        AZURE_TENANT_ID: $(tenant_id_dbt_artifacts)
        AZURE_CLIENT_ID: $(client_id_dbt_artifacts)
        AZURE_CLIENT_SECRET: $(client_secret_dbt_artifacts)
        AZURE_STORAGE_ACCOUNT: mystorageaccount
        AZURE_CONTAINER_NAME: dbt-artifacts
        AZURE_BLOB_PREFIX: prod/
        # DBT_PROJECT_DIR: $(dbt_project_dir)  # Optional; defaults to repo root
      displayName: Upload dbt artifacts
```
CI/CD integration (GitHub Actions)¶
```yaml
- name: Run dbt
  run: |
    dbt run --profiles-dir .
    dbt test --profiles-dir .
    dbt docs generate --profiles-dir .

- name: Upload dbt artifacts to Azure Blob
  run: |
    az storage blob upload-batch \
      --account-name "${{ secrets.AZURE_STORAGE_ACCOUNT }}" \
      --destination "${{ secrets.AZURE_CONTAINER_NAME }}" \
      --destination-path "prod" \
      --source "./target" \
      --pattern "*.json" \
      --overwrite
  env:
    AZURE_STORAGE_KEY: ${{ secrets.AZURE_STORAGE_KEY }}

- name: Trigger Datahub sync
  run: |
    curl -X POST "${{ secrets.DATAHUB_URL }}/api/metadata_engine/dbt-connections/${{ secrets.DATAHUB_DBT_CONNECTION_ID }}/sync" \
      -H "Authorization: Bearer ${{ secrets.DATAHUB_TOKEN }}"
```
Step 2 — Create the connection in Datahub¶
- Ensure the shared service principal is configured: go to Metadata Engine → Setup → Authentication and enter your App Registration's tenant ID, client ID, and client secret. This only needs to be done once.
- Navigate to Metadata Engine → Setup → dbt Connections tab → Add Connection.
- Fill in the form:
| Field | Example | Required | Notes |
|---|---|---|---|
| Name | prod-dbt | Yes | Internal identifier. Lowercase letters, numbers, and hyphens only. |
| Storage account name | mystorageaccount | Yes | The Azure Storage account name, not the full URL. |
| Container name | dbt-artifacts | Yes | The Blob container that holds your artifact files. |
| Blob prefix | prod/ | No | Path prefix within the container. Include a trailing slash. Leave blank if files are at the container root. |
- Click Create. The connection uses the shared service principal for authentication — no per-connection credentials needed.
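The naming and prefix rules in the form can be checked before you open the UI, for example when connection values are kept in a pipeline config. This is a minimal sketch under stated assumptions: the function names valid_connection_name and normalize_prefix are hypothetical, and the checks mirror only the rules in the table above, not whatever validation Datahub performs internally.

```shell
# valid_connection_name NAME
# Hypothetical check: succeeds only if NAME is non-empty and contains
# nothing but lowercase letters, digits, and hyphens.
valid_connection_name() {
  case "$1" in
    ""|*[!abcdefghijklmnopqrstuvwxyz0123456789-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# normalize_prefix PREFIX
# Hypothetical helper: drops one leading slash and ensures a single
# trailing slash; an empty prefix (container root) stays empty.
normalize_prefix() {
  p="${1#/}"
  if [ -z "$p" ]; then
    return 0
  fi
  printf '%s/' "${p%/}"
}
```

For instance, `normalize_prefix "prod"` prints `prod/`, which is the form the Blob prefix field expects.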
Step 3 — Test the connection¶
Click Test connection on the connection detail page. Datahub authenticates with Azure using the App Registration credentials and attempts to list blobs in the container at the configured prefix. A green check means credentials are valid and the container is accessible.
Common failures:
| Result | Likely cause |
|---|---|
| Authentication error | Wrong tenant ID, client ID, or client secret |
| Container not found | Storage account name or container name has a typo |
| Access denied | App Registration is missing Storage Blob Data Reader on the container |
Step 4 — Run your first sync¶
Click Sync now on the connection detail page. Datahub:
- Downloads manifest.json (and catalog.json / run_results.json if present) from the configured Blob path.
- Parses each file and persists nodes, columns, lineage edges, and test results.
- Creates a Sync Run record with counts of everything processed.
The sync is synchronous: the button shows a loading state until it completes. Even for large projects this typically takes only a few seconds.
After the sync, the Sync Runs section shows:
| Field | What it means |
|---|---|
| Nodes synced | Number of dbt nodes (models + sources + seeds + snapshots) |
| Columns synced | Total column records across all nodes |
| Edges synced | Lineage edges (upstream/downstream pairs) |
| Test results synced | Individual test outcome records |
| Synced at | Timestamp of the sync |
Browsing results¶
Node list¶
The Nodes tab shows all dbt nodes for the selected sync run. Filter by:
- Resource type — model, source, seed, or snapshot
- Schema — the database schema the node lives in
- Search — name search across all nodes
Node detail¶
Click any node to open its detail view. Four tabs:
| Tab | What you see |
|---|---|
| Schema | All columns with name, data type, and description. Types sourced from catalog.json where available, otherwise from manifest.json. |
| dbt | Raw SQL (pre-Jinja) and compiled SQL (post-Jinja). Compiled SQL requires compiled_code in manifest.json — see dbt Artifact Files. |
| Lineage | Dependency graph showing upstream nodes (what this node reads from) and downstream nodes (what reads from this node). |
| Data Quality | Test results for this node from the selected sync run. Shows test name, status, failure count, message, and execution time. |
Viewing a previous sync run¶
Use the Sync Run version selector at the top of the connection detail page to browse any historical sync. Nodes, columns, lineage, and test results all reflect the state at that sync's timestamp.
Automating syncs from CI/CD¶
Datahub does not schedule dbt syncs automatically. Trigger a sync via the API at the end of your dbt pipeline, after the artifact upload:
```shell
# Replace {id} with the ID of your dbt connection
curl -X POST "https://your-datahub-instance/api/metadata_engine/dbt-connections/{id}/sync" \
  -H "Authorization: Bearer $DATAHUB_TOKEN"
```
Full GitHub Actions and Azure DevOps pipeline examples — including the az storage blob upload-batch step — are in Step 1 above.
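In a pipeline it is usually worth making the step fail when the sync request fails, since a bare curl call exits successfully even on an HTTP error response. The sketch below wraps the same endpoint in two helpers. The function names sync_url and trigger_sync are hypothetical, and the 300-second timeout is an assumption chosen because the sync is synchronous; adjust it to your project size.

```shell
# sync_url BASE_URL CONNECTION_ID
# Hypothetical helper: builds the sync endpoint, tolerating a trailing
# slash on BASE_URL.
sync_url() {
  base="${1%/}"
  printf '%s/api/metadata_engine/dbt-connections/%s/sync\n' "$base" "$2"
}

# trigger_sync BASE_URL CONNECTION_ID
# Sends the POST and returns non-zero on HTTP errors (--fail) or if the
# request exceeds the assumed 300-second limit. Reads the token from
# DATAHUB_TOKEN.
trigger_sync() {
  curl --fail --silent --show-error --max-time 300 \
    --request POST \
    --header "Authorization: Bearer $DATAHUB_TOKEN" \
    "$(sync_url "$1" "$2")"
}
```

A pipeline step would then call `trigger_sync "$DATAHUB_URL" "$DATAHUB_DBT_CONNECTION_ID"` after the upload step, so a failed sync stops the job.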
Limitations¶
| Limit | Workaround |
|---|---|
| Only Azure Blob Storage is supported as an artifact source. | Upload artifacts to Azure Blob from your CI pipeline before syncing. |
| Syncs are on-demand — there is no built-in cron scheduler for dbt connections. | Trigger the sync endpoint from your CI/CD pipeline post-dbt run. |
| Lineage extraction requires compiled_code in manifest.json. | Run dbt compile or dbt docs generate, not just dbt run, to populate compiled SQL. |
| Client secrets are write-once via the UI. | To rotate the secret, enter the new value for the shared service principal in Metadata Engine → Setup → Authentication. |
Troubleshooting¶
| Symptom | Likely cause | Fix |
|---|---|---|
| Test connection fails | Wrong credentials or missing RBAC | Verify the App Registration has Storage Blob Data Reader on the container; double-check tenant ID, client ID, and secret. |
| Sync completes but 0 nodes synced | Wrong blob prefix or manifest.json not present | Confirm the prefix exactly matches the path in your container; ensure manifest.json was uploaded before syncing. |
| Lineage tab is empty on all nodes | compiled_code absent in manifest.json | Run dbt docs generate (not just dbt run) so the manifest includes compiled SQL, which is needed to extract lineage. |
| Data Quality tab is empty (no test results) | run_results.json absent or not uploaded | Run dbt test and upload run_results.json to the same container prefix before syncing. |
| Columns show no data types | catalog.json absent | Run dbt docs generate and upload catalog.json; without it, types fall back to manifest.json declarations, which may be blank. |
| Sync fails with authentication error after secret rotation | The secret in Key Vault is stale | Go to Metadata Engine → Setup → Authentication, enter the new client secret for the shared service principal, and save to update Key Vault. |
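For the empty-lineage symptom, a quick local check on the manifest can confirm the cause before re-running the pipeline. This is a rough sketch: the function name has_compiled_code is hypothetical, and it uses a plain text match for a compiled_code key with a string value rather than parsing the JSON, so it only tells you whether at least one node carries compiled SQL.

```shell
# has_compiled_code MANIFEST
# Hypothetical check: succeeds if MANIFEST contains at least one
# "compiled_code" key whose value starts as a JSON string.
# Text match only; it does not validate or parse the JSON.
has_compiled_code() {
  grep -Eq '"compiled_code"[[:space:]]*:[[:space:]]*"' "$1"
}
```

Running `has_compiled_code target/manifest.json || echo "no compiled SQL found"` before the upload shows whether the manifest you are about to sync can populate the Lineage tab.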
See also¶
- dbt Artifact Files — what manifest.json, catalog.json, and run_results.json contain and how to generate them.
- dbt Integration — overview of the integration: features, UI surfaces, and concepts.
- Metadata Engine — the broader metadata platform (ADLS, PostgreSQL connections, snapshots, profiling, freshness).