dbt on Azure Blob Storage

This guide walks through connecting Datahub to dbt artifact files stored in Azure Blob Storage. By the end you will have a working dbt connection, a first sync run, and nodes visible in the Datahub UI.

Azure Blob Storage is the only supported artifact source. If your artifacts currently live elsewhere (local filesystem, S3, GCS), copy them to a Blob container before proceeding.

Prerequisites

Prerequisite	Detail
dbt project generating artifacts	At minimum `manifest.json` must be present. See dbt Artifact Files for how to generate all three.
Azure Storage Account	A storage account in your Azure subscription with a container dedicated to dbt artifacts.
Shared Service Principal	Configured once in Metadata Engine → Setup → Authentication. The same App Registration is reused by all ADLS and dbt connections. The identity needs Storage Blob Data Reader on the container (or the storage account).
Azure Key Vault	Configured on the Datahub platform — the client secret is stored here, never in the Datahub database. No admin action required from you; the platform handles it.

Note: You no longer need to enter service principal credentials per connection. Configure the shared service principal once on the Setup page, then all dbt and ADLS connections use it automatically.

Step 1 — Upload artifacts to Azure Blob

After running your dbt pipeline, copy the output files from target/ to your Blob container:

dbt-artifacts/          ← container
└── prod/               ← optional prefix / subdirectory
    ├── manifest.json
    ├── catalog.json
    └── run_results.json

Note the container name (dbt-artifacts) and the blob prefix (prod/). You will need both when creating the connection. Leave the prefix blank if the files sit at the container root.
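
Before uploading, it can help to confirm that the expected files actually exist in target/. The following is a minimal sketch, assuming the default target/ directory; check_artifacts is a hypothetical helper name and not part of dbt or Datahub. It treats manifest.json as required and the other two files as optional, matching the prerequisites above.

```shell
# Report which dbt artifact files are missing from a target directory.
# manifest.json is required; catalog.json and run_results.json are optional.
# check_artifacts is a hypothetical helper, not part of dbt or Datahub.
check_artifacts() {
  local target_dir="${1:-./target}"
  local status=0
  local name

  if [ ! -f "$target_dir/manifest.json" ]; then
    echo "missing required: manifest.json"
    status=1
  fi

  for name in catalog.json run_results.json; do
    if [ ! -f "$target_dir/$name" ]; then
      echo "missing optional: $name"
    fi
  done

  # Non-zero only when the required file is absent
  return "$status"
}
```

Calling check_artifacts ./target || exit 1 before the upload step stops the pipeline early when manifest.json was never generated.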

Uploading with Azure CLI

The recommended way to upload artifacts is with the Azure CLI. Install it with brew install azure-cli (macOS), apt install azure-cli (Debian/Ubuntu; add the Microsoft package repository first to get a current version), or by following the Azure CLI install page.

One-time login:

az login
# Or for service principal login (CI/CD):
az login --service-principal \
  --username "$AZURE_CLIENT_ID" \
  --password "$AZURE_CLIENT_SECRET" \
  --tenant "$AZURE_TENANT_ID"

Upload all artifact files:

# From your dbt project root
az storage blob upload-batch \
  --account-name "mystorageaccount" \
  --destination "dbt-artifacts" \
  --destination-path "prod" \
  --source "./target" \
  --pattern "*.json" \
  --overwrite \
  --auth-mode login

Flag	What it does
`--account-name`	Azure Storage account name
`--destination`	Container name
`--destination-path`	Prefix/folder inside the container — omit to upload to the root
`--source`	Local folder to upload from — use your dbt `target/` directory
`--pattern "*.json"`	Upload only `.json` files, excluding compiled SQL, logs, and other output
`--overwrite`	Replace existing blobs so Datahub always gets the latest run
`--auth-mode login`	Use the current `az login` identity — no need to pass account keys

If you prefer a storage account key over az login:

az storage blob upload-batch \
  --account-name "mystorageaccount" \
  --account-key "$AZURE_STORAGE_KEY" \
  --destination "dbt-artifacts" \
  --destination-path "prod" \
  --source "./target" \
  --pattern "*.json" \
  --overwrite

Verify the upload:

az storage blob list \
  --account-name "mystorageaccount" \
  --container-name "dbt-artifacts" \
  --prefix "prod/" \
  --auth-mode login \
  --output table
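
In a pipeline, the listing can be checked automatically instead of by eye. The sketch below assumes blob names arrive one per line on standard input (for example from az storage blob list with --query "[].name" --output tsv); missing_blobs is a hypothetical helper name. It reports each expected artifact that is absent and fails only when manifest.json is missing.

```shell
# Read blob names (one per line) on stdin and print every expected artifact
# that is absent under the given prefix. The exit status is non-zero only
# when manifest.json is missing. missing_blobs is a hypothetical helper.
missing_blobs() {
  local prefix="$1"   # e.g. "prod/", or "" for the container root
  local names name
  local status=0
  names="$(cat)"

  for name in manifest.json catalog.json run_results.json; do
    # -x matches the whole line, -F treats the name as a fixed string
    if ! printf '%s\n' "$names" | grep -qxF -- "${prefix}${name}"; then
      echo "not found: ${prefix}${name}"
      if [ "$name" = "manifest.json" ]; then
        status=1
      fi
    fi
  done
  return "$status"
}

# Example (not executed here):
# az storage blob list --account-name "mystorageaccount" \
#   --container-name "dbt-artifacts" --prefix "prod/" --auth-mode login \
#   --query "[].name" --output tsv | missing_blobs "prod/"
```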

CI/CD integration (Azure DevOps — recommended)

The Setup page provides a guided Azure DevOps integration:

Create a variable group named ci_variables_dbt_artifacts in your Azure DevOps project with:
client_id_dbt_artifacts
client_secret_dbt_artifacts
tenant_id_dbt_artifacts

Add the variable group at the top of your pipeline YAML:

variables:
  - group: ci_variables_dbt_artifacts

Download the upload script — on the dbt connection card in the Setup page, click Download Script to get the generic upload_dbt_artifacts.sh script. Place it in your repository (e.g., devops_pipelines/scripts/azure/upload_dbt_artifacts.sh).
Copy the pipeline YAML — click Copy Pipeline YAML on the connection card. This generates a pipeline job snippet with your storage account, container, and blob prefix pre-filled:

- job: upload_dbt_artifacts
  dependsOn: UV_initialization
  displayName: "Upload dbt artifacts to Azure Blob"
  steps:
    - script: |
        devops_pipelines/scripts/azure/upload_dbt_artifacts.sh
      env:
        AZURE_TENANT_ID: $(tenant_id_dbt_artifacts)
        AZURE_CLIENT_ID: $(client_id_dbt_artifacts)
        AZURE_CLIENT_SECRET: $(client_secret_dbt_artifacts)
        AZURE_STORAGE_ACCOUNT: mystorageaccount
        AZURE_CONTAINER_NAME: dbt-artifacts
        AZURE_BLOB_PREFIX: prod/
        # DBT_PROJECT_DIR: $(dbt_project_dir)  # Optional — defaults to repo root
      displayName: Upload dbt artifacts
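
The contents of upload_dbt_artifacts.sh are not reproduced in this guide. As an illustration of what a script driven by these environment variables might check first, here is a small sketch that fails fast when a required variable is unset; require_env is a hypothetical helper, and the script that Datahub provides may validate its inputs differently or not at all. AZURE_BLOB_PREFIX is left out of the check because an empty prefix is valid when files sit at the container root.

```shell
# Fail fast when any variable the pipeline job is expected to export is
# unset or empty. require_env is a hypothetical helper; pass only trusted,
# literal variable names because the lookup uses eval.
require_env() {
  local missing=0
  local var value
  for var in "$@"; do
    # Indirect lookup of the variable named by $var (portable across sh and bash)
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "error: $var is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use at the top of an upload script (not executed here):
# require_env AZURE_TENANT_ID AZURE_CLIENT_ID AZURE_CLIENT_SECRET \
#   AZURE_STORAGE_ACCOUNT AZURE_CONTAINER_NAME || exit 1
```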

CI/CD integration (GitHub Actions)

- name: Run dbt
  run: |
    dbt run --profiles-dir .
    dbt test --profiles-dir .
    dbt docs generate --profiles-dir .

- name: Upload dbt artifacts to Azure Blob
  run: |
    az storage blob upload-batch \
      --account-name "${{ secrets.AZURE_STORAGE_ACCOUNT }}" \
      --destination "${{ secrets.AZURE_CONTAINER_NAME }}" \
      --destination-path "prod" \
      --source "./target" \
      --pattern "*.json" \
      --overwrite
  env:
    # The az CLI reads the account key from AZURE_STORAGE_KEY, so no --account-key flag is needed
    AZURE_STORAGE_KEY: ${{ secrets.AZURE_STORAGE_KEY }}

- name: Trigger Datahub sync
  run: |
    curl --fail -X POST "${{ secrets.DATAHUB_URL }}/api/metadata_engine/dbt-connections/${{ secrets.DATAHUB_DBT_CONNECTION_ID }}/sync" \
      -H "Authorization: Bearer ${{ secrets.DATAHUB_TOKEN }}"

Step 2 — Create the connection in Datahub

Ensure the shared service principal is configured: go to Metadata Engine → Setup → Authentication and enter your App Registration's tenant ID, client ID, and client secret. This only needs to be done once.
Navigate to Metadata Engine → Setup → dbt Connections tab → Add Connection.
Fill in the form:

Field	Example	Required	Notes
Name	`prod-dbt`	Yes	Internal identifier. Lowercase letters, numbers, and hyphens only.
Storage account name	`mystorageaccount`	Yes	The Azure Storage account name — not the full URL.
Container name	`dbt-artifacts`	Yes	The Blob container that holds your artifact files.
Blob prefix	`prod/`	No	Path prefix within the container. Include a trailing slash. Leave blank if files are at the container root.

Click Create. The connection uses the shared service principal for authentication — no per-connection credentials needed.
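
If connections are prepared from a script or checklist, the two formatting rules in the table (name characters and trailing slash on the prefix) can be checked up front. This is a sketch of a client-side pre-check only; how Datahub itself validates the form is not described here, and both function names are hypothetical.

```shell
# Succeed when the name uses only lowercase letters, digits, and hyphens.
# The character set is spelled out so the check does not depend on locale.
valid_connection_name() {
  case "$1" in
    ''|*[!abcdefghijklmnopqrstuvwxyz0123456789-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Ensure a non-empty prefix ends with a slash.
# An empty prefix (files at the container root) stays empty.
normalize_prefix() {
  case "$1" in
    ''|*/) printf '%s\n' "$1" ;;
    *)     printf '%s/\n' "$1" ;;
  esac
}
```

For example, normalize_prefix "prod" prints prod/, and valid_connection_name "Prod_DBT" returns a non-zero status.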

Step 3 — Test the connection

Click Test connection on the connection detail page. Datahub authenticates with Azure using the App Registration credentials and attempts to list blobs in the container at the configured prefix. A green check means credentials are valid and the container is accessible.

Common failures:

Result	Likely cause
Authentication error	Wrong tenant ID, client ID, or client secret
Container not found	Storage account name or container name has a typo
Access denied	App Registration is missing Storage Blob Data Reader on the container
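
To separate an Azure-side problem from a Datahub-side one, a similar listing can be attempted from a terminal after signing in as the same service principal (see the az login --service-principal command in Step 1). The sketch below approximates the check described above but is not the code Datahub runs; can_list_blobs is a hypothetical helper name.

```shell
# Try to list a single blob under the prefix with the current az identity.
# Succeeds when the identity can read the container; the error message from
# az on failure usually indicates whether the cause is authentication,
# a missing container, or missing RBAC.
can_list_blobs() {
  local account="$1" container="$2" prefix="$3"
  az storage blob list \
    --account-name "$account" \
    --container-name "$container" \
    --prefix "$prefix" \
    --num-results 1 \
    --auth-mode login \
    --output none
}

# Example (not executed here):
# can_list_blobs "mystorageaccount" "dbt-artifacts" "prod/" && echo "listing works"
```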

Step 4 — Run your first sync

Click Sync now on the connection detail page. Datahub:

Downloads manifest.json (and catalog.json / run_results.json if present) from the configured Blob path.
Parses each file and persists nodes, columns, lineage edges, and test results.
Creates a Sync Run record with counts of everything processed.

The sync runs synchronously: the button shows a loading state until it completes. Even for large projects this typically takes a few seconds.

After the sync, the Sync Runs section shows:

Field	What it means
Nodes synced	Number of dbt nodes (models + sources + seeds + snapshots)
Columns synced	Total column records across all nodes
Edges synced	Lineage edges (upstream/downstream pairs)
Test results synced	Individual test outcome records
Synced at	Timestamp of the sync
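
For a rough cross-check of the Nodes synced figure, the same node types can be counted in a local manifest.json. The sketch below assumes jq is installed and relies on the standard manifest layout (a nodes map whose entries carry resource_type, plus a separate sources map); count_manifest_nodes is a hypothetical helper, and the number Datahub reports may differ if it applies filters not described here.

```shell
# Count models, seeds, and snapshots under .nodes plus all entries under
# .sources in a dbt manifest. Requires jq.
# count_manifest_nodes is a hypothetical helper.
count_manifest_nodes() {
  jq -r '
    ([.nodes[]? | select(.resource_type == "model" or
                         .resource_type == "seed" or
                         .resource_type == "snapshot")] | length)
    + ((.sources // {}) | length)
  ' "${1:-./target/manifest.json}"
}
```

Tests and other node types under .nodes are deliberately excluded, since the table lists only models, sources, seeds, and snapshots.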

Browsing results

Node list

The Nodes tab shows all dbt nodes for the selected sync run. Filter by:

Resource type — model, source, seed, or snapshot
Schema — the database schema the node lives in
Search — name search across all nodes

Node detail

Click any node to open its detail view, which has four tabs:

Tab	What you see
Schema	All columns with name, data type, and description. Types sourced from `catalog.json` where available, otherwise from `manifest.json`.
dbt	Raw SQL (pre-Jinja) and compiled SQL (post-Jinja). Compiled SQL requires `compiled_code` in `manifest.json` — see dbt Artifact Files.
Lineage	Dependency graph showing upstream nodes (what this node reads from) and downstream nodes (what reads from this node).
Data Quality	Test results for this node from the selected sync run. Shows test name, status, failure count, message, and execution time.

Viewing a previous sync run

Use the Sync Run version selector at the top of the connection detail page to browse any historical sync. Nodes, columns, lineage, and test results all reflect the state at that sync's timestamp.

Automating syncs from CI/CD

Datahub does not schedule dbt syncs automatically. Trigger a sync via the API at the end of your dbt pipeline, after the artifact upload:

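# DATAHUB_URL is your instance base URL; DATAHUB_DBT_CONNECTION_ID is the ID of the dbt connection
curl -X POST "$DATAHUB_URL/api/metadata_engine/dbt-connections/$DATAHUB_DBT_CONNECTION_ID/sync" \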
  -H "Authorization: Bearer $DATAHUB_TOKEN"
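
By default curl exits with status 0 even when the server answers with an HTTP error, so a failed sync would not fail the pipeline step. A minimal sketch of a wrapper that does fail on HTTP errors is shown below. DATAHUB_URL and DATAHUB_DBT_CONNECTION_ID mirror the secret names in the GitHub Actions example; SYNC_TIMEOUT_SECONDS and the function name trigger_sync are assumptions introduced for illustration. The response body is passed through untouched because its format is not documented here.

```shell
# POST to the sync endpoint and return non-zero on any HTTP error or timeout.
# Expects DATAHUB_URL, DATAHUB_DBT_CONNECTION_ID, and DATAHUB_TOKEN to be set.
trigger_sync() {
  # Drop one trailing slash from the base URL to avoid "//" in the path
  local url="${DATAHUB_URL%/}/api/metadata_engine/dbt-connections/${DATAHUB_DBT_CONNECTION_ID}/sync"
  curl --fail --silent --show-error \
    --max-time "${SYNC_TIMEOUT_SECONDS:-300}" \
    -X POST "$url" \
    -H "Authorization: Bearer ${DATAHUB_TOKEN}"
}
```

Because the sync is synchronous, the timeout should comfortably exceed the longest sync you expect; 300 seconds is an arbitrary default.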

Complete GitHub Actions and Azure DevOps pipeline examples, including the artifact upload step, are in Step 1 above.

Limitations

Limitation	Workaround
Only Azure Blob Storage is supported as an artifact source.	Upload artifacts to Azure Blob from your CI pipeline before syncing.
Syncs are on-demand; there is no built-in cron scheduler for dbt connections.	Trigger the sync endpoint from your CI/CD pipeline after the artifact upload.
Lineage extraction requires `compiled_code` in `manifest.json`.	Run `dbt compile` or `dbt docs generate` — not just `dbt run` — to populate compiled SQL.
Client secrets are write-only via the UI.	To rotate a secret, enter the new value under Metadata Engine → Setup → Authentication.

Troubleshooting

Symptom	Likely cause	Fix
Test connection fails	Wrong credentials or missing RBAC	Verify the App Registration has Storage Blob Data Reader on the container; double-check tenant ID, client ID, and secret.
Sync completes but 0 nodes synced	Wrong blob prefix or `manifest.json` not present	Confirm the prefix exactly matches the path in your container; ensure `manifest.json` was uploaded before syncing.
Lineage tab is empty on all nodes	`compiled_code` absent in `manifest.json`	Run `dbt docs generate` (not just `dbt run`) so the manifest includes compiled SQL, which is needed to extract lineage.
Data Quality tab is empty	`run_results.json` absent or not uploaded	Run `dbt test` and upload `run_results.json` to the same container prefix before syncing.
Columns show no data types	`catalog.json` absent	Run `dbt docs generate` and upload `catalog.json`; without it, types fall back to `manifest.json` declarations which may be blank.
Sync fails with authentication error after secret rotation	The secret in Key Vault is stale	Enter the new client secret under Metadata Engine → Setup → Authentication and save to update Key Vault.
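
When the Lineage tab is empty, it can be quicker to inspect the manifest locally than to re-run a sync. The sketch below assumes jq is installed and a manifest that uses the compiled_code field named in this guide (older dbt versions used a different field name, which this check does not cover); models_without_compiled_code is a hypothetical helper.

```shell
# Print the unique_id of every model whose compiled_code is missing or empty.
# Requires jq. models_without_compiled_code is a hypothetical helper.
models_without_compiled_code() {
  jq -r '
    .nodes[]?
    | select(.resource_type == "model")
    | select((.compiled_code // "") == "")
    | .unique_id
  ' "${1:-./target/manifest.json}"
}
```

No output means every model in the manifest carries compiled SQL; any IDs printed point to models that were not compiled in the run that produced the file.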