
Data Catalog: assets, columns, tags, and documents

The Data Catalog is the foundation of Datahub's governance model: the registry of every dataset that matters in your organisation. For each asset it records the name, the columns, the owner, the business terms that describe it, the tags that classify it, and the documents that explain it. Other modules (Metrics, Glossary, Data Products, Contracts, Lineage, AI) all reference the Catalog as the single source of truth.

If your organisation can answer "which tables do we have, who owns them, and what do they mean?", you have a working catalog. The Data Catalog module is how you get there.

When to choose this

Reach for the Catalog when you want to:

  • Register a dataset. Make a Databricks table, an ADLS Delta table, an external dashboard, or any other asset visible inside Datahub with a name, owner, description, tags, and a status lifecycle.
  • Tag and classify your data. PII, GDPR-relevant, finance, customer — the platform's tags are first-class entities the Glossary, Metrics, AI, and Contracts modules all consume.
  • Manage documents that describe your data. Upload PDFs, paste markdown, or link external URLs. Documents are AI-chunked, semantically searched, and surfaced when relevant.
  • Govern lifecycle. Each asset transitions draft → under-review → published → archived with reviewer approval, just like the rest of the platform.
  • Find data by meaning. Vector embeddings let you search "tables containing customer email addresses" — even if no column is literally named that.

You do not need the Catalog for:

  • Row-level data. The Catalog stores metadata, not the data itself.
  • Schema discovery from a connection. Use the Metadata Engine for that: it discovers, the Catalog registers.
  • Glossary terms. These live in a separate module, Business Glossary, and link to assets via term-asset links.

What the Catalog looks like

Surface | Where | What you see
Overview | /data-catalog | KPIs (assets total, drafts in review, published, archived), trend chart of new assets, top tags, recent activity.
Assets list | /data-catalog/assets | Filterable table: name, type (table / view / dashboard / model / external), owner, status, tags, domain. KPIs at the top.
Asset detail: Overview | /data-catalog/assets/{id} | Description, owner team, type, source, status with state-machine actions, tags, glossary terms attached, audit metadata.
Asset detail: Columns | /data-catalog/assets/{id} (Columns tab) | Per-column metadata: name, type, description, PII flag, lineage.
Asset detail: Documents | /data-catalog/assets/{id} (Documents tab) | Documents attached to this asset, with previews.
Asset detail: Lineage | /data-catalog/assets/{id} (Lineage tab) | Upstream and downstream dependencies: other assets, metrics, dashboards, contracts.
Documents | /data-catalog/documents | A standalone document library across the catalog: searchable, filterable, with AI-extracted summaries.
System tables | /admin/system-tables | Tags, domains, departments, asset types: the registries the Catalog (and the rest of the platform) draws from.

Domain concepts

Concept | What it is | Notes
Data Asset | A registered dataset (table, view, dashboard, ML model, external URL). | Has a status lifecycle, owner, tags, columns.
Column Definition | A named, typed column on an asset, with optional description and PII flag. | Powers per-column lineage and PII discovery.
Tag | A free-form classification (e.g. PII, GDPR, finance). System-managed and tenant-extensible. | Used everywhere: glossary, metrics, alerts.
Domain | A bigger-than-a-tag grouping (Finance, HR, Sales). System table. | Most assets belong to one.
Document | A PDF, markdown, or URL attached to one or more assets. | AI-chunked and embedded for semantic search.
Document Chunk | A semantic chunk of a document (paragraph-sized). | What the AI retrieves when answering questions.
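
To make the relationships between these concepts concrete, here is a minimal sketch of how they could be modelled. It is an illustration only, not the platform's schema: every class and field name below (DataAsset, ColumnDefinition, is_pii, owner_team, and so on) is an assumption chosen to mirror the table above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class AssetStatus(Enum):
    DRAFT = "draft"
    UNDER_REVIEW = "under-review"
    PUBLISHED = "published"
    ARCHIVED = "archived"


@dataclass
class ColumnDefinition:
    name: str
    data_type: str
    description: str = ""
    is_pii: bool = False


@dataclass
class DocumentChunk:
    position: int  # order of the chunk within its document
    text: str


@dataclass
class Document:
    title: str
    kind: str  # "pdf", "markdown", or "url"
    chunks: list[DocumentChunk] = field(default_factory=list)


@dataclass
class DataAsset:
    name: str
    asset_type: str  # table / view / dashboard / model / external
    owner_team: str
    domain: Optional[str] = None
    status: AssetStatus = AssetStatus.DRAFT  # new assets start in draft
    tags: set[str] = field(default_factory=set)
    columns: list[ColumnDefinition] = field(default_factory=list)
    # The same Document object may appear on several assets.
    documents: list[Document] = field(default_factory=list)

    def pii_columns(self) -> list[str]:
        """Names of the columns flagged as PII."""
        return [c.name for c in self.columns if c.is_pii]


customers = DataAsset(
    name="customers",
    asset_type="table",
    owner_team="crm",
    tags={"PII"},
    columns=[
        ColumnDefinition("id", "bigint"),
        ColumnDefinition("email", "string", is_pii=True),
    ],
)
print(customers.status.value, customers.pii_columns())
```

Keeping the PII flag on the column rather than only on the asset is what makes per-column PII discovery possible in this model.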

Setup — what an admin needs to do once

Prereq | Where | Why
Tags & domains seeded | /admin/system-tables | The platform ships sensible defaults; tenants typically add 5–20 of their own.
Owner teams | /admin/teams | Every asset has an owner team.
Roles | /rolegroups | datacatalog.assets.read, datacatalog.assets.write, datacatalog.assets.approve.
AI provider (optional) | /admin/integrations → AI Provider Keys | Required for AI-enhanced document processing (chunking, summarisation, embeddings).
Metadata Engine connection (optional) | /admin/integrations → Metadata Engine | If you want bulk import from a warehouse, connect it.

Registering an asset

Three on-ramps:

  1. Manual: /data-catalog/assets → New asset → name, type, source, description, owner, tags, columns. Use for one-off registrations.
  2. Bulk import: Import button on /data-catalog/assets. Upload a CSV or Excel file; the platform previews the rows, validates them, optionally enriches them with AI (suggested descriptions, tags, owners), and lets you confirm row by row before commit.
  3. From a Metadata Engine snapshot: connect a warehouse, take a snapshot, then Promote selected schemas to catalog assets. The catalog row carries a back-pointer to the snapshot, so column changes show up automatically on the next snapshot.

Whichever route you take, the new asset starts in draft status.
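
Before uploading a bulk import file, it can help to run a quick local check for the two problems the import preview most often reports: missing required columns and invalid status values. The sketch below is a hypothetical pre-flight script, not the platform's validator. The column names (name, type, owner, status) and the exact status spellings are assumptions; check them against your own import template.

```python
import csv
import io

# Hypothetical column names and allowed values; adjust to your import template.
REQUIRED_COLUMNS = {"name", "type", "owner"}
VALID_STATUSES = {"draft", "under-review", "published", "archived"}


def validate_import(csv_text: str) -> list[str]:
    """Return human-readable errors; an empty list means the file looks OK."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing required columns: {', '.join(sorted(missing))}"]

    errors = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        for column in sorted(REQUIRED_COLUMNS):
            if not (row[column] or "").strip():
                errors.append(f"row {row_number}: '{column}' is empty")
        status = (row.get("status") or "").strip()
        if status and status not in VALID_STATUSES:
            errors.append(f"row {row_number}: invalid status '{status}'")
    return errors


sample = "name,type,owner,status\ncustomers,table,crm,draft\norders,table,,live\n"
for problem in validate_import(sample):
    print(problem)
```

The check reports every problem it finds instead of stopping at the first one, so the whole file can be fixed in a single pass before re-uploading.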

The asset lifecycle

Status | Meaning | Who can move it
Draft | Author is iterating; not yet visible to consumers in default lists. | The author or anyone with datacatalog.assets.write.
Under review | Submitted for approval; reviewers see it in their Tasks inbox. | Whoever clicks Submit for review on a draft (requires datacatalog.assets.write).
Published | Source of truth; visible to everyone with read access; can be wrapped in Data Products and linked to contracts. | A reviewer with datacatalog.assets.approve.
Archived | Soft-deleted. Hidden from default lists. Existing references show archived state. | Anyone with datacatalog.assets.write (with confirmation).

Permanent delete is allowed only on archived assets, requires confirmation, and warns about every dependent metric / contract / product / glossary link.
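
The lifecycle above behaves like a small state machine, and it can be sketched as one. This is an illustration under assumptions rather than the platform's implementation: the action names (submit_for_review, approve, archive) are hypothetical, and the sketch follows only the linear chain draft → under-review → published → archived. Whether the platform also allows rejecting a review back to draft, or archiving an asset that was never published, is not stated here, so those moves are left out.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    UNDER_REVIEW = "under-review"
    PUBLISHED = "published"
    ARCHIVED = "archived"


# Hypothetical action names.
# action -> (required status, resulting status, required permission)
TRANSITIONS = {
    "submit_for_review": (Status.DRAFT, Status.UNDER_REVIEW, "datacatalog.assets.write"),
    "approve": (Status.UNDER_REVIEW, Status.PUBLISHED, "datacatalog.assets.approve"),
    "archive": (Status.PUBLISHED, Status.ARCHIVED, "datacatalog.assets.write"),
}


def apply_action(status: Status, action: str, permissions: set[str]) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if action not in TRANSITIONS:
        raise ValueError(f"unknown action: {action}")
    required_status, new_status, permission = TRANSITIONS[action]
    if status is not required_status:
        raise ValueError(f"cannot {action} from {status.value}")
    if permission not in permissions:
        raise PermissionError(f"{action} requires {permission}")
    return new_status


def can_permanently_delete(status: Status) -> bool:
    """Permanent delete is only offered for archived assets."""
    return status is Status.ARCHIVED


editor = {"datacatalog.assets.write"}
approver = {"datacatalog.assets.approve"}

status = apply_action(Status.DRAFT, "submit_for_review", editor)
status = apply_action(status, "approve", approver)
print(status.value, can_permanently_delete(status))
```

Encoding the transitions as a table keeps the permission check and the status check in one place, which makes it easy to see that an editor alone cannot publish an asset.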

Tags and what they do

Tags are not just labels. The platform uses them in:

  • HERC — "find assets with PII tag" routes through tag search.
  • Logic Engine — alert rules can target all assets carrying tag X.
  • Workflows — workflow definitions can route reviews based on tags ("PII assets need extra approver").
  • Insights — KPI cards can filter "this dashboard's source assets that carry the finance tag".
  • DNA — tag co-occurrence is one of the signals the knowledge graph uses to cluster assets.

Tags ship as a system table — /admin/system-tables → Tags. Tenant-defined tags sit alongside the platform defaults; the platform defaults are not deletable.
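
As an illustration of tag-driven behaviour, the sketch below shows how a review route could be derived from an asset's tags, in the spirit of "PII assets need extra approver". It is a hypothetical model, not the Workflows module's configuration format: the rule table, the approver names, and the review_steps function are all assumptions.

```python
# Hypothetical routing rules: tag -> extra approver step added to the review.
EXTRA_APPROVERS_BY_TAG = {
    "PII": "privacy-officer",
    "finance": "finance-controller",
}


def review_steps(asset_tags: set[str], default_step: str = "catalog-approver") -> list[str]:
    """Return the ordered approval steps for an asset, based on its tags."""
    steps = [default_step]
    # Sort the tags so the result does not depend on set iteration order.
    for tag in sorted(asset_tags):
        extra = EXTRA_APPROVERS_BY_TAG.get(tag)
        if extra and extra not in steps:
            steps.append(extra)
    return steps


print(review_steps({"PII", "customer"}))
```

Tags with no matching rule, such as customer in the usage line, simply leave the route unchanged.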

Documents

Upload a PDF, paste markdown, or link a URL. Each document is:

  • Attached to one or more assets via a junction.
  • Chunked semantically by the AI (preferring paragraph and section boundaries) into ~500-token pieces.
  • Embedded with vector embeddings for semantic search.
  • Summarised at the document level so HERC can give a one-paragraph answer when asked.

Result: when a viewer opens an asset, the Documents tab is searchable, and HERC can answer "what does the customers table actually contain?" by reading the attached documents.

Document processing is asynchronous — large PDFs (>50 pages) can take a minute. Status is shown on the document row.
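
To give a feel for the chunking step described above, here is a simplified stand-in. The platform's chunker is AI-driven and its internals are not documented here; this sketch only shows the general idea of packing whole paragraphs into pieces of roughly 500 tokens. The function name chunk_document is hypothetical, and tokens are approximated by whitespace-separated words, which a real tokenizer would count differently.

```python
import re


def chunk_document(text: str, max_tokens: int = 500) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    Tokens are approximated by whitespace-separated words.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for paragraph in paragraphs:
        words = paragraph.split()
        # A paragraph that alone exceeds the budget is split on word boundaries.
        if len(words) > max_tokens:
            if current:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            for start in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[start:start + max_tokens]))
            continue
        # Start a new chunk when adding this paragraph would overflow.
        if current and current_len + len(words) > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += len(words)

    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "a b c\n\nd e\n\nf g h i"
print(chunk_document(doc, max_tokens=5))
```

Because paragraphs are kept whole wherever they fit, documents written with clear paragraph and section breaks produce cleaner chunks.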

Linking assets to other entities

Link | Where to set it up | Why
Tag | Asset detail → Tags chip selector | Classify the asset for search, alerts, workflows.
Glossary term | Asset detail → Terms tab → Add term (or from the term side: term detail → Linked assets) | Bind a business definition to a physical dataset.
Data Product | Asset detail → Register as Data Product | Wrap the asset to put it on a contract.
Document | Asset detail → Documents tab → Attach document | Add a description / runbook / ERD.
Lineage upstream/downstream | Inferred automatically when the asset is referenced by another entity (metric, alert, dashboard). | The Lineage tab shows the graph.
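
Because lineage is inferred from explicit references, it can be pictured as a graph built from (referencing entity, referenced entity) pairs. The sketch below is a hypothetical model of that idea, not the platform's lineage engine: the identifiers and the REFERENCES list are invented for illustration. It also shows how a "what depends on this asset?" question, such as the warning shown before a permanent delete, could be answered from the same data.

```python
from collections import defaultdict

# Hypothetical explicit references: (referencing entity, referenced entity).
REFERENCES = [
    ("metric:monthly_revenue", "asset:orders"),
    ("metric:monthly_revenue", "asset:customers"),
    ("alert:late_orders", "asset:orders"),
    ("dashboard:sales_overview", "metric:monthly_revenue"),
]


def build_downstream(references):
    """Map each entity to the entities that directly reference it."""
    downstream = defaultdict(set)
    for source, target in references:
        downstream[target].add(source)
    return downstream


def all_dependents(entity, downstream):
    """Collect direct and transitive dependents with an iterative traversal."""
    seen = set()
    stack = [entity]
    while stack:
        for dependent in downstream.get(stack.pop(), ()):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return seen


downstream = build_downstream(REFERENCES)
print(sorted(all_dependents("asset:orders", downstream)))
```

An asset that nothing references yields an empty set, which corresponds to an empty Lineage tab.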

Roles

Role | Capability
datacatalog.assets.read | View assets, columns, documents, lineage.
datacatalog.assets.write | Create / edit / archive assets, attach tags, attach documents.
datacatalog.assets.approve | Approve assets under review so that they become published.
datacatalog.export.manage | Configure Catalog export targets (e.g. push to a downstream catalog).

Default role groups: Catalog.Viewer, Catalog.Editor, Catalog.Approver, Catalog.Admin.

Limitations

Limit | Why | Workaround
The Catalog stores metadata, not data. | We are not your warehouse; we describe it. | Connect a warehouse via the Metadata Engine; the rows live there.
Lineage is inferred from explicit references. | We don't parse arbitrary SQL files looking for joins. | Define metrics on the join; that produces explicit lineage.
Documents are chunked by the platform, not by the author. | Manual chunking doesn't scale across thousands of docs. | Author sections clearly with headings; the chunker preserves structure.
Vector search depends on having an AI provider configured. | Embeddings need a model. | Configure a provider, or use plain text search.
Bulk archive is gated by per-asset confirmation. | Bulk archive of governed assets is a footgun. | Run the archive in batches; the platform retains undo (restore) up until permanent delete.

Audit & compliance

Question a CISO might ask | Where to look
"Which assets are tagged PII?" | /data-catalog/assets filtered by tag PII.
"Who approved publishing the customers asset?" | Asset detail → audit history.
"How do we ensure new PII columns don't slip through?" | In /admin/workflows, configure a workflow that routes assets carrying a PII column to an extra approver step.
"Where do uploaded documents live?" | In your tenant's PostgreSQL datacatalog.documents (text + embeddings via pgvector). Uploaded files are stored as binary blobs.
"Can we pull the catalog into our other tools?" | Yes. Configure an Export Target in /admin/export-targets. Standard formats are supported (OpenLineage, ODCS).

Troubleshooting

Symptom | Likely cause | Fix
Bulk import preview shows "Validation failed" | Required columns missing or status invalid. | The preview shows per-row errors; fix them in the file and re-upload.
Document embed status stays processing | The AI provider is down or the file is very large. | Check /admin/audit-log for datacatalog.document.embed_failed; retry from the document row.
Tag is missing from the selector | It's not in the system table. | Admin → /admin/system-tables → Tags → add.
Asset's Lineage tab is empty | Nothing references this asset yet. | Bind a metric, build an insight card, or attach an alert to populate lineage.
Submit for review button is greyed out | You don't have datacatalog.assets.write, or required fields are missing. | Check your role; complete description, owner, type.

See also

  • Business Glossary — terms that describe what assets mean.
  • Metadata Engine — discover schemas in a warehouse and snapshot them; promote to catalog assets.
  • Data Products — wrap an asset to put it on a contract.
  • Organisation DNA — assets are nodes in the knowledge graph.
  • HERC — natural-language search across the catalog.