
HERC ↔ Databricks Genie

HERC can now answer natural-language questions about data living in your Databricks workspace by handing the question off to a Databricks Genie space and weaving the result back into the chat. Use it when the answer lives in a warehouse table (revenue, customers, orders, pipeline, churn) — not when it lives in your Datahub catalog, glossary, or contracts.

What it looks like for the user

You: What were our top 10 customers by revenue last quarter?

HERC: I asked the Sales Analytics Genie space. Acme Corp led at €842 k, followed by Globex (€640 k) and Initech (€515 k). Here's the top 10:

| Customer | Revenue (€) |
|---|---|
| Acme Corp | 842,000 |
| Globex | 640,000 |

Want me to compare this to the previous quarter, or slice it by region?

A follow-up like "now split by region" re-uses the same Genie conversation, so the SQL context (the previous quarter's filter, the revenue measure) carries over. Answers produced this way are much more accurate than when the same question starts a fresh conversation, so let HERC chain follow-ups whenever the user is iterating.
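The conversation re-use described above can be sketched roughly as follows. This is a minimal illustration, not HERC's actual implementation: the `GenieThreads` class and the `start` / `follow_up` callables are hypothetical stand-ins for whatever calls open a Genie conversation and post a follow-up message to it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for the real Genie calls (assumptions, not HERC's API).
StartFn = Callable[[str, str], Tuple[str, str]]  # (space_id, question) -> (conversation_id, answer)
FollowUpFn = Callable[[str, str, str], str]      # (space_id, conversation_id, question) -> answer


@dataclass
class GenieThreads:
    """Remembers one Genie conversation per (chat session, Genie space)."""

    start: StartFn
    follow_up: FollowUpFn
    _threads: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def ask(self, chat_id: str, space_id: str, question: str) -> str:
        key = (chat_id, space_id)
        conversation_id = self._threads.get(key)
        if conversation_id is None:
            # First question in this chat for this space: open a new conversation.
            conversation_id, answer = self.start(space_id, question)
            self._threads[key] = conversation_id
            return answer
        # Later questions re-use the conversation, so earlier filters carry over.
        return self.follow_up(space_id, conversation_id, question)
```

Keying on both the chat session and the space keeps a follow-up from being posted into a conversation that belongs to a different space or a different user's chat.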

When to use it (and when not to)

Use HERC's Databricks data path when:

  • The question is about warehouse data — sales, finance, marketing funnels, anything in your Unity Catalog tables.
  • You want the answer to respect each user's UC permissions — Genie queries run under the calling user's Databricks identity via the same oauth_u2m flow used by embedded dashboards. Users only see data they would see if they opened the Databricks UI themselves.
  • The question is bounded to a Genie space's tables. Genie is not a general SQL editor — it shines when the space scope is narrow (10-20 tables, well described).

Use other HERC capabilities when:

  • The question is about governance metadata (which terms exist, who owns this asset, what's in this data product) — use the regular HERC catalog / glossary / contracts agents. They are much faster than a Genie round-trip.
  • You need detail beyond ~10 rows — HERC's tool returns at most 10 sample rows so the chat transcript stays readable. For larger result sets, open the space in Databricks and run the SQL Genie suggested.
  • You need to modify data — Genie is read-only and so is HERC's integration with it.
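The 10-row cap mentioned above could be applied along these lines. This is a sketch under assumptions: the function name, the result shape, and the `truncated` / `total_rows` fields are illustrative and not taken from HERC's tool contract; only the limit of 10 comes from this page.

```python
from typing import Any, Dict, Sequence

MAX_SAMPLE_ROWS = 10  # the cap stated in this document


def sample_rows(columns: Sequence[str],
                rows: Sequence[Sequence[Any]],
                limit: int = MAX_SAMPLE_ROWS) -> Dict[str, Any]:
    """Keep at most `limit` rows and record whether anything was dropped."""
    kept = [dict(zip(columns, row)) for row in rows[:limit]]
    return {
        "rows": kept,
        "total_rows": len(rows),
        "truncated": len(rows) > limit,
    }
```

Returning a `truncated` flag alongside the rows would let the chat reply tell the user when to open the space in Databricks for the full result.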

Setup (admin)

The integration has two prerequisites:

  1. A Databricks workspace integration in Connected apps that uses User-to-Machine OAuth (oauth_u2m). The existing setup at Admin → Connected apps → Databricks is exactly what is needed — see Databricks: per-user authentication (U2M OAuth) if you have not yet connected a workspace.
  2. At least one Genie space registered in Datahub. Spaces themselves are authored in Databricks — Datahub does not create them. Datahub keeps a local registry of which spaces HERC is allowed to use, who's already curated them, and what they cover.

Sync Genie spaces from Databricks

  1. Go to Administration → Genie spaces in Datahub. The page is gated by the databricksgenie.manage role.
  2. Each connected Databricks workspace shows up as a Sync {workspace name} button at the top of the page. Click it.
  3. Datahub queries Databricks for every Genie space accessible to your workspace identity (list_spaces), then fetches the metadata for each (get_space with include_serialized_space=True), and upserts them into the registry. The toast at the end tells you how many spaces were discovered, created (new), and updated (already in the registry — refreshed table list / sample questions).
  4. New spaces appear in the table as Active by default.

Sync is manual. We do this on purpose — newly-created Genie spaces should be reviewed by an admin (and optionally given a description override) before HERC starts steering questions at them. Re-syncing is idempotent and safe to run as often as you like; description overrides you've made in Datahub are preserved across syncs.
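The sync steps above amount to an idempotent upsert. The sketch below illustrates that shape with an in-memory registry; `SpaceMetadata`, `RegistryRow`, and the `GenieClient` protocol are hypothetical types invented for the example, and the client methods only mirror the `list_spaces` / `get_space` names used on this page rather than any specific SDK signature.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Protocol


@dataclass
class SpaceMetadata:
    space_id: str
    title: str
    description: str
    tables: List[str]


@dataclass
class RegistryRow:
    metadata: SpaceMetadata
    description_override: Optional[str] = None  # set by an admin in Datahub
    active: bool = True                         # new spaces start as Active


class GenieClient(Protocol):
    """Hypothetical client; method names mirror those mentioned in the text."""

    def list_spaces(self) -> Iterable[str]: ...
    def get_space(self, space_id: str) -> SpaceMetadata: ...


def sync_spaces(client: GenieClient, registry: Dict[str, RegistryRow]) -> Dict[str, int]:
    """Upsert every discovered space and return the counts shown in the toast."""
    counts = {"discovered": 0, "created": 0, "updated": 0}
    for space_id in client.list_spaces():
        counts["discovered"] += 1
        metadata = client.get_space(space_id)
        row = registry.get(space_id)
        if row is None:
            registry[space_id] = RegistryRow(metadata=metadata)
            counts["created"] += 1
        else:
            # Refresh Databricks-owned fields only. The override is preserved as
            # the text states; leaving `active` untouched is an assumption.
            row.metadata = metadata
            counts["updated"] += 1
    return counts
```

Because admin-owned fields live next to the synced metadata rather than inside it, running the sync repeatedly never overwrites curation work.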

Curate the registry

For each space row, click the pencil icon to:

  • Override the description HERC sees. HERC's LLM picks a space based on the title, description, and table list. If two spaces overlap (e.g. one for "Sales — historical" and one for "Sales — live"), a short, opinionated description ("Use this space for closed deals only; the live space has open opportunities.") will dramatically improve routing accuracy.
  • Toggle Active off. Inactive spaces are hidden from HERC entirely — they don't show up in list_genie_spaces and they cannot be the target of an ask call. Use this to retire spaces or to silence noisy ones during incident response without deleting them from the registry.

The description override is preserved across re-syncs. Toggling Active off does not delete the row.
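How the two curation controls might affect what HERC sees is sketched below. The `Space` type and the three helper functions are hypothetical names chosen for illustration; the behaviour they model (the override replaces the synced description, and inactive spaces are neither listed nor askable) is what this section describes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Space:
    space_id: str
    title: str
    description: str                            # synced from Databricks
    description_override: Optional[str] = None  # edited in Datahub
    active: bool = True


def effective_description(space: Space) -> str:
    """The text the routing LLM would see: the override wins when present."""
    return space.description_override or space.description


def visible_spaces(spaces: Iterable[Space]) -> List[Space]:
    """Inactive spaces are hidden from listings entirely."""
    return [s for s in spaces if s.active]


def resolve_target(spaces: Iterable[Space], space_id: str) -> Space:
    """Refuse to target a space that is unknown or inactive."""
    for space in visible_spaces(spaces):
        if space.space_id == space_id:
            return space
    raise LookupError(f"Genie space {space_id!r} is not available")
```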

Roles

Two roles ship with the integration:

| Role | What it grants |
|---|---|
| databricksgenie.read | List Genie spaces from the Datahub registry (used by HERC's tools). Bundled into the DatabricksGenie.Viewer role group, along with the platform baselines herc.read and tasks.read. |
| databricksgenie.manage | Sync spaces from Databricks and edit registry overrides. Bundled into the DatabricksGenie.Manager role group with the same baselines. |

Administrators inherit both. Grant DatabricksGenie.Viewer to anyone who should be able to ask HERC questions backed by Genie; grant DatabricksGenie.Manager to a smaller group of stewards.
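The role bundling above can be pictured as a simple mapping. This is only an illustration of the relationships described on this page: the `effective_roles` helper and the `is_admin` flag are hypothetical, and the page does not say whether the Manager group also includes databricksgenie.read, so the sketch leaves it out.

```python
from typing import Iterable, Set

BASELINE_ROLES = frozenset({"herc.read", "tasks.read"})

# Role groups as described in the table; contents beyond that are not assumed.
ROLE_GROUPS = {
    "DatabricksGenie.Viewer": BASELINE_ROLES | {"databricksgenie.read"},
    "DatabricksGenie.Manager": BASELINE_ROLES | {"databricksgenie.manage"},
}

GENIE_ROLES = frozenset({"databricksgenie.read", "databricksgenie.manage"})


def effective_roles(groups: Iterable[str], is_admin: bool = False) -> Set[str]:
    """Expand role groups into roles; administrators inherit both Genie roles."""
    roles: Set[str] = set()
    for group in groups:
        roles |= ROLE_GROUPS.get(group, frozenset())
    if is_admin:
        roles |= GENIE_ROLES
    return roles
```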

What end users need

End users do not need access to the admin Genie spaces page. They need:

  1. The DatabricksGenie.Viewer role (or admin-equivalent).
  2. To have completed the per-user Databricks consent for the workspace whose Genie spaces they are asking about. If they have not, HERC will respond with a message asking them to connect their Databricks account — and the platform will create a high-priority Connect to Databricks task in their inbox automatically. Once they consent, every subsequent HERC question backed by Genie will run as them.

See Databricks: per-user authentication (U2M OAuth) for the consent flow.

How HERC routes Databricks questions

HERC's router uses two signals to decide whether a question should go to Genie:

  • Fast routing on keywords. Phrases like "ask databricks", "from databricks", "in the warehouse", "which genie space", and the Dutch equivalents (uit databricks, in onze data, vraag aan databricks) route directly to the Databricks data specialist with no LLM classifier call.
  • LLM orchestration when the question is ambiguous. A bare "show me top customers by revenue" could go to the Insights agent (which edits dashboard cards) or the Databricks data agent (which actually queries the warehouse). The orchestration layer's LLM resolves which agent owns the turn based on the surrounding conversation.

The generalist datahub agent has access to the same Genie tools as the Databricks data specialist, so even if routing picks the generalist, HERC can still reach Genie.
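The two-stage routing can be sketched as a keyword check that short-circuits a classifier call. The phrase list comes from this page; the agent label `"databricks_data"`, the function names, and the substring matching strategy are assumptions made for the example, and `classify` stands in for the orchestration layer's LLM.

```python
from typing import Callable, Optional

# Phrases listed in the text, including the Dutch equivalents.
FAST_ROUTE_PHRASES = (
    "ask databricks",
    "from databricks",
    "in the warehouse",
    "which genie space",
    "uit databricks",
    "in onze data",
    "vraag aan databricks",
)


def fast_route(question: str) -> Optional[str]:
    """Return the Databricks agent label on a keyword hit, else None."""
    text = question.casefold()
    if any(phrase in text for phrase in FAST_ROUTE_PHRASES):
        return "databricks_data"  # hypothetical agent label
    return None


def route(question: str, classify: Callable[[str], str]) -> str:
    """Try the cheap keyword path first; call the classifier only if it misses."""
    return fast_route(question) or classify(question)
```

Checking keywords first means unambiguous questions skip the classifier call, which is the latency saving the fast path is there for.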

Limitations and known trade-offs

  • Sync is manual. Newly-created Genie spaces are invisible to HERC until an admin clicks Sync. We chose this so admins can review spaces before HERC steers questions at them; we may add a nightly auto-sync in a later release.
  • Tool result row cap. HERC returns at most 10 sample rows from each Genie answer. For larger result sets, users should open the space directly in Databricks and use the SQL Genie produced.
  • No write-back. HERC's Genie integration is read-only. HERC cannot create, modify, or delete Genie spaces from Datahub.
  • No new Genie spaces from Datahub. Spaces are authored in Databricks; Datahub keeps a registry, not a builder.
  • 90-second wall-clock cap per question. Most Genie answers complete in 10-60 s. Beyond 90 s HERC times out and tells the user, because past that point the chat context cost outweighs the value of waiting longer. Heavy queries should run interactively in Databricks.
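A wall-clock cap like the 90-second limit above can be enforced in several ways. The sketch below uses a worker thread and `Future.result(timeout=...)` from the standard library; the function name and the message wording are illustrative, and `ask` is a hypothetical callable wrapping the Genie request.

```python
import concurrent.futures
from typing import Callable

GENIE_TIMEOUT_SECONDS = 90.0  # the cap stated in this document


def ask_with_deadline(ask: Callable[[], str],
                      timeout: float = GENIE_TIMEOUT_SECONDS) -> str:
    """Run `ask` in a worker thread and give up after `timeout` seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(ask)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # The worker thread is not killed here; a real integration would
        # also cancel the in-flight Genie request.
        return (f"Genie did not answer within {timeout:g} s. "
                "Try a simpler question or run it directly in Databricks.")
    finally:
        # Do not block the chat turn waiting for the worker to finish.
        pool.shutdown(wait=False)
```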

Troubleshooting

| Symptom | What's happening | What to do |
|---|---|---|
| HERC says "No Genie spaces are registered yet." | No rows in the Datahub registry. | Sync from Administration → Genie spaces, then retry. |
| HERC says "You need to connect your Databricks account first." | The asking user has not yet consented for the workspace that owns the Genie space. | Open the Connect to Databricks task that just appeared in your inbox, or visit /databricks/consent/{integration-id}. |
| HERC says "I don't see a Genie space covering …" | The registry has spaces, but none of their descriptions or tables match the question's domain. | Ask an admin either to edit a space's description override to broaden HERC's understanding, or to create a Genie space covering that domain in Databricks and sync it in. |
| Genie answer takes > 90 s and times out | Cold warehouse, pathological query, or large scan. | Retry once (warehouses cache after the first run). For repeated timeouts, simplify the question or run it directly in Databricks. |
| Genie returns a FAILED status with a SQL error | Genie could not produce a valid query. | HERC surfaces the error verbatim. The user can rephrase, or an admin can refine the space's description / sample questions in Databricks so future questions land cleaner. |

See also