HERC ↔ Databricks Genie
HERC can now answer natural-language questions about data living in your Databricks workspace by handing the question off to a Databricks Genie space and weaving the result back into the chat. Use it when the answer lives in a warehouse table (revenue, customers, orders, pipeline, churn) — not when it lives in your Datahub catalog, glossary, or contracts.
What it looks like for the user
You: What were our top 10 customers by revenue last quarter?
HERC: I asked the Sales Analytics Genie space. Acme Corp led at €842 k, followed by Globex (€640 k) and Initech (€515 k). Here's the top 10:

| Customer | Revenue (€) |
|---|---|
| Acme Corp | 842,000 |
| Globex | 640,000 |
| … | … |

Want me to compare this to the previous quarter, or slice it by region?
A follow-up like "now split by region" reuses the same Genie conversation, so the SQL context (the last-quarter filter, the revenue measure) carries over. Follow-ups answered this way are noticeably more accurate than the same question asked from scratch, so let HERC chain follow-ups whenever the user is iterating.
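The chaining behaviour can be pictured as a small cache of Genie conversation IDs, keyed by chat and space. The sketch below is illustrative only: `GenieClient`, `start_conversation`, `send_followup`, `GenieSession`, and `FakeGenie` are hypothetical names, not HERC's implementation or the Databricks API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class GenieClient(Protocol):
    """Hypothetical minimal interface to a Genie space (not the real SDK)."""

    def start_conversation(self, space_id: str, question: str) -> tuple[str, str]: ...

    def send_followup(self, space_id: str, conversation_id: str, question: str) -> str: ...


@dataclass
class GenieSession:
    """Remembers one Genie conversation per (chat, space) so follow-ups share context."""

    client: GenieClient
    conversations: dict[tuple[str, str], str] = field(default_factory=dict)

    def ask(self, chat_id: str, space_id: str, question: str) -> str:
        key = (chat_id, space_id)
        conversation_id = self.conversations.get(key)
        if conversation_id is None:
            # First question in this chat for this space: open a new conversation.
            conversation_id, answer = self.client.start_conversation(space_id, question)
            self.conversations[key] = conversation_id
            return answer
        # Follow-up: reuse the conversation so earlier filters and measures carry over.
        return self.client.send_followup(space_id, conversation_id, question)


class FakeGenie:
    """In-memory stand-in used only to exercise the sketch."""

    def __init__(self) -> None:
        self.started = 0
        self.history: dict[str, list[str]] = {}

    def start_conversation(self, space_id: str, question: str) -> tuple[str, str]:
        self.started += 1
        conversation_id = f"conv-{self.started}"
        self.history[conversation_id] = [question]
        return conversation_id, f"answer to: {question}"

    def send_followup(self, space_id: str, conversation_id: str, question: str) -> str:
        self.history[conversation_id].append(question)
        earlier = len(self.history[conversation_id]) - 1
        return f"answer to: {question} (with {earlier} earlier turn(s))"
```

Keying the cache on both chat and space means a question routed to a different space starts a clean conversation instead of inheriting unrelated context.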
When to use it (and when not to)
Use HERC's Databricks data path when:
- The question is about warehouse data — sales, finance, marketing funnels, anything in your Unity Catalog tables.
- You want the answer to respect each user's UC permissions. Genie queries run under the calling user's Databricks identity via the same `oauth_u2m` flow used by embedded dashboards. Users only see data they would see if they opened the Databricks UI themselves.
- The question is bounded to a Genie space's tables. Genie is not a general SQL editor; it shines when the space scope is narrow (10-20 tables, well described).
Use other HERC capabilities when:
- The question is about governance metadata (which terms exist, who owns this asset, what's in this data product) — use the regular HERC catalog / glossary / contracts agents. They are much faster than a Genie round-trip.
- You need detail beyond ~10 rows — HERC's tool returns at most 10 sample rows so the chat transcript stays readable. For larger result sets, open the space in Databricks and run the SQL Genie suggested.
- You need to modify data — Genie is read-only and so is HERC's integration with it.
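The 10-row cap mentioned above amounts to truncating the result before it reaches the chat while still reporting that more rows exist. A minimal sketch, assuming rows arrive as a list of dictionaries; `MAX_SAMPLE_ROWS`, `SampledResult`, and `sample_rows` are hypothetical names.

```python
from __future__ import annotations

from typing import Any, NamedTuple, Sequence

MAX_SAMPLE_ROWS = 10  # cap stated on this page; the constant name is hypothetical


class SampledResult(NamedTuple):
    rows: list[dict[str, Any]]  # at most `limit` rows, in original order
    total_rows: int             # size of the full result set
    truncated: bool             # True when rows were dropped


def sample_rows(rows: Sequence[dict[str, Any]], limit: int = MAX_SAMPLE_ROWS) -> SampledResult:
    """Keep the first `limit` rows and record whether anything was cut."""
    kept = list(rows[:limit])
    return SampledResult(rows=kept, total_rows=len(rows), truncated=len(rows) > limit)
```

Carrying the `truncated` flag alongside the rows lets the reply point the user to Databricks only when the sample is actually incomplete.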
Setup (admin)
The integration has two prerequisites:
- A Databricks workspace integration in Connected apps that uses User-to-Machine OAuth (`oauth_u2m`). The existing setup at Admin → Connected apps → Databricks is exactly what is needed; see Databricks: per-user authentication (U2M OAuth) if you have not yet connected a workspace.
- At least one Genie space registered in Datahub. Spaces themselves are authored in Databricks; Datahub does not create them. Datahub keeps a local registry of which spaces HERC is allowed to use, who has already curated them, and what they cover.
Sync Genie spaces from Databricks
- Go to Administration → Genie spaces in Datahub. The page is gated by the `databricksgenie.manage` role.
- Each connected Databricks workspace shows up as a Sync {workspace name} button at the top of the page. Click it.
- Datahub queries Databricks for every Genie space accessible to your workspace identity (`list_spaces`), then fetches the metadata for each (`get_space` with `include_serialized_space=True`), and upserts them into the registry. The toast at the end tells you how many spaces were discovered, created (new), and updated (already in the registry, with table list and sample questions refreshed).
- New spaces appear in the table as Active by default.
Sync is manual. We do this on purpose — newly-created Genie spaces should be reviewed by an admin (and optionally given a description override) before HERC starts steering questions at them. Re-syncing is idempotent and safe to run as often as you like; description overrides you've made in Datahub are preserved across syncs.
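The idempotent upsert described above can be sketched as follows. This is not Datahub's code: `RemoteSpace`, `RegistryRow`, `sync_spaces`, and every field name are assumptions made for illustration, and the registry is a plain dictionary standing in for the real store.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RemoteSpace:
    """Metadata as fetched from Databricks (field names are assumptions)."""

    space_id: str
    title: str
    description: str
    tables: list[str]


@dataclass
class RegistryRow:
    space_id: str
    title: str
    description: str
    tables: list[str]
    description_override: str | None = None  # admin-curated, never written by sync
    active: bool = True  # new spaces start Active


def sync_spaces(registry: dict[str, RegistryRow], remote: list[RemoteSpace]) -> dict[str, int]:
    """Upsert remote spaces into the registry and return the counts shown in the toast."""
    created = updated = 0
    for space in remote:
        row = registry.get(space.space_id)
        if row is None:
            registry[space.space_id] = RegistryRow(
                space.space_id, space.title, space.description, list(space.tables)
            )
            created += 1
        else:
            # Refresh only Databricks-owned fields. Leaving `active` untouched
            # is an assumption; the override is preserved as the page states.
            row.title = space.title
            row.description = space.description
            row.tables = list(space.tables)
            updated += 1
    return {"discovered": len(remote), "created": created, "updated": updated}
```

Because the loop never writes `description_override`, running the sync twice with the same input changes nothing except the created and updated counts.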
Curate the registry
For each space row, click the pencil icon to:
- Override the description HERC sees. HERC's LLM picks a space based on the title, description, and table list. If two spaces overlap (e.g. one for "Sales — historical" and one for "Sales — live"), a short, opinionated description ("Use this space for closed deals only; the live space has open opportunities.") will dramatically improve routing accuracy.
- Toggle Active off. Inactive spaces are hidden from HERC entirely: they don't show up in `list_genie_spaces` and they cannot be the target of an `ask` call. Use this to retire spaces or to silence noisy ones during incident response without deleting them from the registry.
The description override is preserved across re-syncs. Toggling Active off does not delete the row.
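Taken together, the two curation controls determine what HERC's LLM sees when it picks a space. A minimal sketch, assuming a registry row shaped like the hypothetical `SpaceRow` below; `spaces_visible_to_herc` is likewise an illustrative name, not the actual `list_genie_spaces` tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SpaceRow:
    """Hypothetical registry row; field names are assumptions."""

    space_id: str
    title: str
    description: str  # description synced from Databricks
    description_override: str | None = None  # set by an admin in Datahub
    active: bool = True


def spaces_visible_to_herc(rows: list[SpaceRow]) -> list[dict[str, str]]:
    """Drop inactive spaces and prefer the admin override over the synced description."""
    visible = []
    for row in rows:
        if not row.active:
            continue  # inactive spaces are hidden from HERC entirely
        visible.append(
            {
                "space_id": row.space_id,
                "title": row.title,
                "description": row.description_override or row.description,
            }
        )
    return visible
```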
Roles
Two roles ship with the integration:
| Role | What it grants |
|---|---|
| `databricksgenie.read` | List Genie spaces from the Datahub registry (used by HERC's tools). Bundled into the DatabricksGenie.Viewer role group, along with the platform baselines herc.read and tasks.read. |
| `databricksgenie.manage` | Sync spaces from Databricks and edit registry overrides. Bundled into the DatabricksGenie.Manager role group with the same baselines. |
Administrators inherit both. Grant DatabricksGenie.Viewer to anyone who should be able to ask HERC questions backed by Genie; grant DatabricksGenie.Manager to a smaller group of stewards.
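As a rough illustration of how the role groups expand, the sketch below encodes only what the table states. The dictionary layout and the `effective_roles` helper are hypothetical, and whether the Manager group also carries `databricksgenie.read` is not stated on this page, so it is left out here.

```python
from __future__ import annotations

# Role groups as described in the table above; the layout is illustrative.
ROLE_GROUPS: dict[str, frozenset[str]] = {
    "DatabricksGenie.Viewer": frozenset({"databricksgenie.read", "herc.read", "tasks.read"}),
    "DatabricksGenie.Manager": frozenset({"databricksgenie.manage", "herc.read", "tasks.read"}),
}

# Administrators inherit both integration roles.
ADMIN_ROLES = frozenset({"databricksgenie.read", "databricksgenie.manage"})


def effective_roles(groups: list[str], is_admin: bool = False) -> set[str]:
    """Expand role groups (and admin status) into the set of individual roles."""
    roles: set[str] = set()
    for group in groups:
        roles |= ROLE_GROUPS.get(group, frozenset())
    if is_admin:
        roles |= ADMIN_ROLES
    return roles
```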
What end users need
End users do not need access to the admin Genie spaces page. They need:
- The DatabricksGenie.Viewer role (or admin-equivalent).
- To have completed the per-user Databricks consent for the workspace whose Genie spaces they are asking about. If they have not, HERC will respond with a message asking them to connect their Databricks account — and the platform will create a high-priority Connect to Databricks task in their inbox automatically. Once they consent, every subsequent HERC question backed by Genie will run as them.
See Databricks: per-user authentication (U2M OAuth) for the consent flow.
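The consent check can be thought of as a gate that runs before any Genie call. The sketch below is an assumption-laden illustration: `User`, `Task`, and `consent_gate` are hypothetical, and the deduplication of inbox tasks is a design guess rather than documented behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Task:
    title: str
    priority: str


@dataclass
class User:
    name: str
    consented_workspaces: set[str] = field(default_factory=set)
    inbox: list[Task] = field(default_factory=list)


def consent_gate(user: User, workspace_id: str) -> str | None:
    """Return None when the question may run as the user, else a chat message."""
    if workspace_id in user.consented_workspaces:
        return None
    # Assumption: avoid stacking duplicate tasks if the user asks again.
    if not any(task.title == "Connect to Databricks" for task in user.inbox):
        user.inbox.append(Task(title="Connect to Databricks", priority="high"))
    return "You need to connect your Databricks account first."
```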
How HERC routes Databricks questions
HERC's router uses two signals to decide whether a question should go to Genie:
- Fast routing on keywords. Phrases like "ask databricks", "from databricks", "in the warehouse", "which genie space", and the Dutch equivalents (`uit databricks`, `in onze data`, `vraag aan databricks`) route directly to the Databricks data specialist with no LLM classifier call.
- LLM orchestration when the question is ambiguous. A bare "show me top customers by revenue" could go to the Insights agent (which edits dashboard cards) or the Databricks data agent (which actually queries the warehouse). The orchestration layer's LLM resolves which agent owns the turn based on the surrounding conversation.
The generalist datahub agent has access to the same Genie tools as the Databricks data specialist, so even if routing picks the generalist, HERC can still reach Genie.
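The two-signal routing can be sketched as a keyword fast path with a classifier fallback. The phrase list comes from this page; the agent label `"databricks_data"`, the `route` function, and the `classify` callback are hypothetical stand-ins for HERC's router and orchestration LLM.

```python
from __future__ import annotations

from typing import Callable

# Phrases listed on this page that trigger the fast path.
FAST_PHRASES = (
    "ask databricks",
    "from databricks",
    "in the warehouse",
    "which genie space",
    "uit databricks",
    "in onze data",
    "vraag aan databricks",
)


def route(question: str, classify: Callable[[str], str]) -> str:
    """Return an agent label; the classifier is called only when no phrase matches."""
    text = question.casefold()
    if any(phrase in text for phrase in FAST_PHRASES):
        return "databricks_data"  # hypothetical label for the Databricks data specialist
    return classify(question)  # ambiguous question: defer to the LLM orchestrator
```

Checking the phrases first keeps the common explicit requests free of a classifier round-trip, which is the point of the fast path.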
Limitations and known trade-offs
- Sync is manual. Newly-created Genie spaces are invisible to HERC until an admin clicks Sync. We chose this so admins can review spaces before HERC steers questions at them; we may add a nightly auto-sync in a later release.
- Tool result row cap. HERC returns at most 10 sample rows from each Genie answer. For larger result sets, users should open the space directly in Databricks and use the SQL Genie produced.
- No write-back. HERC's Genie integration is read-only. HERC cannot create, modify, or delete Genie spaces from Datahub.
- No new Genie spaces from Datahub. Spaces are authored in Databricks; Datahub keeps a registry, not a builder.
- 90-second wall-clock cap per question. Most Genie answers complete in 10-60 s. After 90 s HERC stops waiting and tells the user, because past that point the cost of holding the chat turn open outweighs the value of waiting longer. Heavy queries should run interactively in Databricks.
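A wall-clock cap like the 90-second limit above is commonly implemented as a polling loop with a deadline. The sketch below shows that general pattern, not HERC's code: `wait_for_answer`, `GenieTimeout`, the 2-second polling interval, and the `poll` callback are all assumptions. The clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
from __future__ import annotations

import time
from typing import Callable

GENIE_TIMEOUT_S = 90.0  # cap stated on this page; the constant name is hypothetical


class GenieTimeout(Exception):
    """Raised when no answer arrives within the wall-clock cap."""


def wait_for_answer(
    poll: Callable[[], str | None],
    timeout_s: float = GENIE_TIMEOUT_S,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until `poll` returns an answer or the deadline would be exceeded."""
    deadline = clock() + timeout_s
    while True:
        answer = poll()
        if answer is not None:
            return answer
        if clock() + interval_s > deadline:
            # Another sleep would overshoot the cap, so give up now.
            raise GenieTimeout(f"no Genie answer within {timeout_s:.0f} s")
        sleep(interval_s)
```

Using a monotonic clock keeps the deadline stable even if the system time is adjusted while the question is running.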
Troubleshooting
| Symptom | What's happening | What to do |
|---|---|---|
| HERC says "No Genie spaces are registered yet." | No rows in the Datahub registry. | Sync from Administration → Genie spaces, then retry. |
| HERC says "You need to connect your Databricks account first." | The asking user has not yet consented for the workspace that owns the Genie space. | Open the Connect to Databricks task that just appeared in your inbox, or visit /databricks/consent/{integration-id}. |
| HERC says "I don't see a Genie space covering …" | The registry has spaces, but none of their descriptions or tables match the question's domain. | Ask an admin to either edit a space's description override to broaden HERC's understanding, or to register a new Genie space in Databricks covering that domain and sync it in. |
| Genie answer takes > 90 s and times out | Cold warehouse, pathological query, or large scan. | Retry once (warehouses cache after the first run). For repeated timeouts, simplify the question or run it directly in Databricks. |
| Genie returns a FAILED status with a SQL error verbatim | Genie could not produce a valid query. | HERC surfaces the error verbatim. The user can rephrase, or an admin can refine the space's description / sample questions in Databricks so future questions land cleaner. |
See also
- Databricks: per-user authentication (U2M OAuth) — the consent flow Genie relies on
- HERC — the chat surface that calls Genie
- Connected apps — where Databricks workspaces are configured