Organisation DNA: the platform's knowledge graph

Organisation DNA is the platform's learned understanding of your data landscape. It reads everything Datahub knows — glossary terms, catalog assets, contracts, products, processes, events, policies — and builds a knowledge graph that powers conversational AI, semantic search, and exploration.

When HERC answers a vague question like "how does customer churn propagate through our products?", DNA is what makes the answer specific to your organisation rather than generic.

When to choose this

Open DNA when you want to:

Explore the connections in your platform. Click an entity → see its neighbours → click again → walk the graph.
Audit knowledge coverage. "Which entities does the AI know about?" — DNA shows the entity count, type breakdown, and source-module coverage.
Trigger a re-index after a big change. A bulk import, a new domain, a contract restructure — re-index so the AI catches up.
Diagnose vague AI answers. If HERC keeps missing things, the index (the "brain") may be stale or thin. DNA shows the run history and the index size.
Run free-text semantic search. Ask "what entities relate to churn?" and get a ranked list across all modules.

You do not need this module to:

Make HERC work day-to-day — DNA runs in the background.
Browse the catalog — that's Data Catalog.
Add new terms — that's Business Glossary.

What DNA looks like

Surface	Where	What you see
Overview	`/dna`	Index status (running / done / failed), entity count, growth delta vs previous run, type distribution donut, source module coverage, run history, Re-index button.
Search	`/dna` → Search	Free-text input → ranked entity hits with scores.
Graph explorer	`/dna` → Graph	Interactive React Flow canvas: nodes coloured by type, edges show relationships, click-and-drag to pan, zoom to focus.
Entity detail	`/dna` → click entity	Name, type, description, embedding stats, neighbours list, source documents (term / asset / contract / …).

How DNA works

DNA is built on Microsoft GraphRAG. The pipeline:

Collect. A background job reads governed entities from across the platform — every glossary term, every catalog asset, every contract, every product, every process, every event, every policy. It chunks the text and adds it to a corpus.
Index. GraphRAG runs in a detached subprocess (so it survives uvicorn reloads) and produces a graph: extracted entities, relationships between them, and Leiden community clusters with hierarchical summaries.
ETL. The Parquet outputs are loaded into PostgreSQL — entities, relationships, communities, text units — so the platform can query the graph at low latency.
Serve. HERC's specialist agents read community summaries on every prompt; the deep "search the graph" tool is available on demand.
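
The Index step above relies on a detached subprocess. A minimal sketch of that idea, assuming a POSIX host: the function name `launch_detached` and the stand-in command are hypothetical, and the platform's real launcher is not shown in this document.

```python
import subprocess
import sys


def launch_detached(cmd: list[str]) -> subprocess.Popen:
    """Start `cmd` in its own session so a reload of the parent does not take it down."""
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: the child calls setsid() and leaves our process group
    )


if __name__ == "__main__":
    # Stand-in for the real indexing command, which the source does not show.
    proc = launch_detached([sys.executable, "-c", "print('indexing...')"])
    print("started pid", proc.pid)
```

Because the child runs in its own session, signals sent to the web server's process group (for example by a reloader) should not reach it. The caller still needs some way to track the run afterwards, such as a status row in the database.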

After the first run, subsequent runs are incremental: only new or changed entities are processed and merged into the existing graph. A weekly Procrastinate job (Sunday 02:00) checks whether the change threshold (50+ modified entities) has been reached and, if so, triggers a re-index automatically.
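
The incremental selection and the weekly threshold check can be sketched as follows. This is an illustration under assumptions: the function names, the `updated_at` mapping, and the cron string are hypothetical, and only the threshold of 50 and the Sunday 02:00 schedule come from the text above.

```python
from datetime import datetime, timedelta, timezone

CHANGE_THRESHOLD = 50      # "50+ modified entities"
WEEKLY_CRON = "0 2 * * 0"  # Sunday 02:00 written in five-field cron syntax (assumed form)


def changed_since(updated_at: dict[str, datetime], last_run: datetime) -> list[str]:
    """Return ids of entities created or modified after the last successful run."""
    return sorted(eid for eid, ts in updated_at.items() if ts > last_run)


def should_auto_reindex(changed_ids: list[str], threshold: int = CHANGE_THRESHOLD) -> bool:
    """True when enough entities changed to justify an automatic re-index."""
    return len(changed_ids) >= threshold


if __name__ == "__main__":
    last_run = datetime(2024, 1, 7, 2, 0, tzinfo=timezone.utc)
    # 70 hypothetical entities, some modified before and some after the last run.
    updated_at = {f"term-{i}": last_run + timedelta(hours=i - 10) for i in range(70)}
    changed = changed_since(updated_at, last_run)
    print(len(changed), should_auto_reindex(changed))
```

Only the ids returned by `changed_since` would need to be re-processed and merged; everything else in the graph stays as it was.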

Setup: what an admin needs to do once

Prereq	Where	Why
AI provider key	`/admin/integrations` → AI Provider Keys	GraphRAG calls the provider for entity extraction + embeddings.
Roles	`/rolegroups`	`dna.read` to view the dashboard / search / graph; `dna.write` to trigger a re-index.
Initial index	`/dna` → Re-index	Built automatically on first start, but admins can trigger one manually after a large initial import.
(Optional) Private AI	License	The Private AI license routes completion to the platform model; embeddings still use your customer key for cost optimisation. See AI platform.

Triggering a re-index

`/dna` → Re-index. The button is gated on `dna.write` and is disabled while a run is pending or running. The job is deferred to Procrastinate, so it survives a uvicorn restart between the API acknowledgement and the start of indexing.
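
The gating rule for the Re-index button can be expressed as a small predicate. This is a sketch only: `can_trigger_reindex` and its arguments are hypothetical names, while the `dna.write` permission and the pending / running states come from the text above.

```python
from typing import Optional

ACTIVE_STATES = {"pending", "running"}


def can_trigger_reindex(permissions: set[str], latest_run_status: Optional[str]) -> bool:
    """Allow a re-index only for users with dna.write and when no run is in flight."""
    if "dna.write" not in permissions:
        return False
    # None means no run has been recorded yet, so nothing is in flight.
    return latest_run_status not in ACTIVE_STATES


if __name__ == "__main__":
    print(can_trigger_reindex({"dna.read", "dna.write"}, "failed"))
    print(can_trigger_reindex({"dna.read", "dna.write"}, "running"))
    print(can_trigger_reindex({"dna.read"}, None))
```

Applying the same check on the server side, and not only in the UI, would keep a second request from queueing a duplicate run.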

When to trigger manually:

After a bulk import of glossary terms or catalog assets.
After publishing a new contract or restructuring an existing one.
After a major taxonomy change (renamed domains, new tag scheme).
If HERC's answers feel stale — check the Last successful run timestamp.

A run typically takes 1–10 minutes depending on corpus size and provider latency. Progress streams to the dashboard.

What's in the dashboard

Card	What it tells you
Status	Idle / pending / running / failed. If failed, click for the actionable error message.
Entity count	Total extracted entities; growth delta vs previous run.
Type distribution	Donut: which entity types dominate (e.g. concept, organisation, system, role).
Source module coverage	Bar chart: how many entities each module contributed (glossary, catalog, contracts, …). Use this to spot under-indexed modules.
Recent runs	Table: each run's status, duration, cost, entity / relationship counts.
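
To make the Entity count and Source module coverage cards concrete, here is a rough sketch of how such numbers could be derived from a list of entities. The record shape (`source_module`) and the function names are assumptions for illustration, not the platform's schema.

```python
from collections import Counter


def coverage_by_module(entities: list[dict]) -> Counter:
    """Count how many entities each source module contributed."""
    return Counter(entity["source_module"] for entity in entities)


def growth_delta(current_count: int, previous_count: int) -> int:
    """Entities gained (positive) or lost (negative) versus the previous run."""
    return current_count - previous_count


def under_indexed(coverage: Counter, expected_modules: list[str], minimum: int = 1) -> list[str]:
    """Modules that contributed fewer than `minimum` entities."""
    return sorted(m for m in expected_modules if coverage.get(m, 0) < minimum)


if __name__ == "__main__":
    entities = [
        {"name": "Churn", "source_module": "glossary"},
        {"name": "Customer", "source_module": "glossary"},
        {"name": "orders_table", "source_module": "catalog"},
    ]
    coverage = coverage_by_module(entities)
    print(dict(coverage))
    print(under_indexed(coverage, ["glossary", "catalog", "contracts"]))
    print(growth_delta(len(entities), 2))
```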

The graph explorer

The graph view is a React Flow canvas:

Nodes are entities, coloured by type.
Edges are relationships extracted by GraphRAG (e.g. Customer → uses → Product).
Click a node to see its neighbours, type, description, and source.
Drag to pan, scroll to zoom, double-click an entity to focus on it.

The graph is intentionally not a full UML diagram — it's the AI's understanding of your business, useful for sanity-checking that the right concepts are connected.
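
Walking the graph from one entity to its neighbours is a breadth-first traversal over the extracted relationships. The sketch below assumes relationships are available as `(source, label, target)` triples and treats edges as undirected for exploration; both choices are assumptions, and the relationship labels other than `uses` are made up for the example.

```python
from collections import defaultdict, deque


def build_adjacency(relationships: list[tuple[str, str, str]]) -> dict[str, set[str]]:
    """Index (source, label, target) triples by entity, in both directions."""
    adjacency: dict[str, set[str]] = defaultdict(set)
    for source, _label, target in relationships:
        adjacency[source].add(target)
        adjacency[target].add(source)  # assumption: explore edges in either direction
    return adjacency


def walk(adjacency: dict[str, set[str]], start: str, max_hops: int) -> dict[str, int]:
    """Breadth-first walk: map each reachable entity to its hop distance from start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the requested radius
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return seen


if __name__ == "__main__":
    triples = [
        ("Customer", "uses", "Product"),
        ("Product", "governed_by", "Contract"),
        ("Contract", "owned_by", "Team"),
    ]
    print(walk(build_adjacency(triples), "Customer", max_hops=2))
```

One hop corresponds to a single click on a node; increasing `max_hops` corresponds to clicking outward again from each neighbour.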

How DNA powers HERC

Every conversation HERC has uses DNA in two ways:

Lightweight context injection. The community summaries (multi-level abstractions of your knowledge graph) are added to the agent prompt automatically. This makes HERC's answers organisation-specific without you having to think about it.
On-demand deep query. When a user asks something specific ("how does churn relate to subscription tier?"), HERC can call the `brain.search` MCP tool, which runs a hybrid semantic + keyword query against the graph and returns the most relevant entities, relationships, and source text.

This is why DNA quality directly affects answer quality — a thin or stale brain produces thin answers.
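
To illustrate what a hybrid semantic + keyword query means in general, the sketch below blends cosine similarity over embeddings with a simple word-overlap score. It is not the platform's ranking function: the weighted-sum blend, the `alpha` parameter, the toy two-dimensional vectors, and every name in the code are assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of query words that appear verbatim in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


def hybrid_rank(query: str, query_vec: list[float], entities: list[dict], alpha: float = 0.5):
    """Score each entity as alpha * semantic + (1 - alpha) * keyword, best first."""
    scored = []
    for entity in entities:
        semantic = cosine(query_vec, entity["vec"])
        lexical = keyword_overlap(query, entity["text"])
        scored.append((alpha * semantic + (1 - alpha) * lexical, entity["name"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


if __name__ == "__main__":
    entities = [
        {"name": "Invoice", "vec": [0.0, 1.0], "text": "monthly invoice"},
        {"name": "Churn", "vec": [1.0, 0.0], "text": "customer churn rate"},
    ]
    for score, name in hybrid_rank("churn", [0.9, 0.1], entities):
        print(f"{name}: {score:.3f}")
```

The keyword term is what rescues literal matches when embeddings are weak, and the semantic term is what can surface paraphrases that share no words with the query.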

Limitations

Limit	Why	Workaround
Re-indexing has a cost (provider calls).	Entity extraction and embeddings both call the AI provider.	Incremental updates keep the cost down; full rebuilds are rare.
GraphRAG is one-graph-per-tenant.	Single workspace per tenant.	If you operate multiple business units, consider separate tenants.
Search recall is the platform's biggest open quality area.	Naïve substring strategy today; semantic embeddings are improving.	A reproducible benchmark harness is in place; recall improves with each release.
DNA doesn't index transcripts or documents yet.	Coming.	Tag transcripts with the right domain; the catalog handles document indexing for now.
The graph rebuilds from scratch only when explicitly requested.	Stable IDs across runs.	Re-index manually after a major restructure.

Audit & compliance

Question a CISO might ask	Where to look
"Where does indexing run?"	In a detached subprocess on the platform's compute. The provider is called for embeddings and entity extraction.
"Are credentials stored on disk?"	No. The runtime config uses `${...}` placeholders; real credentials are passed via subprocess env vars.
"Can a viewer see entities they don't have access to?"	The graph is platform-wide, but specialist agents filter results by the user's role. The graph explorer is gated on `dna.read`.
"Where is the index stored?"	PostgreSQL (`dna.brain_*` tables) + a workspace folder for GraphRAG's internal state.
"Did any data leave the tenant?"	Data is sent to the provider chosen under `/admin/integrations`. With Private AI, completion stays on the platform; embeddings can still go to OpenAI to keep costs down.
"Who triggered this re-index?"	Run history shows the user + timestamp.

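The credentials answer in the table above (placeholders on disk, real values in the subprocess environment) can be sketched with the standard library. The placeholder name `DNA_PROVIDER_KEY`, the config line, and both function names are hypothetical; the actual config keys are not shown in this document.

```python
import os
import string
import subprocess
import sys

# What might sit on disk: a placeholder, never the secret itself.
CONFIG_ON_DISK = "api_key: ${DNA_PROVIDER_KEY}\n"


def resolve_placeholders(config_text: str, env: dict[str, str]) -> str:
    """Substitute ${NAME} placeholders from env; raises KeyError if one is missing."""
    return string.Template(config_text).substitute(env)


def run_with_secret(cmd: list[str], name: str, value: str) -> subprocess.CompletedProcess:
    """Run cmd with the secret injected into the child's environment only."""
    child_env = {**os.environ, name: value}
    return subprocess.run(cmd, env=child_env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # The child reads the key from its environment; nothing is written to disk.
    child = "import os; print(len(os.environ['DNA_PROVIDER_KEY']))"
    result = run_with_secret([sys.executable, "-c", child], "DNA_PROVIDER_KEY", "sk-demo")
    print(result.stdout.strip())
```

The file on disk stays safe to inspect or back up, because the secret exists only in the memory of the parent process and in the environment of the child.
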
Troubleshooting

Symptom	Likely cause	Fix
Index status stuck on Running after a server restart	Stale run from a crashed previous process.	Cleanup runs automatically on start; if not, click Retry indexing.
First indexing fails with "missing API key"	Provider key not configured.	`/admin/integrations` → AI Provider Keys.
Search returns nothing for a paraphrase	Recall on paraphrases is the known open area.	Add the literal terms to a glossary term description so the index picks them up.
Entity count flat after re-index	No qualifying changes since the last run.	Check the run history: did anything new land?
Re-index button greyed out	A run is pending or running.	Wait for it to finish; check the dashboard.
Graph explorer is empty	Brain hasn't been indexed yet.	Trigger a re-index.
"Server restarted during indexing" error message	Crash or OOM during a previous run.	Click Retry indexing. The startup cleanup writes this message; it's safe to re-run.