Catalog Semantics
A semantics overlay attaches natural-language descriptions to the tables and columns already registered through a context file. Both binaries consume it:
skardi-serverloads it at startup and emits the descriptions onGET /data_sourceso an agent can read them when picking a tool.skardi query --schemarenders the descriptions inline next to each table and column, for human inspection.
This page documents the YAML shape, how the loader finds it, the override / fallback rules, and where the descriptions surface.
Why
Raw column: Utf8 schemas are not enough for an agent to pick the right
tool. The model needs to know what the column holds — price_usd is
"retail price in USD", slug is the "URL-stable identifier", and so
on. Stuffing those descriptions inline in ctx.yaml works for a single
table but does not scale: catalog-mode sources have many tables, and
auto-generated descriptions (e.g. from the
auto_knowledge_base
skill) want their own file so they don't pollute hand-curated config.
A separate kind: semantics resource is the answer: same envelope as
context / pipelines / jobs, hot-pluggable at startup, freely composable
across multiple files.
File shape
kind: semantics
metadata:
name: basic-semantics
version: 1.0.0
spec:
sources:
- name: products # must match a data_sources[].name in the ctx
description: "Product catalog with pricing/inventory. One row per SKU."
columns:
- name: id
description: "Stable internal SKU; primary key."
- name: price
description: "Retail price in USD."
spec.sources[] is a flat list of overlays. The name field is either:
- A bare source name — cross-references
data_sources[].namefrom the ctx. For table-mode sources (CSV, Parquet, Lance, Iceberg…) where the source name is the physical table name, this is the only form you need. For catalog-mode sources (Postgres / MySQL / SQLite registered withhierarchy_level: catalog), the bare entry applies as a broad fallback to every inner table. - A fully-qualified DataFusion path
catalog.schema.table— targets one specific physical table. Useful for catalog-mode sources where a single ctx registration spawns many inner tables and you want per-inner-table descriptions.
spec:
sources:
# 1-part: broad fallback for the whole `mydb` catalog.
- name: mydb
description: "Internal application DB"
# 3-part: targets a specific inner table. Wins over the bare entry
# above for `mydb.public.users` only.
- name: mydb.public.users
description: "Auth + profile data, one row per registered account"
columns:
- name: id
description: "User ID (auth.users.id)"
- name: mydb.public.orders
description: "Submitted orders"
columns:
- name: id
description: "Order number, monotonic"
Names with anything other than 1 or 3 dot-separated segments are a
hard error (e.g. schema.table, a.b.c.d, or empty segments like
mydb..users). Semantics for an unknown source / catalog are warned
about (not failed) at load time so a stale overlay does not brick a
partially-rebooted server.
description and columns are both optional — supply only what you
have. Unknown columns are not reported (the merge runs at request time
against the live Arrow schema).
A complete worked example lives at
docs/basic/semantics.yaml, paired with the
existing docs/basic/ctx.yaml.
Loading
# server
skardi-server \
--ctx ctx.yaml \
--pipeline pipelines/ \
--semantics semantics/ \ # optional; auto-discovered next to ctx if omitted
--port 8080
# CLI
skardi query --ctx ctx.yaml --schema --all
skardi query --ctx ctx.yaml --schema --all --semantics ./custom/semantics.yaml
Both binaries follow the same resolution order:
- Explicit
--semantics <path>— used directly. Accepts either a single yaml file or a directory. - Auto-discovered
<ctx_dir>/semantics/(directory) — every*.yaml/*.ymlat one level is scanned, in alphabetical order. - Auto-discovered
<ctx_dir>/semantics.yaml(single file). - None — the catalog falls back to
data_sources[].descriptiononly (see Fallback below).
When --semantics points at a directory, files whose root kind: is
not semantics are silently skipped, so a single shared config
directory can mix pipeline / job / context / semantics yamls. A single
file passed explicitly with the wrong or missing kind is also a soft
skip — same behavior as --jobs.
Auto-discovery collision: defining both
<ctx_dir>/semantics/and<ctx_dir>/semantics.yamlis a hard error at startup. Pick one. Silent shadowing of overlays that drive an agent's catalog view is exactly the bug worth being loud about.
Composition rules
Multiple semantics files may be merged into one registry. Bare and qualified entries live in different addressing spaces — they're not duplicates of each other even when they describe the same physical table. The rules:
| Situation | Behavior |
|---|---|
Two files share the same key (same bare name:, or same catalog.schema.table) at table or column level | Hard error at startup. Both file paths are reported. |
One file has name: mydb, another has name: mydb.public.users | Both kept. The qualified entry wins for mydb.public.users; the bare entry covers every other inner table. |
name: has 0, 2, or 4+ dot-separated segments (or any empty segment) | Hard error at startup. |
| A file references an unknown source / catalog | Warning. The entry is kept in the registry but never matches. |
A file is named explicitly (--semantics file.yaml) and is missing kind: semantics | Soft skip — same as a non-semantics file in a directory scan. |
The duplicate-is-error rule keeps auto-generated overlays composable: each file owns its own slice of the catalog and never silently overwrites a sibling.
Fallback / precedence
data_sources[] in ctx.yaml already accepts a free-text description
field:
spec:
data_sources:
- name: products
type: csv
path: data/products.csv
description: "Product catalog dataset"
That value is the table-level ctx-inline fallback — used when no semantics overlay supplies one. Column-level descriptions have no ctx fallback; they live only in semantics files.
The merge precedence (most-specific wins):
- Qualified semantics overlay —
name: catalog.schema.table(table or column). - Bare semantics overlay —
name: <source>(table or column). data_sources[].description(table-level only).- None — the field is omitted from the JSON response.
Steps 1 and 2 cooperate: a qualified entry only covers what it addresses; the bare entry continues to apply to anything the qualified entry didn't touch. So writing both forms is the normal path for catalog-mode sources where most inner tables share a description but a few want their own.
Where it shows up
skardi query --schema
The CLI renders the merged view inline next to each table and column.
A -- separator carries the description; lines without an overlay or
fallback render bare, so existing scripts that parse the output keep
working.
$ skardi query --ctx ./ctx.yaml --schema --all
table: products -- Product catalog with pricing/inventory. One row per SKU.
id: Int64 -- Stable internal SKU; primary key.
brand: Utf8
price: Float64 -- Retail price in USD.
No flag is needed to opt in: if a kind: semantics overlay is
discovered (or data_sources[].description is set), the descriptions
appear automatically.
GET /data_source (server)
The catalog endpoint returns the merged view:
curl http://localhost:8080/data_source
{
"success": true,
"count": 1,
"data": [
{
"name": "products",
"type": "csv",
"path": "data/products.csv",
"tables": [
{
"name": "products",
"description": "Product catalog with pricing/inventory. One row per SKU.",
"schema": [
{ "name": "id", "type": "Int64", "nullable": false, "description": "Stable internal SKU; primary key." },
{ "name": "brand", "type": "Utf8", "nullable": false },
{ "name": "price", "type": "Float64", "nullable": false, "description": "Retail price in USD." }
]
}
]
}
],
"timestamp": "..."
}
description is omitted from the JSON when no overlay or fallback is
present, so the wire shape stays clean for sources that opt out.
Limitations
GET /data_sourcestill emits one table per data source (the source name is the table name in the JSON response). Catalog-mode sources expose many inner tables, but the HTTP endpoint doesn't enumerate them yet — so qualifiedcatalog.schema.tableoverlays defined for inner tables won't surface on the endpoint until the endpoint is extended. The CLI (skardi query --schema --all) does enumerate inner tables and renders qualified overlays correctly today.- There is no agent-callable
describeverb yet. Agents reach the semantics through the HTTP endpoint above; a pipeline form is a separate task on the roadmap.
Next
- Server — full flag reference and lifecycle.
- Catalog mode — registering an entire database as a DataFusion catalog.
- Spark for Agents — why this primitive exists.