Version: 0.4.0

Embedding Inference

Skardi can generate text embeddings inline in SQL via three different backends and combine them with lance_knn() for end-to-end semantic search: the query is embedded, the nearest documents are retrieved, and the results stream back, all in one SQL statement with no application code in between.

SELECT id, title, content, _distance
FROM lance_knn(
  'doc_embeddings',
  'embedding',
  candle('models/generated/bge-small-en-v1.5', {query}),  -- embed the query
  10                                                       -- top-k
)
ORDER BY _distance
LIMIT 10

The shape is always the same: an embedding UDF (candle, gguf, or remote_embed) takes the user's text, returns a List<Float32>, and lance_knn uses that vector to find the nearest pre-embedded documents in a Lance dataset.
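To make that shape concrete, here is a minimal pure-Python sketch of the same two steps: embed the query, then rank pre-embedded documents by distance. It is an illustration only, not Skardi's implementation. `toy_embed`, `l2_distance`, and `knn` are hypothetical names, the toy vectors are not real embeddings, and the use of L2 distance is an assumption rather than a statement about how lance_knn computes _distance.

```python
import math


def toy_embed(text: str, dims: int = 4) -> list[float]:
    """Hypothetical stand-in for an embedding UDF: text in, fixed-length vector out."""
    vec = [0.0] * dims
    for i, ch in enumerate(text.lower()):
        vec[i % dims] += ord(ch) / 1000.0
    return vec


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(docs: list[dict], query_vec: list[float], k: int) -> list[dict]:
    """Return the k documents closest to query_vec, nearest first."""
    scored = [
        {
            "id": d["id"],
            "title": d["title"],
            "_distance": l2_distance(d["embedding"], query_vec),
        }
        for d in docs
    ]
    scored.sort(key=lambda row: row["_distance"])
    return scored[:k]


# "Offline" step: embed every document once, up front.
titles = ["vector search", "model caching", "chunking text"]
docs = [{"id": i, "title": t, "embedding": toy_embed(t)} for i, t in enumerate(titles)]

# "Query time" step: embed the query with the same embedder, then rank.
results = knn(docs, toy_embed("vector search"), k=2)
```

The point of the sketch is that the query and the documents must go through the same embedder, otherwise the distances are meaningless.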

Choosing a Backend

Backend	UDF	When to use it	Build feature
Candle	`candle(model_dir, text)`	Local SafeTensors models (HuggingFace BERT/RoBERTa/DistilBERT/Jina). No network, no API key, fastest cold start once cached. Best default for self-hosted setups.	`--features candle`
GGUF	`gguf(model_dir, text)`	Local GGUF / llama.cpp models with quantisation (Q4/Q5/Q8/F16). Use when you want to run a larger model with a smaller memory footprint, or when the model only ships in GGUF format (e.g. `embeddinggemma`).	`--features gguf`
Remote	`remote_embed(provider, model, text)`	Hosted APIs: OpenAI, Gemini, Voyage, Mistral. Use when you want top-tier model quality without managing weights, GPUs, or memory. Requires an API key and is billed per token.	`--features remote-embed`

All three backends produce List<Float32> and plug into lance_knn() the same way. Switching backends is a one-line change in the pipeline SQL, plus a rebuild of the Lance dataset with the matching embedder, because vectors produced by different models are not comparable.

Quick rule of thumb:

Local + small model + simple deps → Candle.
Local + larger model or quantised → GGUF.
No GPU, no model management, willing to pay per token → Remote.

How It Works

Offline (one-time setup) — A Python script in each backend's directory loads docs/embeddings/data/docs.csv (a 15-document knowledge base shared across all three backends), embeds every row with the chosen model, and writes a Lance dataset to that backend's data/generated/ directory.
At query time — The user POSTs {"query": "...", "k": 10} to the pipeline endpoint. Skardi calls the embedding UDF on {query}, hands the resulting vector to lance_knn, and streams back the top-k nearest documents with their _distance.
Model caching — For the local backends (candle, gguf), the model is loaded into memory on the first call and reused for every subsequent query. There is no per-request model load.

The shared corpus means you can run all three demos against the same source data and compare results side by side.
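The model caching behaviour described above (load on first call, reuse afterwards) follows a common memoisation pattern. The sketch below shows that pattern in Python under stated assumptions: `load_model` and `embed` are hypothetical names, the "model" is a placeholder dict, and nothing here reflects Skardi's actual internals.

```python
from functools import lru_cache

load_count = 0  # counts how many times the expensive load actually runs


@lru_cache(maxsize=None)
def load_model(model_dir: str) -> dict:
    """Hypothetical expensive load; cached per model_dir after the first call."""
    global load_count
    load_count += 1
    return {"model_dir": model_dir}  # placeholder for real model weights


def embed(model_dir: str, text: str) -> list[float]:
    """Placeholder embedding that only demonstrates reuse of the cached model."""
    model = load_model(model_dir)
    return [float(len(text)), float(len(model["model_dir"]))]


# Three queries against the same model directory trigger a single load.
for query in ["first query", "second query", "third query"]:
    embed("models/bge-small-en-v1.5", query)
```

After the loop, `load_count` is 1: only the first query pays the load cost.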

Inline Ingestion: Chunk → Embed → Write

The demos above pre-embed documents via Python because the source corpus is small and pre-chunked. For real ingest where each document is too large to embed as a single vector, combine chunk() with the embedding UDF and skip Python entirely:

INSERT INTO doc_chunks
SELECT
  doc_id,
  chunk_text,
  candle('models/bge-small-en-v1.5', chunk_text) AS embedding
FROM (
  SELECT doc_id, UNNEST(chunk('markdown', body, 1000, 200)) AS chunk_text
  FROM raw_docs
);

Build with --features rag to get embedding UDFs plus the chunk UDF in one flag. See docs/chunk.md for full chunk() semantics, supported modes, and overlap behaviour.
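For intuition about the overlap idea, the sketch below splits text into fixed-size windows that share a tail with the next window. It rests on two assumptions that this page does not confirm: that the arguments 1000 and 200 in the SQL above mean chunk size and overlap, and that they are measured in characters. `chunk_with_overlap` is a hypothetical helper, and it ignores any structure-aware splitting that the 'markdown' mode may perform; docs/chunk.md remains the reference for real behaviour.

```python
def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters, each sharing `overlap`
    characters with the previous window (character-based by assumption)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 2500-character body split with size=1000 and overlap=200.
body = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(body, 1000, 200)
```

With these inputs the windows start at offsets 0, 800, and 1600, so each chunk repeats the last 200 characters of the one before it. That overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.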

Pipeline Shape

Every backend uses the same parameter shape so pipelines are interchangeable:

Parameter	Type	Required	Description
`query`	string	Yes	Free-text search query to embed and search with.
`k`	integer	Yes	Number of nearest neighbours to return.
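A client can build and validate the request body from the table above before sending it. The following is a minimal sketch: `build_search_request` is a hypothetical helper, the validation rules (non-empty query, k of at least 1) are assumptions, and the pipeline endpoint URL is not given on this page, so the sketch stops at producing the JSON body.

```python
import json


def build_search_request(query: str, k: int) -> str:
    """Validate the two required parameters and return the JSON request body."""
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(k, bool) or not isinstance(k, int) or k < 1:
        raise ValueError("k must be an integer >= 1")
    return json.dumps({"query": query, "k": k})


body = build_search_request("how do I cache a model?", 10)
```

Because every backend accepts the same two parameters, the same body can be sent to a Candle, GGUF, or Remote pipeline without changes.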

Prerequisites Common to All Backends

Python 3.12 for the setup scripts. Some backends (Candle) require 3.12 specifically because onnxruntime has no wheels for 3.13+.
Build skardi-server with the matching feature — see the table above. You can enable multiple features at once (--features "candle gguf remote-embed") if you want all three backends in the same binary.
Run setup and skardi-server from the project root, so the models/ and docs/embeddings/<backend>/data/generated/ paths in the pipeline SQL resolve correctly.

Each backend's README has the exact pip install line, model download steps, and any backend-specific gotchas (Gemma licence acceptance for GGUF, API keys for Remote).

Directory Layout

docs/embeddings/
├── README.md                 — this file
├── data/
│   └── docs.csv              — shared 15-document knowledge base
├── candle/
│   ├── README.md             — Candle (local SafeTensors) walkthrough
│   ├── ctx.yaml
│   ├── setup.py
│   └── pipelines/
├── gguf/
│   ├── README.md             — GGUF (local llama.cpp) walkthrough
│   ├── ctx.yaml
│   ├── setup_gguf.py
│   └── pipelines/
└── remote/
    ├── README.md             — Remote API (OpenAI/Gemini/Voyage/Mistral) walkthrough
    ├── ctx.yaml
    ├── setup_remote.py
    └── pipelines/
