Embedding Inference
Skardi can generate text embeddings inline in SQL via three different backends, and combine them with lance_knn() for end-to-end semantic search — query is embedded, nearest documents are retrieved, results stream back — all in one SQL statement, with no application code in between.
SELECT id, title, content, _distance
FROM lance_knn(
'doc_embeddings',
'embedding',
candle('models/generated/bge-small-en-v1.5', {query}), -- embed the query
10 -- top-k
)
ORDER BY _distance
LIMIT 10
The shape is always the same: an embedding UDF (candle, gguf, or remote_embed) takes the user's text, returns a List<Float32>, and lance_knn uses that vector to find the nearest pre-embedded documents in a Lance dataset.
Choosing a Backend
| Backend | UDF | When to use it | Build feature |
|---|---|---|---|
| Candle | candle(model_dir, text) | Local SafeTensors models (HuggingFace BERT/RoBERTa/DistilBERT/Jina). No network, no API key, fastest cold start once cached. Best default for self-hosted setups. | --features candle |
| GGUF | gguf(model_dir, text) | Local GGUF / llama.cpp models with quantisation (Q4/Q5/Q8/F16). Use when you want to run a larger model with a smaller memory footprint, or when the model only ships in GGUF format (e.g. embeddinggemma). | --features gguf |
| Remote | remote_embed(provider, model, text) | Hosted APIs: OpenAI, Gemini, Voyage, Mistral. Use when you want top-tier model quality without managing weights, GPUs, or memory. Requires an API key and pays per-token. | --features remote-embed |
All three backends produce List<Float32> and plug into lance_knn() the same way — switching backends is a one-line change in the pipeline SQL plus rebuilding the Lance dataset with the matching embedder.
Quick rule of thumb:
- Local + small model + simple deps → Candle.
- Local + larger model or quantised → GGUF.
- No GPU, no model management, willing to pay per call → Remote.
How It Works
- Offline (one-time setup) — A Python script in each backend's directory loads
docs/embeddings/data/docs.csv(a 15-document knowledge base shared across all three backends), embeds every row with the chosen model, and writes a Lance dataset to that backend'sdata/generated/directory. - At query time — The user POSTs
{"query": "...", "k": 10}to the pipeline endpoint. Skardi calls the embedding UDF on{query}, hands the resulting vector tolance_knn, and streams back the top-k nearest documents with their_distance. - Model caching — For the local backends (
candle,gguf), the model is loaded into memory on the first call and reused for every subsequent query. There is no per-request model load.
The shared corpus means you can run all three demos against the same source data and compare results side by side.
Inline Ingestion: Chunk → Embed → Write
The demos above pre-embed documents via Python because the source corpus is small and pre-chunked. For real ingest where each document is too large to embed as a single vector, combine chunk() with the embedding UDF and skip Python entirely:
INSERT INTO doc_chunks
SELECT
doc_id,
chunk_text,
candle('models/bge-small-en-v1.5', chunk_text) AS embedding
FROM (
SELECT doc_id, UNNEST(chunk('markdown', body, 1000, 200)) AS chunk_text
FROM raw_docs
);
Build with --features rag to get embedding UDFs plus the chunk UDF in one flag. See docs/chunk.md for full chunk() semantics, supported modes, and overlap behaviour.
Pipeline Shape
Every backend uses the same parameter shape so pipelines are interchangeable:
| Parameter | Type | Required | Description |
|---|---|---|---|
query | string | Yes | Free-text search query to embed and search with. |
k | integer | Yes | Number of nearest neighbours to return. |
Prerequisites Common to All Backends
- Python 3.12 for the setup scripts. Some backends (Candle) require 3.12 specifically because
onnxruntimehas no wheels for 3.13+. - Build
skardi-serverwith the matching feature — see the table above. You can enable multiple features at once (--features "candle gguf remote-embed") if you want all three backends in the same binary. - Run setup and
skardi-serverfrom the project root, so themodels/anddocs/embeddings/<backend>/data/generated/paths in the pipeline SQL resolve correctly.
Each backend's README has the exact pip install line, model download steps, and any backend-specific gotchas (Gemma licence acceptance for GGUF, API keys for Remote).
Directory Layout
docs/embeddings/
├── README.md — this file
├── data/
│ └── docs.csv — shared 15-document knowledge base
├── candle/
│ ├── README.md — Candle (local SafeTensors) walkthrough
│ ├── ctx.yaml
│ ├── setup.py
│ └── pipelines/
├── gguf/
│ ├── README.md — GGUF (local llama.cpp) walkthrough
│ ├── ctx.yaml
│ ├── setup_gguf.py
│ └── pipelines/
└── remote/
├── README.md — Remote API (OpenAI/Gemini/Voyage/Mistral) walkthrough
├── ctx.yaml
├── setup_remote.py
└── pipelines/
See Also
- docs/lance/ —
lance_knn()reference and indexing options for the vector store side of the pipeline. - docs/pipelines.md — pipeline YAML shape, parameter inference, and the HTTP endpoint contract.
- demo/rag/ — full RAG pipeline that uses one of these embedding backends end-to-end.