Skip to main content
Version: 0.4.0

Text Chunking

Build flags:

cargo build --release -p skardi-server --features chunking   # just chunk()
cargo build --release -p skardi-server --features rag # chunk() + embedding UDFs

The rag umbrella feature bundles embedding and chunking so the chunk → embed → write loop ships in one flag. The pre-built Docker image ghcr.io/skardilabs/skardi/skardi-server-rag:<tag> includes both.

chunk is a DataFusion scalar UDF that splits text into smaller pieces directly inside SQL, so document ingestion can chunk inline alongside embedding and writing — no out-of-band Python step. It wraps the text-splitter crate.

Function Signature

chunk(mode, text, size [, overlap]) -> List<Utf8>
ArgumentTypeDescription
modestring literalSplitter to use. Supported: 'character', 'markdown'.
textUtf8The text to split. May be a literal, a scalar subquery, or a column reference. NULL rows pass through unchanged.
sizeinteger literalTarget maximum chunk length in characters. Must be > 0.
overlapinteger literalOptional. Characters of overlap between adjacent chunks. Must be strictly less than size. Defaults to 0.

Returns: List<Utf8> — one element per chunk. A NULL input row produces a NULL list element.

Use UNNEST(chunk(...)) to expand the list into one row per chunk.

Modes

ModeBehavior
'character'Generic recursive splitter: tries paragraph → sentence → word → grapheme boundaries before falling back to character splits. Use for arbitrary plain text.
'markdown'Markdown-aware: prefers heading / paragraph / code-block boundaries so chunks stay semantically coherent. Use when the input is Markdown (READMEs, wiki pages, knowledge bases).

Both modes count length in characters (Unicode scalar values), not bytes or tokens.

Token-based and code-aware splitters are planned follow-ups. See the roadmap.

Examples

Split a literal

SELECT UNNEST(chunk('character', 'a long document body...', 1000)) AS piece;

Per-row chunking over a table

SELECT id, UNNEST(chunk('markdown', body, 1000, 200)) AS chunk_text
FROM docs;

Each row in docs produces one row per chunk. id repeats; chunk_text carries the split.

Inline ingestion: chunk → embed → write

The motivating use case — RAG ingest expressed as one SQL statement, no application code in between:

INSERT INTO doc_chunks
SELECT
doc_id,
chunk_text,
candle('models/bge-small-en-v1.5', chunk_text) AS embedding
FROM (
SELECT
doc_id,
UNNEST(chunk('markdown', body, 1000, 200)) AS chunk_text
FROM raw_docs
);

Requires --features rag (or --features "chunking,candle" if you want to pick à la carte).

Keep chunks as a list column

Skip the UNNEST to keep all chunks for a document together:

SELECT id, chunk('character', body, 500) AS chunks
FROM articles;

chunks is a List<Utf8> you can index, count with array_length, or expand later.

As a pipeline

params:
source_table: { type: string }
chunk_size: { type: integer, default: 1000 }
overlap: { type: integer, default: 200 }
model: { type: string, default: "models/bge-small-en-v1.5" }
sql: |
WITH split AS (
SELECT id,
UNNEST(chunk('markdown', body, ${chunk_size}, ${overlap})) AS text
FROM ${source_table}
)
SELECT id, text, candle('${model}', text) AS embedding
FROM split

Key Behaviors

  • Splitter is built once per batch. Per-row construction would be wasteful; the underlying TextSplitter / MarkdownSplitter is constructed once and reused for every row in the batch.
  • NULL text → NULL list element. Null inputs do not error and do not produce empty lists.
  • size and overlap must be literal integers. They are read once per call, not per row.
  • overlap < size is enforced. Equal or greater overlap returns an execution error rather than looping.
  • Subqueries work transparently. Because chunk is a scalar UDF, scalar subqueries collapse to a single value before invocation, and relational subqueries simply feed rows into the parent scan — no special syntax needed.

Troubleshooting

"chunk: unsupported mode '<x>'" — only 'character' and 'markdown' are supported today.

"chunk: 'overlap' (N) must be strictly less than 'size' (M)" — pick overlap < size. Setting overlap = 0 is always valid.

"chunk expects 3 or 4 arguments" — the call shape is chunk(mode, text, size [, overlap]).

Long literal embedded into SQL fails to parse — large text passed as a SQL string literal must escape single quotes. Pass via a parameter / scalar subquery / column reference for anything beyond toy examples.