Skip to main content
Version: 0.2.0

Lance

This guide covers Skardi's integration with Lance for high-performance vector similarity search and BM25-scored full-text search. It demonstrates:

  • Native ANN (Approximate Nearest Neighbor) search using Lance's Scanner.nearest() API
  • Explicit KNN search via the lance_knn table function
  • Full-text search via the lance_fts table function with inverted indexes
  • Seamless SQL integration for both vector and text search queries

Datasets

Skardi ships with sample Lance datasets under data/:

vec_data.lance

General-purpose vector embeddings for similarity search:

  • id: int64 - Unique identifier
  • vector: fixed_size_list<float>[128] - 128-dimensional embedding vector
  • item_id: int64 - Reference to associated item
  • revenue: double - Revenue associated with the item

test_data.lance

Text dataset with INVERTED FTS index for full-text search:

  • id: int64 - Unique identifier
  • vector: fixed_size_list<float>[128] - 128-dimensional embedding vector
  • item_id: int64 - Reference to associated item
  • revenue: double - Revenue associated with the item
  • description: string - Text description (has INVERTED index with positions)
  • category: string - Category label

To regenerate: python scripts/prepare_fts_test_data.py

movie_embeddings.lance

Movie embeddings for recommendation pipelines:

  • movie_id: int64 - Movie identifier
  • embedding: fixed_size_list<float>[128] - 128-dimensional movie embedding

Components

FileDescription
ctx_lance.yamlContext file registering vec_data.lance and test_data.lance
pipelines/pipeline_lance.yamlKNN similarity search pipeline
pipelines/pipeline_lance_fts.yamlFull-text search pipeline

How It Works

lance_knn Table Function

Use the lance_knn table function for explicit KNN search:

SELECT * FROM lance_knn(table_name, vector_column, query_vector, k, [filter])

Parameters:

  • table_name: Name of the Lance table (string)
  • vector_column: Name of the embedding column (string)
  • query_vector: Query vector as literal array or scalar subquery
  • k: Number of nearest neighbors to retrieve from the ANN index (integer)
  • filter: Optional Lance filter predicate (string)

Both lance_knn and lance_fts support standard SQL clauses:

  • WHERE — predicates are pushed down to Lance for efficient metadata filtering
  • LIMIT — applied after search + filtering to cap the final result set
  • Column projection — only requested columns are returned

Filter Pushdown

WHERE clause predicates are pushed down to Lance, so you can combine KNN search with metadata filters:

-- Find 50 nearest neighbors, filter to electronics, return top 5
SELECT id, category, _distance
FROM lance_knn('items', 'vector', (SELECT vector FROM items WHERE id = 1), 50)
WHERE category = 'electronics'
LIMIT 5

Note: k and LIMIT serve different purposes. k controls how many ANN candidates Lance retrieves from the index. LIMIT truncates the final result set after filtering. When using WHERE filters, set k higher than your desired result count to ensure enough candidates survive filtering.

Query Execution

The table function directly calls Lance's KNN API:

lance_knn(...) -> LanceKnnExec -> Lance Scanner.nearest()

Performance Benefits

  • Without optimization: O(N * D + N log N) - full scan + sort
  • With Lance KNN: O(k log N) - index-based ANN search
  • Typical speedup: 10x-1000x for datasets with N > 100K vectors
Dataset SizeWithout OptimizationWith Lance KNNSpeedup
10K vectors~50ms~5ms10x
100K vectors~500ms~8ms62x
1M vectors~5000ms~15ms333x
10M vectors~50000ms~25ms2000x

Benchmarks: 128-dim vectors, k=10, IVF-PQ index, Intel Core i9

Running the Example

Start the Server

cargo run --bin skardi-server -- \
--ctx docs/lance/ctx_lance.yaml \
--pipeline docs/lance/pipelines/ \
--port 8080

Expected output:

Starting Skardi Online Serving Pipeline Server
CLI Arguments parsed successfully
Pipeline file: Some("docs/lance/pipeline_lance.yaml")
Context file: Some("docs/lance/ctx_lance.yaml")
Port: 8080
Server configuration loaded successfully
Pipeline: lance-vector-similarity-search
Data sources: 1
Server listening on 0.0.0.0:8080

Example 1: Find Similar Items

Find items most similar to item with id=1:

curl -X POST http://localhost:8080/lance-vector-similarity-search/execute \
-H "Content-Type: application/json" \
-d '{
"reference_id": 1,
"k": 50,
"min_revenue": null,
"max_revenue": null
}' | jq .

Response:

{
"success": true,
"data": [
{
"id": 42,
"item_id": 1337,
"revenue": 2500.50,
"distance": 0.125
},
{
"id": 89,
"item_id": 2048,
"revenue": 1800.25,
"distance": 0.187
}
],
"rows": 10,
"execution_time_ms": 8
}

Find similar items with revenue constraints:

curl -X POST http://localhost:8080/lance-vector-similarity-search/execute \
-H "Content-Type: application/json" \
-d '{
"reference_id": 1,
"k": 50,
"min_revenue": 1000.0,
"max_revenue": 5000.0
}' | jq .

Instead of looking up a reference vector by ID, you can pass the query vector directly in the request. This is useful when you already have an embedding from an external model.

Pipeline: pipelines/pipeline_lance_direct_vector.yaml

curl -X POST http://localhost:8080/lance-direct-vector-search/execute \
-H "Content-Type: application/json" \
-d '{
"query_vector": [0.0, 16.0, 35.0, 5.0, 32.0, ...],
"k": 10
}' | jq .

The query_vector parameter accepts a JSON array of floats matching the vector dimension of the dataset (128 for the sample dataset). The server converts the array to a SQL literal for lance_knn.

A test script is provided to read a real vector from the dataset and send it:

python docs/lance/test_direct_vector_search.py

KNN Pipeline Parameters

ParameterTypeRequiredDescription
reference_idintegerYesID of the reference item to find similar items for
kintegerYesNumber of nearest neighbours to retrieve from the ANN index
min_revenuedoubleNoMinimum revenue filter via WHERE pushdown (null = no filter)
max_revenuedoubleNoMaximum revenue filter via WHERE pushdown (null = no filter)

SQL Query Patterns

Find similar items using a subquery for the reference vector:

SELECT knn.id, knn.item_id, knn._distance as dist
FROM lance_knn(
'sift_items',
'vector',
(SELECT vector FROM sift_items WHERE id = 1),
10
) knn
WHERE knn.id != 1

Using CTE for Reference Vector

WITH ref AS (
SELECT vector FROM sift_items WHERE id = 1
)
SELECT knn.id, knn.item_id, knn._distance
FROM lance_knn('sift_items', 'vector', (SELECT vector FROM ref), 10) knn
WHERE knn.id != 1

With Inline Filter Parameter

The optional 5th argument applies a Lance filter predicate during ANN retrieval:

SELECT *
FROM lance_knn(
'sift_items',
'vector',
(SELECT vector FROM sift_items WHERE id = 1),
10,
'revenue > 1000'
)

With WHERE Clause Filter Pushdown

WHERE clause predicates are pushed down to Lance for efficient post-retrieval filtering:

SELECT id, category, _distance
FROM lance_knn(
'sift_items',
'vector',
(SELECT vector FROM sift_items WHERE id = 1),
50
)
WHERE category = 'electronics' AND revenue > 1000
LIMIT 10

Combining Inline Filter and WHERE Clause

Both filters are combined with AND:

SELECT id, category, _distance
FROM lance_knn(
'sift_items',
'vector',
(SELECT vector FROM sift_items WHERE id = 1),
50,
'revenue > 500'
)
WHERE category = 'electronics'
LIMIT 5

Movie Recommendation (Federated: Lance + PostgreSQL)

The pipeline_movie_recommendation.yaml demonstrates a more advanced use case — finding similar movies via Lance KNN, then ranking them with an ONNX model, and joining with a PostgreSQL movies table for metadata:

WITH knn_results AS (
SELECT knn.movie_id
FROM lance_knn(
'movie_embeddings',
'embedding',
(SELECT embedding FROM movie_embeddings WHERE movie_id = (SELECT movie_id FROM movies WHERE title = {last_watched_movie})),
10
) knn
)
SELECT m.title, m.genres, rr.prediction_score
FROM ranked_recommendations rr
JOIN movies m ON rr.movie_id = m.movie_id
ORDER BY rr.prediction_score DESC
LIMIT {top_n}

lance_fts Table Function

Use the lance_fts table function for BM25-scored full-text search:

SELECT * FROM lance_fts(table_name, text_column, query, limit)

Parameters:

  • table_name: Name of the Lance table with an INVERTED index (string)
  • text_column: Name of the text column to search (string)
  • query: Search query string (string)
  • limit: Maximum number of results (integer)

Results include a _score column (Float32) where higher values = more relevant.

Query Syntax

SyntaxTypeDescription
foo barTerm (OR)Matches documents containing any term
+foo barTerm (AND)All terms must be present
"foo bar"PhraseExact phrase match (requires index with positions)
foo~ / foo~2FuzzyTypo-tolerant matching
+foo -barBooleanMUST contain foo, MUST NOT contain bar
+foo barBooleanMUST contain foo, SHOULD contain bar

FTS Filter Pushdown

WHERE clause predicates are pushed down to Lance for efficient metadata filtering:

SELECT id, description, _score
FROM lance_fts('fts_data', 'description', 'premium wireless', 10)
WHERE category = 'electronics' AND revenue > 1000
LIMIT 5

Running the FTS Example

cargo run --bin skardi-server -- \
--ctx docs/lance/ctx_lance.yaml \
--pipeline docs/lance/pipelines/ \
--port 8080
curl -X POST http://localhost:8080/lance-full-text-search/execute \
-H "Content-Type: application/json" \
-d '{
"search_query": "premium wireless",
"category": null,
"limit": 10
}' | jq .

Response:

{
"success": true,
"data": [
{
"id": 0,
"item_id": 42,
"revenue": 1200.50,
"description": "A premium wireless organic speaker charger for electronics enthusiasts.",
"category": "electronics",
"relevance": 4.25
}
],
"rows": 10,
"execution_time_ms": 5
}

Example: Search with Category Filter

curl -X POST http://localhost:8080/lance-full-text-search/execute \
-H "Content-Type: application/json" \
-d '{
"search_query": "umbrella",
"category": "outdoor",
"limit": 5
}' | jq .
curl -X POST http://localhost:8080/lance-full-text-search/execute \
-H "Content-Type: application/json" \
-d '{
"search_query": "\"train to boston\"",
"category": null,
"limit": 10
}' | jq .

Creating an INVERTED Index

To use lance_fts, your Lance dataset needs an INVERTED index on the text column. Create one using the Python SDK:

import lance

ds = lance.dataset("data/my_dataset.lance")
ds.create_scalar_index(
"text_column",
index_type="INVERTED",
with_position=True, # Required for phrase search
)

See scripts/prepare_fts_test_data.py for a complete example.

FTS Pipeline Parameters

ParameterTypeRequiredDescription
search_querystringYesSearch query (supports term, phrase, fuzzy, boolean syntax)
categorystringNoCategory filter (null = no filter)
limitintegerYesMaximum number of results

Troubleshooting FTS

"No results returned"

Ensure the Lance dataset has an INVERTED index on the target column. Without an index, FTS queries return empty results.

"Phrase search returns unexpected results"

Verify the INVERTED index was created with with_position=True. Without positions, phrase queries fall back to term matching.

Creating Your Own Vector Search Pipeline

1. Create Context Configuration

data_sources:
- name: "my_vectors"
type: "lance"
path: "data/my_vectors.lance/"
description: "My vector embeddings"

2. Create Pipeline Configuration

metadata:
name: my-vector-search
version: 1.0.0

query: |
SELECT
knn.id,
knn.item_id,
knn._distance as similarity
FROM lance_knn(
'my_vectors',
'vector',
(SELECT vector FROM my_vectors WHERE id = {query_id}),
{k}
) knn
WHERE knn.id != {query_id}

3. Run Your Pipeline

cargo run --bin skardi-server -- \
--ctx ctx_my_vectors.yaml \
--pipeline pipeline_my_search.yaml \
--port 8080

Monitoring Execution

Enable debug logs to see KNN execution:

RUST_LOG=debug cargo run --bin skardi-server -- \
--ctx docs/lance/ctx_lance.yaml \
--pipeline docs/lance/pipelines/ \
--port 8080

Look for:

INFO  source::lance::knn_table_function: Registering Lance table functions
INFO LanceKnnExec: Executing KNN search

Troubleshooting

"lance_knn: table 'xxx' not found in registry"

Ensure your context file uses type: "lance" for the data source and the table name matches.

"lance_knn: subquery must return exactly one column"

The query vector subquery must return a single column containing the vector.

"lance_knn: query_vector must be literal array or scalar subquery"

The third argument must be either a literal array [0.1, 0.2, ...] or a scalar subquery (SELECT vector FROM ...).

"Distance values seem incorrect"

Distance metric is determined by the Lance index. Check your index configuration:

  • L2 (Euclidean): Default for most cases
  • Cosine: Better for normalized vectors
  • Dot Product: For inner product similarity

Additional Resources