Version: 0.2.0

Observability

skardi-server exports traces and metrics via OpenTelemetry (OTLP gRPC), giving you full visibility into query execution inside Grafana.

What is Instrumented

  • Traces — Every DataFusion execution plan node is wrapped with a span. You can see elapsed_compute, output_rows, spill counts, and optimizer rule timings per query.
  • Pipeline metrics — Per-pipeline request count, latency, and error rate exported via OTLP on every /:name/execute call.
  • Logs — Structured logs forwarded to Loki via the OTel Collector.

Pipeline Metrics

Every call to /:name/execute records two OTel metrics:

  • pipeline_requests_total (Counter). Labels: pipeline, status (success/error), error_type. Total requests by outcome.
  • pipeline_latency_ms (Histogram). Labels: pipeline. End-to-end handler latency in milliseconds.
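Both metrics can be exercised with a single request. A sketch, assuming a pipeline named my-pipeline (matching the sample output below) and that the endpoint accepts a POST with a JSON body of parameters:

```shell
# One successful call increments pipeline_requests_total{status="success"}
# and records a single pipeline_latency_ms sample.
curl -s -X POST http://localhost:8080/my-pipeline/execute \
  -H 'Content-Type: application/json' \
  -d '{}'
```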

Verify metrics are flowing — the OTel Collector exposes a Prometheus scrape endpoint at :8889. After firing at least one request, run:

curl -s http://localhost:8889/metrics | grep pipeline

Expected output:

pipeline_latency_ms_bucket{pipeline="my-pipeline",le="..."} 2
pipeline_latency_ms_count{pipeline="my-pipeline"} 2
pipeline_latency_ms_sum{pipeline="my-pipeline"} 45.3
pipeline_requests_total{pipeline="my-pipeline",status="success"} 1
pipeline_requests_total{pipeline="my-pipeline",status="error",error_type="parameter_validation_error"} 1

Note: The periodic exporter flushes every 60 seconds. If the output is empty, wait up to a minute and retry.
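Because of the 60-second flush interval, a one-shot curl fired right after a request can come back empty. A small polling helper takes the guesswork out of it (a sketch; the function name and timings are illustrative):

```shell
# Repeatedly evaluate a command until its output contains a pipeline_* metric.
# Defaults to 18 attempts at 5-second intervals (~90s, longer than one flush cycle).
wait_for_metrics() {
  local cmd="$1" attempts="${2:-18}" i
  for ((i = 0; i < attempts; i++)); do
    if eval "$cmd" | grep -q '^pipeline_'; then
      return 0
    fi
    sleep 5
  done
  return 1
}

# Usage: wait_for_metrics 'curl -s http://localhost:8889/metrics' && echo "metrics are flowing"
```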

Example PromQL queries for Grafana dashboards:

# Requests per second per pipeline
rate(pipeline_requests_total[5m])

# Error rate (fraction of failing requests)
rate(pipeline_requests_total{status="error"}[5m])
/ rate(pipeline_requests_total[5m])

# p99 latency per pipeline
histogram_quantile(0.99, rate(pipeline_latency_ms_bucket[5m]))

# p50 latency per pipeline
histogram_quantile(0.50, rate(pipeline_latency_ms_bucket[5m]))
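These expressions can also be evaluated outside Grafana, directly against Prometheus's HTTP API (assuming the stack's Prometheus at :9090, as configured above):

```shell
# Evaluate an instant PromQL query; --data-urlencode escapes the expression.
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.99, rate(pipeline_latency_ms_bucket[5m]))'
```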

Local Observability Stack

A docker-compose.yml is included with a pre-configured Grafana stack:

  • OTel Collector: receives OTLP from the server and fans out to backends. Ports 4317 (gRPC), 4318 (HTTP).
  • Grafana Tempo: trace storage.
  • Prometheus: metric storage; scrapes the collector at :8889. Port 9090.
  • Grafana Loki: log storage.
  • Grafana: visualization, with Tempo, Prometheus, and Loki datasources pre-provisioned. Port 3000.

# Start the stack
docker-compose -f observability/docker-compose.yml up -d

# Run the server pointing at the collector
OTLP_ENDPOINT=http://localhost:4317 RUST_LOG=debug cargo run -p skardi-server -- --port 8080

Then open Grafana at http://localhost:3000 — all three datasources (Tempo, Prometheus, Loki) are pre-provisioned.
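A quick way to confirm the stack came up is to probe each service. A sketch using Grafana's /api/health and Prometheus's /-/ready (the standard readiness endpoints for those services), plus the collector's scrape port from above:

```shell
# Print OK/DOWN for each service; -f makes curl fail on HTTP errors,
# --max-time bounds the wait if a port is not listening.
check() {
  if curl -sf --max-time 2 "$1" >/dev/null; then
    echo "OK $1"
  else
    echo "DOWN $1"
  fi
}

check http://localhost:3000/api/health   # Grafana
check http://localhost:9090/-/ready      # Prometheus
check http://localhost:8889/metrics     # OTel Collector scrape endpoint
```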

Environment Variables

  • OTLP_ENDPOINT (default: http://localhost:4317): OTLP gRPC endpoint to export to.
  • RUST_LOG (default: info): log level. Use debug to see per-query DataFusion span detail.

Log Levels

  • RUST_LOG=info — normal production operation; high-level server events only
  • RUST_LOG=debug — shows per-query DataFusion execution spans (plan nodes, row counts, optimizer timing)
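RUST_LOG also accepts per-target directives (the standard env-filter syntax understood by tracing-subscriber and env_logger), which keeps dependency noise out of debug runs. A sketch; the skardi_server target name is an assumption inferred from the package name:

```shell
# info everywhere, debug only for the server crate
RUST_LOG=info,skardi_server=debug cargo run -p skardi-server -- --port 8080
```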