Instrumentation Recipe
How to add per-stage tracing to a responder, using the pattern established for SearchResponderV2
(Gravsearch). Follow it to instrument a second vertical without re-deriving the design. The
reference implementation lives in
webapi/src/main/scala/org/knora/webapi/responders/v2/SearchResponderV2.scala.
The pattern in one sentence: open one root INTERNAL span named after the vertical, wrap each
pipeline stage in a child span via a small stageSpan helper, attach a bounded shape fingerprint
to the root, and make failures and interruptions legible without leaking user data.
1. Wire Tracing into the service
Declare tracing as an abstract member of the trait (not only a constructor param of the live
impl) so that any default methods on the trait can open spans before delegating:
trait SearchResponderV2 {
  // Telemetry used to open the root span and its per-stage child spans. Declared as an abstract
  // member so the trait's default methods can open the root + parse spans before delegating.
  protected def tracing: Tracing
  // ...
}
Provide it in the live class and add Tracing to the module's Dependencies alias so
ZLayer.derive picks it up from the environment:
final class SearchResponderV2Live(
  // ...other deps...
  override protected val tracing: Tracing,
) extends SearchResponderV2
// SearchResponderV2Module.scala
type Dependencies = /* ...other deps... */ & Tracing
2. Add the stageSpan helper
Copy this helper into the responder's companion object. It opens an INTERNAL span that is
automatically a child of whatever span is active on the fiber, records a sanitized error on
failure, and marks interruptions. It also passes a status mapper that maps failures to UNSET, so
the library's own status-setter is a no-op and never overwrites the status we set.
def stageSpan[A](tracing: Tracing, name: String)(effect: Task[A]): Task[A] =
  tracing.span(name, SpanKind.INTERNAL, statusMapper = unsetOnFailure) {
    tracing.getCurrentSpanUnsafe.flatMap { span =>
      effect
        .tapErrorCause(cause => ZIO.succeed(markSanitizedError(span, name, cause)))
        .onExit {
          case Exit.Failure(cause) if cause.isInterrupted =>
            ZIO.succeed {
              val _ = span.setAttribute("gravsearch.exit_reason", "interrupted")
              val _ = span.setStatus(StatusCode.ERROR, "interrupted")
            }
          case _ => ZIO.unit
        }
    }
  }
A thin protected final wrapper on the trait lets methods call stageSpan("name") { ... } without
passing tracing each time:
protected final def stageSpan[A](name: String)(effect: Task[A]): Task[A] =
  SearchResponderV2.stageSpan(tracing, name)(effect)
The root span is opened with the same helper — there is no separate root helper. Open the root, then open each stage inside it; FiberRef-carried context makes them children automatically.
3. Name the spans
- Root span = the vertical name: gravsearch.
- Stage spans = <vertical>.<stage>, lowercase, dotted, from a bounded set: gravsearch.parse, gravsearch.type_inspection, gravsearch.prequery.generate, gravsearch.prequery.execute, gravsearch.mainquery.generate, gravsearch.mainquery.execute, gravsearch.result_transform.
- Never put variable data (IRIs, counts, user input) in a span name; that explodes cardinality. Variable data goes in attributes, bounded data goes in the shape (step 5).
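The naming rules above can be enforced mechanically. The following is a minimal sketch, not part of the reference implementation: the SpanNames object, its members, and the regular expression are hypothetical names introduced here for illustration. It keeps the allowed names in one fixed set and checks that each one is lowercase and dotted.

```scala
object SpanNames {
  // Hypothetical registry: the root name plus the stage names listed above.
  val Root: String = "gravsearch"

  val Stages: Set[String] = Set(
    "gravsearch.parse",
    "gravsearch.type_inspection",
    "gravsearch.prequery.generate",
    "gravsearch.prequery.execute",
    "gravsearch.mainquery.generate",
    "gravsearch.mainquery.execute",
    "gravsearch.result_transform"
  )

  // Lowercase segments (letters, digits, underscores) separated by dots.
  private val WellFormed = "[a-z][a-z0-9_]*(\\.[a-z][a-z0-9_]*)*".r

  // True when the name follows the lowercase, dotted convention.
  def isWellFormed(name: String): Boolean = WellFormed.pattern.matcher(name).matches()

  // True only for names from the bounded set; anything carrying variable data is rejected.
  def isAllowed(name: String): Boolean = name == Root || Stages.contains(name)
}
```

A unit test over such a registry could assert that every stage name is well formed and that a name with an IRI appended is not allowed, which catches cardinality regressions before they reach the tracing backend.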
4. Wrap each stage — and omit stages that did not run
Wrap each stage effect in stageSpan. For example, the main-query trio runs only when the prequery
returned at least one resource — keep those spans inside the conditional so an empty result
simply has no main-query spans, rather than zero-duration placeholders:
mainQueryResults <-
  if (mainResourceIris.nonEmpty) {
    for {
      sparql   <- stageSpan("gravsearch.mainquery.generate")(/* build SPARQL */)
      response <- stageSpan("gravsearch.mainquery.execute")(/* triplestore.query(...) */)
      result   <- stageSpan("gravsearch.result_transform")(/* permission filter + assemble */)
    } yield result
  } else {
    ZIO.attempt(/* empty result */)
  }
Absent spans are a documented, legible signal — see the runbook's
four absent-data topologies.
The triplestore CLIENT span nests automatically under the *.execute stage because it runs inside
that stage's effect.
5. Attach a bounded shape, not user data
The single most important attribute rule: never set raw query text, instance IRIs, or user IDs as attributes. Instead derive a bounded shape from the parsed query and attach it to the root span:
def setShapeOnRoot(tracing: Tracing, query: ConstructQuery, resultType: QueryResultType): UIO[Unit] =
  tracing.getCurrentSpanUnsafe.map { span =>
    val shape = queryShape(query, resultType)
    val _ = span.setAttribute("gravsearch.query.shape", shape.label)
    val _ = span.setAttribute("gravsearch.schema_predicates", shape.predicates.mkString(","))
    shape.flags.foreach { case (flag, value) => val _ = span.setAttribute(s"gravsearch.shape.$flag", value) }
  }
Split the cardinality deliberately:
| Kind | Example | Cardinality | Use as |
|---|---|---|---|
| Composite shape label | gravsearch.query.shape = resource-list\|has_filter\|patterns:4-7\|joins:1 | Bounded (enums + bucketed counts) | Span attribute, safe as a metric label |
| Per-flag booleans | gravsearch.shape.has_filter = true | Bounded (fixed flag set) | Span attribute (for TraceQL filtering) |
| Ontology predicate names | gravsearch.schema_predicates = hasTitle,isPartOf | Higher (but ontology-bounded, never instance IRIs) | Span attribute only, never a metric label |
Bucket open-ended counts (pattern count, join count) into ranges (0, 1, 2-3, 4-7, 8+) so
the shape label stays bounded. Set the shape on the root immediately after parse succeeds.
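The bucketing can be sketched as a pure function. This is an illustrative sketch only: ShapeBuckets, bucket, and label are hypothetical names, and the field order of the label is an assumption modelled on the example value in the table above, not the actual queryShape implementation.

```scala
object ShapeBuckets {
  // Maps an open-ended count onto the fixed ranges 0, 1, 2-3, 4-7, 8+.
  // Negative inputs are treated as 0 (an assumption; counts should never be negative).
  def bucket(count: Int): String =
    if (count <= 0) "0"
    else if (count == 1) "1"
    else if (count <= 3) "2-3"
    else if (count <= 7) "4-7"
    else "8+"

  // Hypothetical label builder: result type, optional has_filter flag, then bucketed counts.
  def label(resultType: String, hasFilter: Boolean, patterns: Int, joins: Int): String = {
    val flag  = if (hasFilter) Seq("has_filter") else Seq.empty[String]
    val parts = Seq(resultType) ++ flag ++ Seq(s"patterns:${bucket(patterns)}", s"joins:${bucket(joins)}")
    parts.mkString("|")
  }
}
```

Under these assumptions, ShapeBuckets.label("resource-list", hasFilter = true, patterns = 5, joins = 1) yields resource-list|has_filter|patterns:4-7|joins:1. Because every component is either an enum value or one of five buckets, the number of distinct labels stays finite however large the query grows.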
6. Errors and interruptions without leaks
The error handling has one load-bearing invariant. zio-telemetry writes cause.prettyPrint into
the span status description on the ERROR branch — and for a SPARQL failure that string echoes the
offending FILTER literal (user data). To prevent the leak, the failure status mapper must map to
UNSET (which the OTel SDK no-ops), and we set our own sanitized status separately:
// LOAD-BEARING: must map to UNSET, never ERROR. UNSET is what stops cause.prettyPrint
// (which echoes the user's FILTER literal) from reaching the span status description.
private val unsetOnFailure: StatusMapper[Throwable, Any] =
  StatusMapper.failureNoException[Throwable](_ => StatusCode.UNSET)

/** Writes the sanitized ERROR status ("<stage>: <Class>", no message) + error.type onto the span. */
private def markSanitizedError(span: Span, stage: String, cause: Cause[Throwable]): Unit = {
  val kind = cause.failureOption.map(_.getClass.getSimpleName).getOrElse("defect")
  val _ = span.setStatus(StatusCode.ERROR, s"$stage: $kind")
  cause.failureOption.foreach { e => val _ = span.setAttribute("error.type", e.getClass.getSimpleName) }
}
- Typed failure → status ERROR, description exactly "<stage>: <ClassName>" (e.g. gravsearch.prequery.execute: TriplestoreException), plus error.type. No message, no stacktrace.
- Interruption → gravsearch.exit_reason = interrupted + status ERROR "interrupted" (set in stageSpan's onExit). OTel has no cancelled status, so this attribute is what distinguishes an interrupted query from a typed failure and from a benign empty result.
Do not relax the status mapper
Changing unsetOnFailure to map failures to ERROR re-introduces the cause.prettyPrint leak
in one edit. It is guarded by a description-equality test — keep that test.
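The invariant that the guard test protects can be modelled without a tracing backend. The sketch below is an assumption-laden illustration, not the real test: SanitizedStatus and its description method are hypothetical stand-ins for the string built in markSanitizedError, and IllegalStateException stands in for a triplestore failure. The actual test would assert on the status description of the recorded span.

```scala
object SanitizedStatus {
  // Mirrors the "<stage>: <Class>" format: class name only, never the exception message.
  def description(stage: String, error: Throwable): String =
    s"$stage: ${error.getClass.getSimpleName}"
}

object SanitizedStatusCheck {
  // Builds a failure whose message contains user data and checks that none of it leaks.
  def run(): String = {
    val userLiteral = "user supplied literal"
    val failure     = new IllegalStateException(s"""bad FILTER(?title = "$userLiteral")""")
    val desc        = SanitizedStatus.description("gravsearch.prequery.execute", failure)

    // Equality, not containment: any extra text in the description is a regression.
    assert(desc == "gravsearch.prequery.execute: IllegalStateException")
    assert(!desc.contains(userLiteral))
    desc
  }
}
```

Asserting exact equality rather than a prefix match is the point of the design: if the mapper is ever changed to ERROR and the pretty-printed cause reaches the description, the comparison fails even when the sanitized prefix is still present.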
Checklist for a new vertical
- tracing is an abstract member of the trait; Tracing added to the module Dependencies.
- One root INTERNAL span named after the vertical; one child span per stage, bounded names.
- Stages that may not run are wrapped inside their conditional (no placeholder spans).
- A bounded shape on the root; no raw text / instance IRIs / user IDs as attributes.
- Cardinality split: composite label + booleans are metric-safe; predicate lists are drill-down only.
- Failure mapper maps to UNSET; sanitized ERROR + error.type set explicitly; interruption sets exit_reason.
- A test asserting the failure status description equals "<stage>: <Class>" (no message).