## Context
BSFG is a boundary appliance. Its failure modes are operational before they are semantic: append failures, replication lag, confirmation gaps, object-store write issues, mTLS problems, and retention pressure. Operators need enough visibility to answer:
- is the boundary service healthy?
- are facts being appended and confirmed?
- is cross-zone lag growing?
- are attachments being stored successfully?
- are retries, duplicates, or auth failures increasing?
The observability model must therefore support routine operations and incident response without turning BSFG into a telemetry-heavy platform in its own right.
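As an illustration, the operational questions above map naturally onto threshold checks over aggregate metrics. The sketch below is illustrative only; the metric names and thresholds are assumptions for this example, not BSFG-defined values:

```python
# Minimal sketch: mapping the operational questions above to threshold
# checks over a metrics snapshot. All field names and thresholds are
# illustrative assumptions, not part of BSFG.

def boundary_health(snapshot: dict) -> list[str]:
    """Return a list of detected problems for a metrics snapshot."""
    problems = []
    if snapshot.get("replication_lag_seconds", 0) > 30:
        problems.append("cross-zone replication lag is growing")
    if snapshot.get("append_failures_per_min", 0) > 5:
        problems.append("facts are failing to append")
    if snapshot.get("object_store_put_failures_per_min", 0) > 0:
        problems.append("attachments are failing to store")
    if snapshot.get("auth_failures_per_min", 0) > 10:
        problems.append("auth failures are increasing")
    return problems

healthy = boundary_health({"replication_lag_seconds": 2})
degraded = boundary_health({"replication_lag_seconds": 120,
                            "auth_failures_per_min": 50})
```

In practice such checks would live in an alerting system rather than application code; the point is that each question reduces to a comparison over a small number of aggregate series.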
## Options Considered
| Option | Description | Benefits | Drawbacks |
|---|---|---|---|
| Logs only | Emit application logs and rely on downstream log search for all operational diagnosis. | simple implementation<br>minimal surface area | weak aggregate visibility<br>poor alerting basis<br>lag and throughput trends are hard to see |
| Metrics only | Expose counters, gauges, and histograms, but avoid detailed logs. | good dashboards and alerting<br>small telemetry volume | poor incident forensics<br>hard to explain individual failures or rejected operations |
| Metrics + logs + full distributed tracing | Adopt tracing across every request, append, fetch, confirm, and object-store operation. | maximal visibility<br>strong end-to-end request analysis | heavier operational footprint<br>more moving parts<br>overkill for the current appliance scope |
| Metrics + structured logs (Selected) | Expose operational metrics for monitoring and alerting, plus structured logs for event-level diagnosis. | good operational baseline<br>supports dashboards and alerting<br>preserves incident-level detail<br>keeps telemetry surface moderate | correlation across systems is not as rich as full tracing<br>log structure must be governed consistently |
## Decision
BSFG will expose metrics and structured logs as its standard operational visibility model.
Metrics are used for health, capacity, throughput, lag, and error-rate monitoring. Structured logs are used for append failures, conflicting duplicates, object-store errors, authorization failures, and operator diagnosis.
Example metrics include:
- `bsfg_append_rate`
- `bsfg_fetch_rate`
- `bsfg_confirm_rate`
- `bsfg_dedupe_hits`
- `bsfg_conflicting_duplicates`
- `bsfg_replication_lag`
- `bsfg_auth_failures`
- `object_store_put_rate`
- `object_store_put_failures`
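A minimal sketch of how counters behind such metrics could be accumulated and rendered in Prometheus text exposition format, using only the Python standard library. In practice a metrics client library would handle registration and exposition; this toy registry is an illustration, not the BSFG implementation:

```python
from collections import Counter

class MetricsRegistry:
    """Toy counter registry rendering Prometheus-style text output.

    Note: rate metrics like bsfg_append_rate are typically derived
    by the monitoring system from an underlying counter; the raw
    counter is what the appliance would actually expose.
    """
    def __init__(self):
        self._counters = Counter()

    def inc(self, name: str, amount: int = 1) -> None:
        self._counters[name] += amount

    def render(self) -> str:
        lines = []
        for name, value in sorted(self._counters.items()):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {value}")
        return "\n".join(lines)

registry = MetricsRegistry()
registry.inc("bsfg_append_rate")
registry.inc("bsfg_auth_failures", 3)
print(registry.render())
```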
Structured log entries should include enough correlation data to connect an operational event to its semantic context, for example:
- `message_id`
- `from_zone`
- `to_zone`
- `stream`
- `subject`
- `predicate`
- `correlation_id`
- `error_code` (where applicable)
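A structured log entry carrying these fields could be emitted as one JSON object per line (JSON Lines), so downstream log search can filter on correlation fields. All field values below are invented examples for illustration:

```python
import json

# Sketch of a structured log entry for an append failure. The field
# set mirrors the correlation fields listed above; every value here
# is an invented example, not real BSFG output.
entry = {
    "event": "append_failure",
    "message_id": "msg-0001",
    "from_zone": "zone-a",
    "to_zone": "zone-b",
    "stream": "facts",
    "subject": "device-42",
    "predicate": "reported",
    "correlation_id": "corr-7f3a",
    "error_code": "APPEND_TIMEOUT",
}
line = json.dumps(entry, sort_keys=True)
print(line)  # one JSON object per line keeps entries machine-parseable
```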
Full distributed tracing is not part of the baseline architecture, but can be introduced later if operational evidence shows it is necessary.
## Consequences
Benefits:
- clear alerting and dashboard foundation
- good enough forensic detail for most incidents
- moderate operational complexity for a boundary appliance
- separation between aggregate monitoring and per-event diagnosis
Tradeoffs:
- end-to-end request reconstruction across multiple systems remains less direct than with tracing
- structured logging discipline must be maintained over time
- teams may still introduce tracing later for specific failure modes