Architecture Decision Record

BSFG ADR-0018

Status: Accepted · Date: 2026-03-06

Context

BSFG is a boundary appliance. Its failure modes are operational before they are semantic: append failures, replication lag, confirmation gaps, object-store write issues, mTLS problems, and retention pressure. Operators need enough visibility to answer:

  • is the boundary service healthy?
  • are facts being appended and confirmed?
  • is cross-zone lag growing?
  • are attachments being stored successfully?
  • are retries, duplicates, or auth failures increasing?

The observability model must therefore support routine operations and incident response without turning BSFG into a telemetry-heavy platform in its own right.

Options Considered

Option: Logs only
Description: Emit application logs and rely on downstream log search for all operational diagnosis.
Benefits:
  • simple implementation
  • minimal surface area
Drawbacks:
  • weak aggregate visibility
  • poor alerting basis
  • lag and throughput trends are hard to see

Option: Metrics only
Description: Expose counters, gauges, and histograms, but avoid detailed logs.
Benefits:
  • good dashboards and alerting
  • small telemetry volume
Drawbacks:
  • poor incident forensics
  • hard to explain individual failures or rejected operations

Option: Metrics + logs + full distributed tracing
Description: Adopt tracing across every request, append, fetch, confirm, and object-store operation.
Benefits:
  • maximal visibility
  • strong end-to-end request analysis
Drawbacks:
  • heavier operational footprint
  • more moving parts
  • overkill for the current appliance scope

Option: Metrics + structured logs (Selected)
Description: Expose operational metrics for monitoring and alerting, plus structured logs for event-level diagnosis.
Benefits:
  • good operational baseline
  • supports dashboards and alerting
  • preserves incident-level detail
  • keeps telemetry surface moderate
Drawbacks:
  • correlation across systems is not as rich as full tracing
  • log structure must be governed consistently

Decision

BSFG will expose metrics and structured logs as its standard operational visibility model.

Metrics are used for health, capacity, throughput, lag, and error-rate monitoring. Structured logs are used for append failures, conflicting duplicates, object-store errors, authorization failures, and operator diagnosis.

Example metrics include:

bsfg_append_rate
bsfg_fetch_rate
bsfg_confirm_rate
bsfg_dedupe_hits
bsfg_conflicting_duplicates
bsfg_replication_lag
bsfg_auth_failures
object_store_put_rate
object_store_put_failures
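
As a rough illustration of how such metrics might be recorded, the following Python sketch keeps counters and gauges in a small in-memory registry. It is not the BSFG implementation: the MetricsRegistry class, its methods, the plain "name value" output, and the use of seconds for bsfg_replication_lag are all assumptions made for this example. A real deployment would more likely use an established metrics library.

```python
import threading


class MetricsRegistry:
    """Hypothetical in-memory store; a stand-in for a real metrics library."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, name, amount=1.0):
        # Counter semantics: the value only ever increases.
        with self._lock:
            self._values[name] = self._values.get(name, 0.0) + amount

    def set(self, name, value):
        # Gauge semantics: the value is overwritten with the latest reading.
        with self._lock:
            self._values[name] = float(value)

    def snapshot(self):
        # Return a copy so callers cannot mutate the registry's state.
        with self._lock:
            return dict(self._values)

    def render(self):
        # One "name value" line per metric, sorted by name (assumed format).
        return "\n".join(f"{k} {v}" for k, v in sorted(self.snapshot().items()))


registry = MetricsRegistry()
registry.inc("bsfg_dedupe_hits")            # one duplicate was suppressed
registry.inc("bsfg_auth_failures", 2)       # two rejected requests
registry.set("bsfg_replication_lag", 4.5)   # assumed unit: seconds
print(registry.render())
```

Separating counter-style updates (inc) from gauge-style updates (set) mirrors the distinction between event counts such as bsfg_auth_failures and point-in-time readings such as bsfg_replication_lag.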

Structured log entries should include enough correlation data to connect an operational event to its semantic context, for example:

  • message_id
  • from_zone
  • to_zone
  • stream
  • subject
  • predicate
  • correlation_id
  • error_code where applicable
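
A structured log entry carrying these fields could be produced as in the sketch below, which uses only Python's standard logging and json modules. The JsonFormatter class, the logger name, the event name append_failed, the JSON-per-line format, and every field value are hypothetical; the ADR does not prescribe a log format or a language.

```python
import io
import json
import logging

# Correlation fields listed in this ADR; error_code is optional.
CORRELATION_FIELDS = (
    "message_id", "from_zone", "to_zone", "stream",
    "subject", "predicate", "correlation_id", "error_code",
)


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders one JSON object per log entry."""

    def format(self, record):
        entry = {"level": record.levelname, "event": record.getMessage()}
        for name in CORRELATION_FIELDS:
            # Fields passed via `extra` become attributes on the record.
            value = getattr(record, name, None)
            if value is not None:
                entry[name] = value
        return json.dumps(entry, sort_keys=True)


buffer = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(buffer)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("bsfg.example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output out of the root logger

# All values below are made up for illustration.
logger.error("append_failed", extra={
    "message_id": "m-123",
    "from_zone": "zone-a",
    "to_zone": "zone-b",
    "stream": "example-stream",
    "correlation_id": "c-456",
    "error_code": "EXAMPLE_ERROR",
})
print(buffer.getvalue().strip())
```

Omitting absent fields rather than emitting nulls keeps entries compact, while a fixed field list in one place makes it easier to govern log structure consistently.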

Full distributed tracing is not part of the baseline architecture, but can be introduced later if operational evidence shows it is necessary.

Consequences

Benefits:

  • clear alerting and dashboard foundation
  • sufficient forensic detail for most incidents
  • moderate operational complexity for a boundary appliance
  • separation between aggregate monitoring and per-event diagnosis

Tradeoffs:

  • end-to-end request reconstruction across multiple systems remains less direct than with tracing
  • structured logging discipline must be maintained over time
  • teams may still introduce tracing later for specific failure modes