## Context
BSFG is a boundary appliance. Its failure modes are operational before they are semantic: append failures, replication lag, confirmation gaps, object-store write issues, mTLS problems, and retention pressure. Operators need enough visibility to answer:
- is the boundary service healthy?
- are facts being appended and confirmed?
- is cross-zone lag growing?
- are attachments being stored successfully?
- are retries, duplicates, or auth failures increasing?
The observability model must therefore support routine operations and incident response without turning BSFG into a telemetry-heavy platform in its own right.
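As an illustration, the operational questions above map naturally onto threshold checks over aggregate metrics. The sketch below is illustrative only; the metric names and thresholds are assumptions for this example, not BSFG-defined values:

```python
# Minimal sketch: mapping the operational questions above to threshold
# checks over a metrics snapshot. All field names and thresholds are
# illustrative assumptions, not part of BSFG.

def boundary_health(snapshot: dict) -> list[str]:
    """Return a list of detected problems for a metrics snapshot."""
    problems = []
    if snapshot.get("replication_lag_seconds", 0) > 30:
        problems.append("cross-zone replication lag is growing")
    if snapshot.get("append_failures_per_min", 0) > 5:
        problems.append("facts are failing to append")
    if snapshot.get("object_store_put_failures_per_min", 0) > 0:
        problems.append("attachments are failing to store")
    if snapshot.get("auth_failures_per_min", 0) > 10:
        problems.append("auth failures are increasing")
    return problems

healthy = boundary_health({"replication_lag_seconds": 2})
degraded = boundary_health({"replication_lag_seconds": 120,
                            "auth_failures_per_min": 50})
```

In practice such checks would live in an alerting system rather than application code; the point is that each question reduces to a comparison over a small number of aggregate series.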
## Options Considered
| Option | Description | Benefits | Drawbacks |
|---|---|---|---|
| Logs only | Emit application logs and rely on downstream log search for all operational diagnosis. | simple implementation<br>minimal surface area | weak aggregate visibility<br>poor alerting basis<br>lag and throughput trends are hard to see |
| Metrics only | Expose counters, gauges, and histograms, but avoid detailed logs. | good dashboards and alerting<br>small telemetry volume | poor incident forensics<br>hard to explain individual failures or rejected operations |
| Metrics + logs + full distributed tracing | Adopt tracing across every request, append, fetch, confirm, and object-store operation. | maximal visibility<br>strong end-to-end request analysis | heavier operational footprint<br>more moving parts<br>overkill for the current appliance scope |
| Metrics + structured logs (Selected) | Expose operational metrics for monitoring and alerting, plus structured logs for event-level diagnosis. | good operational baseline<br>supports dashboards and alerting<br>preserves incident-level detail<br>keeps telemetry surface moderate | correlation across systems is not as rich as full tracing<br>log structure must be governed consistently |
## Decision
BSFG will expose metrics and structured logs as its standard operational visibility model.
Metrics are used for health, capacity, throughput, lag, and error-rate monitoring. Structured logs are used for append failures, conflicting duplicates, object-store errors, authorization failures, and operator diagnosis.
Example metrics include:
- `bsfg_append_rate`
- `bsfg_fetch_rate`
- `bsfg_confirm_rate`
- `bsfg_dedupe_hits`
- `bsfg_conflicting_duplicates`
- `bsfg_replication_lag`
- `bsfg_auth_failures`
- `object_store_put_rate`
- `object_store_put_failures`
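A minimal sketch of how counters behind such metrics could be accumulated and rendered in Prometheus text exposition format, using only the Python standard library. In practice a metrics client library would handle registration and exposition; this toy registry is an illustration, not the BSFG implementation:

```python
from collections import Counter

class MetricsRegistry:
    """Toy counter registry rendering Prometheus-style text output.

    Note: rate metrics like bsfg_append_rate are typically derived
    by the monitoring system from an underlying counter; the raw
    counter is what the appliance would actually expose.
    """
    def __init__(self):
        self._counters = Counter()

    def inc(self, name: str, amount: int = 1) -> None:
        self._counters[name] += amount

    def render(self) -> str:
        lines = []
        for name, value in sorted(self._counters.items()):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {value}")
        return "\n".join(lines)

registry = MetricsRegistry()
registry.inc("bsfg_append_rate")
registry.inc("bsfg_auth_failures", 3)
print(registry.render())
```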
Structured log entries should include enough correlation data to connect an operational event to its semantic context, for example:
- `message_id`
- `from_zone`
- `to_zone`
- `stream`
- `subject`
- `predicate`
- `correlation_id`
- `error_code` (where applicable)
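A structured log entry carrying these fields could be emitted as one JSON object per line (JSON Lines), so downstream log search can filter on correlation fields. All field values below are invented examples for illustration:

```python
import json

# Sketch of a structured log entry for an append failure. The field
# set mirrors the correlation fields listed above; every value here
# is an invented example, not real BSFG output.
entry = {
    "event": "append_failure",
    "message_id": "msg-0001",
    "from_zone": "zone-a",
    "to_zone": "zone-b",
    "stream": "facts",
    "subject": "device-42",
    "predicate": "reported",
    "correlation_id": "corr-7f3a",
    "error_code": "APPEND_TIMEOUT",
}
line = json.dumps(entry, sort_keys=True)
print(line)  # one JSON object per line keeps entries machine-parseable
```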
Full distributed tracing is not part of the baseline architecture, but can be introduced later if operational evidence shows it is necessary.
## Consequences
Benefits:
- clear alerting and dashboard foundation
- good enough forensic detail for most incidents
- moderate operational complexity for a boundary appliance
- separation between aggregate monitoring and per-event diagnosis
Tradeoffs:
- end-to-end request reconstruction across multiple systems remains less direct than with tracing
- structured logging discipline must be maintained over time
- teams may still introduce tracing later for specific failure modes