
Observability & Operations Guide

Monitoring, metrics, logging, and operational procedures for BSFG nodes

Audience: Operators, SREs, platform engineers.

Use: Understand the monitoring model, alerting signals, and operational visibility expectations.

Overview

BSFG deployments require operational discipline to maintain reliability in industrial environments. Operators must be able to detect issues, understand system state, and perform safe upgrades without data loss.

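This guide provides practical procedures for monitoring, alerting, logging, troubleshooting, and routine operations such as node restarts, certificate rotation, and rolling upgrades.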

Monitoring Architecture

A typical BSFG monitoring architecture involves:

BSFG Node
├── Metrics (Prometheus format)
│   └→ Prometheus Server
│       └→ Grafana Dashboards
└── Logs (structured JSON or syslog)
    └→ Log Aggregation (ELK, OpenSearch, Splunk, Loki)
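
To illustrate the metrics leg of this diagram, the sketch below parses the plain-text format that Prometheus scrapes into a dictionary. It is a minimal sketch rather than a full parser: it assumes one sample per line with no trailing timestamp, and the function name parse_metrics and the sample text are hypothetical.

```python
SAMPLE = """\
# HELP bsfg_node_up Node health indicator
# TYPE bsfg_node_up gauge
bsfg_node_up 1
bsfg_store_buffer_fill_percent 45
"""

def parse_metrics(text):
    """Parse simple Prometheus text-format samples into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token; everything before it
        # is the series name (including any {label="..."} part).
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples

metrics = parse_metrics(SAMPLE)
print(metrics["bsfg_store_buffer_fill_percent"])
```

A parser like this is only useful for ad hoc checks from a shell; in the architecture above, Prometheus itself does the scraping and storage.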

Components

Core Metrics

BSFG should expose metrics in Prometheus format. Key metrics to monitor:

Replication Metrics

| Metric Name | Type | Description | Healthy Range |
| --- | --- | --- | --- |
| bsfg_replication_lag_seconds | Gauge | Seconds between fact appended to ISB and confirmed at IFB | < 1 second |
| bsfg_replication_lag_high_watermark | Gauge | Highest replication lag observed (for alerting on spikes) | < 5 seconds |
| bsfg_frontier_offset | Gauge | Current frontier (highest contiguous committed offset) | Increasing monotonically |
| bsfg_fetch_requests_total | Counter | Total FetchFacts RPC calls (peer-to-peer) | Steady or increasing |
| bsfg_confirm_requests_total | Counter | Total ConfirmReceipt RPC calls | Matches fetch activity |
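
The phrase "highest contiguous committed offset" can be made concrete with a small sketch. It assumes offsets are integers that start at 0, which this guide does not state, and the function name frontier is hypothetical.

```python
def frontier(committed, start=0):
    """Return the highest offset N such that start..N are all committed.

    Returns start - 1 when the first offset itself is not yet committed.
    """
    offset = start
    while offset in committed:
        offset += 1
    return offset - 1

# Offsets 4 and 5 are committed, but 3 is missing, so the frontier stops at 2.
print(frontier({0, 1, 2, 4, 5}))
```

Because a gap holds the frontier back, a frontier that stops advancing while facts are still being appended points at a stuck or missing offset rather than at an idle system.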

Ingestion Metrics

| Metric Name | Type | Description | Healthy Range |
| --- | --- | --- | --- |
| bsfg_append_requests_total | Counter | Total AppendFact RPC calls (producer ingestion) | Matches producer throughput |
| bsfg_append_failures_total | Counter | AppendFact failures (retryable or permanent) | Zero or very low |
| bsfg_store_buffer_fill_percent | Gauge | Percentage of ISB/ESB capacity used | < 50% |
| bsfg_append_latency_seconds | Histogram | Time to append fact (p50, p95, p99) | p99 < 100ms |
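
Percentiles such as p99 are not stored in a Prometheus histogram; they are estimated from cumulative bucket counts. The sketch below shows the usual linear-interpolation estimate, the same idea as the PromQL histogram_quantile function. The bucket boundaries and counts are invented for illustration.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # In the +Inf bucket (or an empty one) there is nothing to
            # interpolate over, so report the previous finite bound.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical append-latency buckets, in seconds.
buckets = [(0.01, 50), (0.05, 90), (0.1, 99), (math.inf, 100)]
print(histogram_quantile(0.99, buckets))
```

The estimate can never be finer than the bucket layout, so a target such as p99 < 100ms is only meaningful if a bucket boundary sits at or near 0.1 seconds.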

Consumer Metrics

| Metric Name | Type | Description | Healthy Range |
| --- | --- | --- | --- |
| bsfg_consumer_backlog_size | Gauge | Unconfirmed facts waiting for consumer processing | < 100 facts |
| bsfg_consumer_lag_seconds | Gauge | Age of oldest unconfirmed fact at IFB/EFB | < 10 seconds |
| bsfg_confirm_rate_per_second | Gauge | Facts confirmed per second (consumer throughput) | Steady, matching producer rate |

Artifact Metrics

| Metric Name | Type | Description | Healthy Range |
| --- | --- | --- | --- |
| bsfg_artifact_upload_total | Counter | Total PutObject calls (artifact uploads) | Matches producer artifact activity |
| bsfg_artifact_upload_failures | Counter | Failed artifact uploads | Zero or very low |
| bsfg_artifact_retrieval_total | Counter | Total GetObject calls (artifact downloads) | Matches consumer artifact activity |
| bsfg_artifact_retrieval_failures | Counter | Failed artifact retrievals (missing or inaccessible) | Zero |
| bsfg_object_store_usage_bytes | Gauge | Artifact storage used by bucket | Trending within capacity |

System Metrics

| Metric Name | Type | Description | Healthy Range |
| --- | --- | --- | --- |
| bsfg_node_up | Gauge | Node health indicator (1 = up, 0 = down) | 1 (up) |
| bsfg_certificate_expiry_seconds | Gauge | Seconds until mTLS certificate expires | > 2,592,000 (30 days) |
| bsfg_tls_handshake_errors_total | Counter | mTLS handshake failures (peer auth failure) | Zero |
| bsfg_rpc_latency_seconds | Histogram | RPC call duration (AppendFact, FetchFacts, etc.) | p99 < 500ms |

Alerting Rules

Configure alerts based on these thresholds. Adjust thresholds per deployment based on expected throughput and latency.

| Condition | Threshold | Severity | Action |
| --- | --- | --- | --- |
| Replication lag > 5 seconds | 5 sec | Warning | Check network, verify peer connectivity |
| Replication lag > 30 seconds | 30 sec | Critical | Check for network partition, node failure, or consumer backlog |
| Consumer backlog > 1000 facts | 1000 | Warning | Check consumer health; may be processing slowly or stalled |
| Consumer lag > 60 seconds | 60 sec | Critical | Consumer is severely behind; investigate failure |
| Store buffer fill > 80% | 80% | Warning | Buffer approaching capacity; check replication and consumer progress |
| Store buffer fill > 95% | 95% | Critical | Buffer near exhaustion; risk of producer backpressure or data loss |
| Append failures > 0 (per minute) | > 0 | Warning | Producer experiencing errors; check why AppendFact is failing |
| Artifact retrieval failures > 0 | > 0 | Critical | Consumer cannot retrieve artifact; storage issue or missing artifact |
| TLS handshake errors > 0 | > 0 | Critical | Peer authentication failing; check certificates and CA trust |
| Certificate expiry in < 30 days | 30 days | Warning | Begin certificate renewal process |
| Node down (bsfg_node_up = 0) | N/A | Critical | Node is unreachable; check health, restart if necessary |
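
As an illustration of how the gauge-based rules above combine, the sketch below evaluates a snapshot of metric values and reports the most severe matching rule per metric. It is not a replacement for Prometheus alerting rules; the RULES layout and the evaluate function are hypothetical, and the counter-based rules (append failures, TLS errors) are omitted because they need rates rather than instantaneous values.

```python
# (metric, threshold, severity): a rule fires when value > threshold.
# For each metric, the more severe rule is listed first so that it wins.
RULES = [
    ("bsfg_replication_lag_seconds", 30, "critical"),
    ("bsfg_replication_lag_seconds", 5, "warning"),
    ("bsfg_consumer_backlog_size", 1000, "warning"),
    ("bsfg_consumer_lag_seconds", 60, "critical"),
    ("bsfg_store_buffer_fill_percent", 95, "critical"),
    ("bsfg_store_buffer_fill_percent", 80, "warning"),
]

def evaluate(snapshot):
    """Return {metric: severity} for every rule that fires on the snapshot."""
    alerts = {}
    for metric, threshold, severity in RULES:
        value = snapshot.get(metric)
        if value is not None and value > threshold and metric not in alerts:
            alerts[metric] = severity
    # Node down is an equality check rather than a threshold.
    if snapshot.get("bsfg_node_up") == 0:
        alerts["bsfg_node_up"] = "critical"
    return alerts

print(evaluate({
    "bsfg_replication_lag_seconds": 12,
    "bsfg_store_buffer_fill_percent": 97,
    "bsfg_node_up": 1,
}))
```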

Log Structure

BSFG nodes should emit structured logs for observability. The example entry below shows the fields a log line should include.

Log Fields

Example Log Entry

{
  "timestamp": "2026-03-06T14:30:45.123Z",
  "level": "INFO",
  "zone": "enterprise-bsfg",
  "message_id": "msg_abc123def456",
  "operation": "AppendFact",
  "predicate": "order_created",
  "result": "success",
  "duration_ms": 12,
  "buffer_fill_percent": 45
}
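
A log line with this shape can be produced with nothing more than the standard library. The sketch below is one possible formatter, not BSFG's actual logging code; the function name format_log is hypothetical, and the field names are taken from the example entry.

```python
import json
from datetime import datetime, timezone

def format_log(level, zone, operation, result, duration_ms, now=None, **extra):
    """Return one JSON log line; `now` must be timezone-aware if given."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # ISO 8601 with millisecond precision and a trailing "Z" for UTC.
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"
    entry = {
        "timestamp": timestamp,
        "level": level,
        "zone": zone,
        "operation": operation,
        "result": result,
        "duration_ms": duration_ms,
    }
    # Operation-specific fields such as message_id or predicate.
    entry.update(extra)
    return json.dumps(entry)

print(format_log("INFO", "enterprise-bsfg", "AppendFact", "success", 12,
                 message_id="msg_abc123def456", predicate="order_created"))
```

Emitting one JSON object per line keeps the output directly ingestible by the log aggregators listed in the monitoring architecture.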

Troubleshooting Scenarios

Replication Has Stopped (Lag Growing)

Symptom: Replication lag continuously increases; facts are not flowing between zones.

Diagnosis:

  1. Check replication lag metric: is it > 30 seconds?
  2. Check consumer backlog: is it growing?
  3. Check network connectivity: ping between BSFG nodes
  4. Check firewall rules: is traffic allowed on RPC port (9443)?
  5. Check certificates: are they valid and trusted?
  6. Check logs for TLS errors or connection timeouts
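
Steps 3 and 4 can be partly automated. A ping only proves ICMP reachability, so the sketch below attempts a TCP connection to the RPC port instead, which also exercises the firewall rule. The function name port_open and the peer hostname are hypothetical; port 9443 comes from step 4.

```python
import socket

def port_open(host, port=9443, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False

if __name__ == "__main__":
    # "peer-bsfg.example.internal" is a placeholder for the peer node.
    print(port_open("peer-bsfg.example.internal"))
```

A successful connection shows only that the port is reachable; certificate problems (steps 5 and 6) surface later, during the TLS handshake.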

Resolution:

Consumer Backlog Growing (Not Draining)

Symptom: Consumer backlog continuously increases; facts are fetched but not confirmed.

Diagnosis:

  1. Check confirm rate: is it zero or very low?
  2. Check consumer process: is it running or hung?
  3. Check logs for consumer errors or exceptions
  4. Check that the consumer processes facts idempotently (i.e., it is not getting stuck on duplicates)
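
The idempotency check in step 4 is easier to reason about with a concrete shape in mind. The sketch below deduplicates on message_id, the identifier shown in the example log entry. It is an assumption that facts carry this field, and the in-memory set stands in for whatever durable store a real consumer would use.

```python
def process_once(fact, seen, handle):
    """Handle a fact at most once; return True if it was newly processed."""
    message_id = fact["message_id"]
    if message_id in seen:
        # Duplicate delivery: skip the side effects. The caller should
        # still confirm receipt so the backlog can drain.
        return False
    handle(fact)
    # Record the ID only after the handler succeeds, so that a failed
    # attempt is retried on redelivery.
    seen.add(message_id)
    return True

seen_ids = set()
handled = []
process_once({"message_id": "msg_1"}, seen_ids, handled.append)
process_once({"message_id": "msg_1"}, seen_ids, handled.append)  # ignored
print(len(handled))
```

A consumer that raises on duplicates instead of skipping them never confirms the fact, which produces the growing-backlog symptom described above.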

Resolution:

Artifact Retrieval Failures

Symptom: GetObject calls failing with "not found" or access denied errors.

Diagnosis:

  1. Check artifact reference in fact: bucket, key, digest
  2. Check object store: does the bucket exist? Is the key present?
  3. Check storage access: can BSFG node read from the bucket?
  4. Check retention policy: has the artifact been garbage collected?
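
When the object is present but the consumer still rejects it, comparing the stored bytes against the digest in the artifact reference helps rule out corruption. The sketch below assumes the digest is a hex-encoded SHA-256 value, which this guide does not specify; digest_matches is a hypothetical helper.

```python
import hashlib
import hmac

def digest_matches(data, expected_hex):
    """Return True if SHA-256(data) equals the expected hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(actual, expected_hex.lower())

payload = b"example artifact bytes"
reference_digest = hashlib.sha256(payload).hexdigest()
print(digest_matches(payload, reference_digest))
```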

Resolution:

TLS Handshake Failures

Symptom: TLS handshake errors in logs, replication unable to establish connections.

Diagnosis:

  1. Check certificate validity: is it expired?
  2. Check certificate CN: does it match the peer's zone identity?
  3. Check CA trust: is the certificate CA trusted by peers?
  4. Check certificate chain: is the full chain available?
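
For step 1, the expiry check can be scripted once the certificate's notAfter value is known. The sketch below uses ssl.cert_time_to_seconds from the Python standard library and applies the 30-day renewal threshold from the alerting rules. How the notAfter string is obtained (for example from a parsed peer certificate) is left out, and the helper names are hypothetical.

```python
import ssl
import time

THIRTY_DAYS = 30 * 24 * 3600  # 2,592,000 seconds

def seconds_until_expiry(not_after, now=None):
    """not_after uses the certificate time format, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return expires_at - current

def needs_renewal(not_after, now=None):
    """True when the certificate expires in under 30 days (or already has)."""
    return seconds_until_expiry(not_after, now) < THIRTY_DAYS

print(needs_renewal("Jan  5 09:34:43 2018 GMT"))
```

A negative result from seconds_until_expiry means the certificate has already expired, which is the most common cause of a sudden handshake failure on a previously healthy link.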

Resolution:

Store Buffer Exhaustion

Symptom: Store buffer fill > 95%, risk of backpressure or data loss.

Diagnosis:

  1. Check replication lag: are facts being produced faster than they are replicated?
  2. Check consumer backlog: are facts being replicated faster than they are consumed?
  3. Check retention policy: is the TTL too long, so that facts are not being truncated?
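
To judge how urgent the situation is, two readings of the buffer fill gauge are enough for a rough time-to-full estimate. The sketch below uses simple linear extrapolation, which assumes the fill rate stays constant; seconds_until_full is a hypothetical helper.

```python
def seconds_until_full(fill_then, fill_now, interval_seconds):
    """Extrapolate when the buffer reaches 100%.

    fill_then and fill_now are percentages taken interval_seconds apart.
    Returns None if the buffer is draining or flat.
    """
    rate = (fill_now - fill_then) / interval_seconds  # percent per second
    if rate <= 0:
        return None
    return (100.0 - fill_now) / rate

# Fill went from 80% to 85% in one minute: roughly three minutes remain.
print(seconds_until_full(80, 85, 60))
```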

Resolution:

Operational Playbooks

Playbook: Safe Node Restart

Restarting a BSFG node is safe due to durability guarantees:

  1. Record the current replication lag as a baseline (pre-restart state)
  2. Stop the BSFG service gracefully (send SIGTERM, wait for shutdown)
  3. Verify the service has stopped (check process list)
  4. Start the BSFG service (service start or systemctl start)
  5. Wait 30 seconds for startup and peer reconnection
  6. Check replication lag metric: should return to baseline within 1 minute
  7. Check consumer backlog: should drain as normal
  8. Check logs for any errors during startup
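
Step 6 of this playbook amounts to polling until the lag returns to its recorded baseline or a deadline passes. The sketch below shows that loop with the metric reader passed in as a function, since how the lag is fetched depends on the deployment. The tolerance and poll interval are illustrative assumptions; the one-minute timeout matches step 6.

```python
import time

def wait_for_baseline(read_lag, baseline, tolerance=0.5, timeout=60.0,
                      poll=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll read_lag() until it is within tolerance of baseline.

    Returns True on recovery, False if the timeout elapses first.
    """
    deadline = clock() + timeout
    while True:
        if read_lag() <= baseline + tolerance:
            return True
        if clock() >= deadline:
            return False
        sleep(poll)

# Simulated readings (seconds of lag) after a restart.
readings = iter([10.0, 4.0, 0.8])
print(wait_for_baseline(lambda: next(readings), baseline=0.5, sleep=lambda s: None))
```

Injecting sleep and clock keeps the loop testable without real delays; in production use the defaults apply.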

Playbook: Certificate Rotation (Planned)

Rotate certificates before expiration without service interruption:

  1. Generate new certificate and key for zone (e.g., enterprise-bsfg)
  2. Sign certificate with enterprise PKI CA
  3. Copy certificate and key to BSFG node(s)
  4. Reload BSFG configuration (SIGHUP or restart)
  5. Verify certificate loaded: check the bsfg_certificate_expiry_seconds metric
  6. Test peer connectivity: RPC calls succeed
  7. Archive old certificate for audit trail
  8. Document rotation in change log

Playbook: Emergency Certificate Rotation (Key Compromise)

If a node's private key is compromised, rotate immediately:

  1. Generate new certificate and key immediately
  2. Copy new cert/key to node
  3. Restart BSFG service (forces reconnection with new certificate)
  4. Verify peers accept new certificate (check TLS handshake success)
  5. Revoke old certificate in PKI system
  6. Destroy old private key securely
  7. Alert security team and document incident

Playbook: Rolling Upgrade (Multi-Node HA)

Upgrade BSFG binary or dependencies with zero downtime:

  1. Plan upgrade during low-traffic period (if possible)
  2. Drain traffic from node 1 (stop accepting new RPC calls)
  3. Wait 30 seconds for in-flight RPCs to complete
  4. Stop node 1 gracefully
  5. Upgrade binary and dependencies on node 1
  6. Start node 1, wait for peer reconnection (30 seconds)
  7. Verify replication resumes: check lag metric
  8. Re-enable traffic to node 1
  9. Repeat for nodes 2, 3, etc. (one at a time)
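
The playbook's control flow, one node at a time with a health gate before traffic is restored, can be sketched as a small driver. Every action is passed in as a callable because the actual drain, stop, and upgrade commands are deployment-specific; none of these function names come from BSFG itself.

```python
import time

def rolling_upgrade(nodes, drain, stop, upgrade, start,
                    replication_healthy, enable, wait=time.sleep):
    """Upgrade nodes sequentially; abort at the first unhealthy node."""
    upgraded = []
    for node in nodes:
        drain(node)
        wait(30)  # let in-flight RPCs complete
        stop(node)
        upgrade(node)
        start(node)
        wait(30)  # allow peer reconnection
        if not replication_healthy(node):
            # Leave the node drained and stop before touching the next one.
            raise RuntimeError(f"replication did not resume on {node}")
        enable(node)
        upgraded.append(node)
    return upgraded

if __name__ == "__main__":
    step = lambda name: (lambda node: print(name, node))
    rolling_upgrade(["node1", "node2"], step("drain"), step("stop"),
                    step("upgrade"), step("start"), lambda node: True,
                    step("enable"), wait=lambda seconds: None)
```

Aborting on the first failed health check keeps the remaining nodes on the old version, so at most one node is out of service when something goes wrong.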

Monitoring Dashboard Recommendations

A comprehensive Grafana dashboard should include:

Pre-Deployment Checklist

Cross-Links to Related Documentation