Operations Runbook

Monitoring, metrics, and operational procedures

Audience: Operators, support engineers.
Use: Follow routine operating procedures and incident-response steps for BSFG systems.

Operational Overview

BSFG has a simple operational model because its architecture is stateless at the boundary. The key operational concerns are durability and connectivity, not complex choreography.

Key Metrics

Monitor these metrics for each BSFG node and zone:

Replication Lag

The delay between a fact being appended to ISB/ESB and confirmed at IFB/EFB.
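As a rough illustration of the definition above, the lag for a single fact can be computed from two timestamps. This is a minimal sketch: the `FactTimestamps` type and its field names are hypothetical and not part of any BSFG API. A fact that has not been confirmed yet is measured against the current time, so a stalled boundary shows up as steadily growing lag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FactTimestamps:
    """Hypothetical record of when a fact was appended and confirmed."""
    appended_at: float             # epoch seconds, append to ISB/ESB
    confirmed_at: Optional[float]  # epoch seconds, confirm at IFB/EFB (None if pending)


def replication_lag_seconds(fact: FactTimestamps, now: float) -> float:
    """Return the append-to-confirm delay; pending facts are measured against `now`."""
    end = fact.confirmed_at if fact.confirmed_at is not None else now
    return max(0.0, end - fact.appended_at)
```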

Consumer Backlog

The number of unconfirmed facts at a forward buffer (IFB/EFB).

Buffer Fill Percentage

Store buffers (ISB/ESB) have a configurable capacity (e.g., 100 GB). Track the fill ratio, that is, used capacity as a percentage of configured capacity.
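The fill ratio itself is simple arithmetic. The sketch below assumes the used and configured sizes are available in bytes; the function name is illustrative only.

```python
GB = 10**9  # decimal gigabyte, used here only for the example values


def buffer_fill_percent(used_bytes: int, capacity_bytes: int) -> float:
    """Return used capacity as a percentage of configured capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 100.0 * used_bytes / capacity_bytes


# A 100 GB store buffer holding 80 GB of facts is 80% full.
fill = buffer_fill_percent(80 * GB, 100 * GB)
```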

Confirmation Rate

Facts confirmed per second at each forward buffer.

TLS Handshake Errors

The number of failed mTLS connection attempts.

RPC Latency

Time to complete each RPC operation (AppendFact, FetchFacts, ConfirmReceipt, PutObject).

Alert Thresholds

Condition                         Threshold        Action
Replication lag                   > 1 second       Check network; verify boundary connectivity
Consumer backlog                  > 10,000 facts   Check consumer health; verify processing
Buffer fill                       > 80% capacity   Check retention policy; trigger cleanup
TLS handshake error               Any failure      Investigate immediately: verify certificates and CA trust
Certificate expiry                < 30 days away   Begin renewal workflow
Fact TTL expiry (7 days default)  TTL threshold    Alert if unconfirmed facts will be truncated
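The numeric thresholds above can be checked mechanically against a metrics snapshot. The following is a sketch under assumptions: the `ZoneMetrics` type, its field names, and the alert labels are all hypothetical, and the TTL-expiry alert is left out because it depends on per-fact timestamps that this snapshot does not carry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ZoneMetrics:
    """Hypothetical snapshot of the metrics listed in this runbook."""
    replication_lag_s: float
    consumer_backlog: int
    buffer_fill_pct: float
    tls_handshake_errors: int
    cert_days_remaining: float


def evaluate_alerts(m: ZoneMetrics) -> List[str]:
    """Return the names of all threshold conditions that currently fire."""
    alerts = []
    if m.replication_lag_s > 1.0:
        alerts.append("replication_lag")
    if m.consumer_backlog > 10_000:
        alerts.append("consumer_backlog")
    if m.buffer_fill_pct > 80.0:
        alerts.append("buffer_fill")
    if m.tls_handshake_errors > 0:  # any failure is actionable
        alerts.append("tls_handshake_error")
    if m.cert_days_remaining < 30:
        alerts.append("certificate_expiry")
    return alerts
```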

Backpressure Policy

When buffer capacity approaches limits, BSFG enforces a backpressure policy.

Standard Deployment (Non-Safety-Critical)

if (buffer_fill >= 80%) {
  // Two options (choose one):
  option_1: reject new AppendFact calls
            (return error to producer)
  option_2: drop oldest unacknowledged facts
            (truncate without waiting for confirmation)
}
    

Safety-Critical / SIL-Regulated Deployment

In regulated environments (FDA, IEC 61508):

if (buffer_fill >= 80%) {
  MUST: reject new AppendFact calls
  MUST_NOT: drop unacknowledged data
  ACTION: alert operations team, trigger manual intervention
}
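The two policies above can be expressed as a single decision function. This is a minimal sketch, not BSFG's implementation: the enum names and the `safety_critical` flag are assumptions made for illustration. The point it shows is that a safety-critical deployment always rejects and never drops, regardless of the configured mode; alerting the operations team would happen alongside the rejection and is not modelled here.

```python
from enum import Enum


class Mode(Enum):
    REJECT_NEW = "reject_new"    # option 1: fail AppendFact
    DROP_OLDEST = "drop_oldest"  # option 2: truncate unacknowledged facts


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    DROP_OLDEST_THEN_ACCEPT = "drop_oldest_then_accept"


def backpressure_decision(fill_pct: float, mode: Mode, safety_critical: bool) -> Decision:
    """Decide how to handle an incoming AppendFact at the given fill level."""
    if fill_pct < 80.0:
        return Decision.ACCEPT
    # Safety-critical deployments must never drop unacknowledged data.
    if safety_critical or mode is Mode.REJECT_NEW:
        return Decision.REJECT
    return Decision.DROP_OLDEST_THEN_ACCEPT
```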
    

Failure Mode Analysis

1. ISB Crash (Store Buffer Failure)

Behavior:
  - Producers unable to append facts (AppendFact fails)
  - Existing facts in ISB are lost (if not replicated)
  - Consumers can still fetch from IFB (if facts already transferred)

Recovery:
  1. Detect: AppendFact returns error for > 1 minute
  2. Alert operations team
  3. Restart ISB (with data recovery if applicable)
  4. Producers retry AppendFact (idempotent)
  5. Confirm replication lag recovers
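Step 4 relies on AppendFact being idempotent, so a producer can safely resend the same fact while the ISB restarts. A hedged sketch of such a retry loop follows. The `append_fact` callable stands in for whatever client call a producer actually uses, and the backoff parameters and the choice of `ConnectionError` as the retryable error are assumptions.

```python
import time
from typing import Callable


def append_with_retry(
    append_fact: Callable[[str, bytes], None],
    idempotency_key: str,
    payload: bytes,
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry an append with exponential backoff; return the attempt that succeeded.

    The same idempotency key is sent on every attempt, so a retry after an
    ambiguous failure cannot create a duplicate fact.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            append_fact(idempotency_key, payload)
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```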
    

2. IFB Crash (Forward Buffer Failure)

Behavior:
  - Consumers unable to fetch facts (FetchFacts fails)
  - Cursor does not advance (confirmations stall)
  - ISB continues accepting writes (but will overflow if IFB stays down)

Recovery:
  1. Detect: FetchFacts returns error or consumer backlog > 10k
  2. Alert operations team
  3. Restart IFB (with data recovery)
  4. Verify cursor is recovered from checkpoint
  5. Consumers retry FetchFacts
  6. Confirm confirmation rate recovers
    

3. Network Partition (Boundary Unreachable)

Behavior:
  - Zone A BSFG cannot reach Zone B BSFG
  - Gate closes: autonomous mode activated
  - Producers in Zone A continue writing to ISB
  - Consumers in Zone A continue reading from IFB
  - Replication lag stalls (frontier does not advance)
  - Buffer fill increases over time (facts not replicated)

Duration: typically minutes to hours

Recovery:
  1. Monitor: replication lag > 30s = probable partition
  2. Verify: ping / traceroute to peer zone
  3. Check: firewall rules, TLS certificate validity, peer availability
  4. Fix: repair network, restore DNS, update firewall rules
  5. Reconnect: Reconciliation mode activates
  6. Replay: store buffer replays unconfirmed facts to forward buffer
  7. Confirm: cursor advances, buffer drains, replication lag returns to normal
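Steps 6 and 7 can be pictured as an ordered replay of everything past the last confirmed offset. The sketch below is a simplified model, not the actual reconciliation protocol: the tuple layout, the in-memory `forward` dictionary, and the function name are all hypothetical. It uses put-if-absent semantics on the forward side so that replaying the same facts twice is harmless.

```python
from typing import Dict, List, Tuple

Fact = Tuple[int, str, bytes]  # (offset, idempotency_key, payload), hypothetical layout


def replay_unconfirmed(store: List[Fact], confirmed_through: int,
                       forward: Dict[str, bytes]) -> int:
    """Replay facts after `confirmed_through` in offset order; return the new cursor."""
    cursor = confirmed_through
    for offset, key, payload in sorted(store, key=lambda f: f[0]):
        if offset <= confirmed_through:
            continue  # already confirmed before the partition
        forward.setdefault(key, payload)  # put-if-absent: duplicates are ignored
        cursor = offset
    return cursor
```

Because the forward side ignores keys it already holds, an interrupted replay can simply be restarted from the last known cursor.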
    

4. Hash Collision (Idempotency Key Collision)

Behavior (extremely rare):
  - Two different facts hash to the same idempotency_key
  - putIfAbsent rejects the second fact (already exists)
  - Producer sees "AlreadyExists" error

Prevention:
  - Use strong hash (SHA-256, not MD5 or CRC)
  - Use explicit producer event IDs (not payload hash) if hash collisions are a concern
  - Monitor for unusual rejection rates

Recovery:
  - Producer should emit a new fact with a different message_id
  - Update business process to avoid the collision
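The two key strategies under Prevention, and the put-if-absent behaviour described above, can be sketched as follows. This is an illustration under assumptions: the function names and the `producer_id:event_id` key format are hypothetical, and a plain dictionary stands in for the buffer's key index.

```python
import hashlib
from typing import Dict


def payload_key(payload: bytes) -> str:
    """Derive an idempotency key from the payload with SHA-256."""
    return hashlib.sha256(payload).hexdigest()


def event_key(producer_id: str, event_id: str) -> str:
    """Derive an idempotency key from an explicit producer event ID instead."""
    return f"{producer_id}:{event_id}"


def put_if_absent(index: Dict[str, bytes], key: str, payload: bytes) -> bool:
    """Store the payload unless the key exists; False models an AlreadyExists result."""
    if key in index:
        return False
    index[key] = payload
    return True
```

Note that a payload-derived key treats two facts with identical bytes as the same fact, which is the desired behaviour for retries but not for two distinct events that happen to carry the same content. Explicit event IDs avoid that ambiguity.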
    

5. Buffer Exhaustion (Capacity Limits Exceeded)

Behavior:
  - Buffer reaches 100% capacity (e.g., 100GB ISB full)
  - Backpressure policy activates: reject or drop
  - Producers may experience failures or data loss (if drop-oldest is enabled)

Root causes:
  - Consumers are dead or hung (not confirming)
  - TTL too long (facts retained too long)
  - Throughput too high for capacity

Recovery:
  1. Alert: buffer_fill == 100%
  2. Diagnosis: check consumer status, confirm rates, backlog
  3. Action:
     - If consumer down: restart consumer
     - If throughput high: increase capacity or reduce TTL
     - If misconfigured: review retention policy
  4. Drain: buffer fill decreases as facts are confirmed and truncated
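The root causes above come down to a capacity budget: in the worst case, where nothing is confirmed, a buffer must hold everything ingested during one TTL period. The sketch below works through that arithmetic; the function names are illustrative and the model ignores per-fact storage overhead.

```python
SECONDS_PER_DAY = 86_400


def max_sustainable_ingest_bps(capacity_bytes: float, ttl_days: float) -> float:
    """Highest average ingest rate (bytes/s) that fits in the buffer for one TTL period."""
    return capacity_bytes / (ttl_days * SECONDS_PER_DAY)


def seconds_until_full(capacity_bytes: float, used_bytes: float,
                       ingest_bps: float, drain_bps: float) -> float:
    """Estimate time until the buffer is full at the current net fill rate."""
    net = ingest_bps - drain_bps
    if net <= 0:
        return float("inf")  # buffer is stable or draining
    return max(0.0, (capacity_bytes - used_bytes) / net)


# With a 100 GB buffer and the default 7-day TTL, the worst-case budget is
# roughly 165 kB/s of sustained ingest.
budget = max_sustainable_ingest_bps(100e9, 7)
```

If observed throughput exceeds this budget, either capacity must grow or the TTL must shrink, which matches the "throughput high" action in step 3.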
    

Node Upgrades and Restarts

Rolling Upgrade (HA Setup)

If a zone has multiple BSFG node instances:

  1. Drain traffic from node 1 (stop accepting new connections)
  2. Wait for existing RPC calls to complete (grace period: 30 seconds)
  3. Upgrade node 1 (binary, configuration, dependencies)
  4. Restart node 1
  5. Verify connectivity: test RPC calls to peer zones
  6. Re-enable traffic to node 1
  7. Repeat for nodes 2, 3, etc.
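The rolling procedure above lends itself to a small orchestration loop. This is a sketch only: every callable passed in (`drain`, `upgrade`, `restart`, `verify`, `enable`) is a hypothetical hook that an operator would wire to their own tooling. The loop stops at the first node that fails verification, so traffic is never re-enabled on an unverified node and the remaining nodes stay on the old version.

```python
import time
from typing import Callable, List, Sequence


def rolling_upgrade(
    nodes: Sequence[str],
    drain: Callable[[str], None],
    upgrade: Callable[[str], None],
    restart: Callable[[str], None],
    verify: Callable[[str], bool],
    enable: Callable[[str], None],
    grace_period_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Upgrade nodes one at a time; return the nodes upgraded successfully."""
    done: List[str] = []
    for node in nodes:
        drain(node)            # stop accepting new connections
        sleep(grace_period_s)  # let in-flight RPC calls complete
        upgrade(node)
        restart(node)
        if not verify(node):   # test RPC calls to peer zones
            raise RuntimeError(f"{node} failed verification; halting rollout")
        enable(node)
        done.append(node)
    return done
```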

Single Node Upgrade (No HA)

Without HA, the upgrade causes temporary unavailability:

  1. Plan upgrade during maintenance window
  2. Notify consumers and producers (may see timeouts)
  3. Upgrade and restart BSFG node
  4. Verify recovery: check replication lag, consumer backlog
  5. Confirm zone is healthy before allowing normal traffic

Certificate Rotation

Planned Rotation (Before Expiry)

  1. Generate new certificate with same CN (zone identity)
  2. Install new certificate on BSFG node (or all instances)
  3. Reload or restart BSFG service
  4. Verify TLS handshake succeeds with peer zones
  5. Confirm RPC connectivity with peers
  6. Archive old certificate for audit trail
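To decide when a planned rotation is due, the remaining lifetime of a certificate can be computed from its `notAfter` field. The sketch below uses Python's standard `ssl.cert_time_to_seconds`, which parses the date format returned by `SSLSocket.getpeercert()`. The function names are illustrative, and the 30-day window is taken from the alert thresholds in this runbook.

```python
import ssl

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Return days remaining before the certificate's notAfter time."""
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / SECONDS_PER_DAY


def renewal_due(not_after: str, now_epoch: float, window_days: float = 30.0) -> bool:
    """True when the certificate expires within the renewal window."""
    return days_until_expiry(not_after, now_epoch) < window_days
```

In practice `now_epoch` would be `time.time()`, and `not_after` would come from the `notAfter` entry of the peer certificate dictionary.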

Emergency Rotation (Compromised Key)

  1. Immediately generate new certificate (new key)
  2. Install new certificate
  3. Restart BSFG node (force reconnection with peer zones)
  4. Monitor for connection errors (peers must trust new certificate)
  5. If peers pin the CA, notify them to reload the CA root
  6. Destroy old private key securely

Operational Checklist