Runbook: Cross-Zone Federation Bring-Up
Purpose
Establish authenticated, authorized, cursor-driven interaction between autonomous BSFG zones that are already functional on their own.
This runbook defines how operators create federation relationships without breaking local autonomy or introducing cross-zone availability dependencies.
Scope
This runbook covers:
- Peer identity and trust establishment
- Authorization configuration
- Exported stream configuration
- Cursor initialization policy
- Initial replay/fetch validation
- Partition and recovery validation
This runbook does not cover:
- Local zone deployment (see Runbook: Triad-HA Zone Deployment)
- Intra-zone failover mechanics
- Application-level integration
Reference
This runbook operationalizes the Reference Interaction Pattern: Cross-Zone BSFG Federation.
1. Preconditions
Verify all before proceeding. Halt if any precondition fails.
| Check | Method | Expected Result |
|---|---|---|
| Both zones healthy locally | Checklist: Triad-HA Commissioning passed | Both zones show "Passed" or "Passed with Exception" |
| Endpoints reachable and authenticatable | Approved health/TLS probe from each zone | Peer endpoint reachable; TLS handshake succeeds with configured trust; authenticated health response returned |
| Certificates issued for both zones | openssl x509 -in /opt/bsfg/certs/server.crt -noout -dates | Valid, not expired |
| Trust anchors distributed | CA certificates present and verify peer chain | ca.crt readable; peer certificate chains to installed trust anchor |
| Authorization policy approved | Ticket or document reference | Matrix of allowed streams per peer approved |
| Exported streams identified | Message catalog or stream list | facts.operational, facts.audit, etc. defined |
| Artifact access policy defined | Document reference | Which artifact types accessible per peer |
| Monitoring available on both sides | Dashboard verification | Metrics and alerts visible for both zones |
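The "halt if any precondition fails" rule can be enforced mechanically for the checks that are scriptable. The following is a minimal sketch, not part of any BSFG tooling: `require` and `preconditions` are hypothetical helper names, and the `CERT` and `CA` variables default to the paths used elsewhere in this runbook. Checks that depend on tickets, documents, or dashboards still need manual confirmation.

```shell
# Hypothetical helper: run one precondition check and report the result.
# Usage: require "<description>" <command> [args...]
require() {
    desc=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'PASS  %s\n' "$desc"
    else
        printf 'FAIL  %s\n' "$desc" >&2
        return 1
    fi
}

# Run the scriptable checks in order; the && chain stops at the first failure.
preconditions() {
    require "local certificate readable" test -r "${CERT:-/opt/bsfg/certs/server.crt}" &&
    require "trust anchor readable" test -r "${CA:-/opt/bsfg/certs/ca.crt}" &&
    require "local certificate not expired" \
        openssl x509 -in "${CERT:-/opt/bsfg/certs/server.crt}" -noout -checkend 0
}
```

Run `preconditions || exit 1` at the top of a bring-up script so that no later step executes after a failed check.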
2. Inputs Required
| Input | Description | Example |
|---|---|---|
| ZONE_A_IDENTITY | Initiating zone name and certificate identity | enterprise, identity enterprise-bsfg (subject or SAN per policy) |
| ZONE_B_IDENTITY | Target zone name and certificate identity | plant-a, identity plant-a-bsfg (subject or SAN per policy) |
| ZONE_A_ENDPOINT | Zone A VIP and port | 10.1.1.10:9443 |
| ZONE_B_ENDPOINT | Zone B VIP and port | 10.3.1.10:9443 |
| TRUST_CHAIN | CA certificate or cross-signed trust anchor | /opt/bsfg/certs/ca.crt |
| AUTHORIZATION_MATRIX | Approved stream and artifact permissions | Document or config reference |
| EXPORTED_STREAMS | List of streams Zone A exports to Zone B | facts.operational, facts.batch_completed |
| CURSOR_INIT_MODE | How to initialize cursor for this relationship | bounded_backfill_24h (default), start_now, full_backfill, or explicit timestamp |
| ARTIFACT_ACCESS_RULES | Which artifact types accessible | batch-files: read, documents: read |
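Before starting Section 3, it can help to confirm that every input is actually populated. The sketch below assumes the inputs are held as shell variables named exactly as in the table; `missing_inputs` and `valid_endpoint` are hypothetical helper names, and the host:port check is a basic sanity test rather than a BSFG-defined format.

```shell
# Print the name of every runbook input that is unset or empty.
missing_inputs() {
    for name in ZONE_A_IDENTITY ZONE_B_IDENTITY ZONE_A_ENDPOINT ZONE_B_ENDPOINT \
                TRUST_CHAIN AUTHORIZATION_MATRIX EXPORTED_STREAMS \
                CURSOR_INIT_MODE ARTIFACT_ACCESS_RULES; do
        eval "value=\${$name:-}"   # indirect lookup of the variable called $name
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Return 0 if the argument looks like host:port with a port in 1..65535.
valid_endpoint() {
    host=${1%:*}
    port=${1##*:}
    [ -n "$host" ] && [ "$host" != "$1" ] || return 1   # no colon, or empty host
    case $port in
        ''|*[!0-9]*) return 1 ;;                         # port must be numeric
    esac
    [ "$port" -ge 1 ] && [ "$port" -le 65535 ]
}
```

An empty result from `missing_inputs`, plus `valid_endpoint "$ZONE_A_ENDPOINT"` and `valid_endpoint "$ZONE_B_ENDPOINT"` both succeeding, indicates the inputs are complete enough to proceed.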
3. Trust and Identity Setup
3.1 Certificate Deployment Verification
On Zone A (and symmetrically on Zone B):
# Verify local certificate
openssl x509 -in /opt/bsfg/certs/server.crt -noout -subject -dates
# Expected: subject=CN = enterprise-bsfg, notBefore valid, notAfter future
# Verify peer CA trust (conceptual)
# Trust anchor must be present and must anchor peer certificate chain
# Example verification approaches:
# - openssl verify -CAfile /opt/bsfg/certs/ca.crt /path/to/peer/cert.pem
# - Verify via configured TLS client during health probe
# - Check certificate chain in TLS handshake output
Expected result: Local certificate valid; trust anchor present and correctly anchors peer certificate chain; peer identity matches configured zone identity policy.
Halt if: Certificate expired, identity does not match policy, chain broken, or trust anchor missing.
3.2 Peer Identity Verification
From Zone A, verify Zone B identity:
# Extract and verify Zone B certificate via TLS handshake
echo | openssl s_client -connect 10.3.1.10:9443 -servername plant-a-bsfg 2>/dev/null | openssl x509 -noout -subject -ext subjectAltName
# Expected: subject or SAN contains identity matching zone policy (e.g., plant-a-bsfg)
From Zone B, verify Zone A identity (symmetric):
echo | openssl s_client -connect 10.1.1.10:9443 -servername enterprise-bsfg 2>/dev/null | openssl x509 -noout -subject -ext subjectAltName
# Expected: subject or SAN contains identity matching zone policy (e.g., enterprise-bsfg)
Expected result: Peer certificate identity (subject or SAN) matches configured zone identity policy.
Halt if: Identity mismatch, certificate chain broken, or hostname verification fails.
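The identity comparison above can be scripted instead of read by eye. The sketch below is an assumption-laden helper, not a BSFG command: `identity_matches` is a hypothetical name, it expects the text printed by `openssl x509 -noout -subject -ext subjectAltName` on stdin, and it only recognizes a CN in the subject or a DNS entry in the SAN. If zone policy matches on other fields (URI SANs, full DN), adapt accordingly.

```shell
# Return 0 if stdin (openssl subject/SAN text) contains the expected identity
# as an exact CN value or an exact DNS SAN entry.
identity_matches() {
    want=$1
    # Split comma-separated fields onto their own lines and trim whitespace,
    # then require a whole-line, fixed-string match so "plant-a-bsfg-evil"
    # does not satisfy "plant-a-bsfg".
    tr ',' '\n' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
    grep -F -x -q \
        -e "CN = $want" -e "CN=$want" \
        -e "subject=CN = $want" -e "subject=CN=$want" \
        -e "DNS:$want"
}
```

To use it, append `| identity_matches plant-a-bsfg` to the `openssl x509` pipeline from Zone A, and the equivalent with `enterprise-bsfg` from Zone B. A non-zero exit status is a halt condition.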
3.3 Revocation and Validity Policy Check
# Verify certificate not revoked (if CRL/OCSP configured)
openssl x509 -in /opt/bsfg/certs/server.crt -noout -ocsp_uri # Check if OCSP available
# If OCSP available: verify with ocsp command
# Verify certificate validity window
openssl x509 -checkend 2592000 -noout -in /opt/bsfg/certs/server.crt
# Expected: exit 0 (30+ days remaining)
Expected result: Certificate valid, not near expiry, not revoked.
Halt if: Expiry < 30 days or revocation detected.
4. Authorization Setup
4.1 Peer Allow-List Configuration
On Zone A, install rendered peer authorization configuration:
# Install managed configuration for peer authorization
# /opt/bsfg/config/peers.yaml or equivalent managed config artifact
# Example structure (illustrative only):
# peers:
#   - id: plant-a
#     identity: plant-a-bsfg
#     endpoint: 10.3.1.10:9443
#     authorized: true
#     streams:
#       - facts.operational
#       - facts.batch_completed
#     artifacts:
#       - batch-files
#       - documents
# Apply using approved configuration management:
# - Configuration management system (Ansible, Puppet, etc.)
# - Kubernetes ConfigMap/Secret if applicable
# - Manual install with change control approval
On Zone B, configure Zone A as authorized peer (symmetric or asymmetric as policy requires):
# Same approach: install rendered managed configuration
# Example structure (illustrative only):
# peers:
#   - id: enterprise
#     identity: enterprise-bsfg
#     endpoint: 10.1.1.10:9443
#     authorized: true
#     streams:
#       - facts.orders
#       - facts.shipments
#     artifacts:
#       - order-files
Expected result: Peer allow-list installed and active, streams and artifacts explicitly authorized, unauthorized peers denied.
Verification:
- Query active configuration to confirm peer present in allow-list
- Attempt connection from unauthorized peer and verify denial
Halt if: Authorization matrix undefined, peer not in allow-list, or config management error.
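Where the active configuration is a file shaped like the illustrative peers.yaml above, the "peer present in allow-list" check can be approximated with a short awk scan. This is a sketch under that assumption only: `peer_allows_stream` is a hypothetical helper, it relies on the illustrative key names (`id`, `streams`, `artifacts`), it ignores the `authorized` flag, and it is not a YAML parser. Prefer the admin query of your BSFG realization if one exists.

```shell
# peer_allows_stream CONFIG_FILE PEER_ID STREAM
# Return 0 if the peer entry with the given id lists the given stream.
peer_allows_stream() {
    awk -v id="$2" -v stream="$3" '
        # A new list entry "- id: <name>" starts a peer block.
        $1 == "-" && $2 == "id:"      { in_peer = ($3 == id); section = ""; next }
        in_peer && $1 == "streams:"   { section = "streams"; next }
        in_peer && $1 == "artifacts:" { section = "artifacts"; next }
        # Only list items under "streams:" of the selected peer count.
        in_peer && section == "streams" && $1 == "-" && $2 == stream { found = 1 }
        END { exit !found }
    ' "$1"
}
```

For example, `peer_allows_stream /opt/bsfg/config/peers.yaml plant-a facts.operational` should succeed on Zone A once the configuration from this section is installed, and the same call with a stream outside the authorization matrix should fail.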
4.2 Stream Export Permissions
Verify exported streams are correctly configured and BSFG authorization policy allows export:
# On Zone A: verify stream exists and durability is correct
nats stream info facts.operational --server nats://localhost:4222
# Expected: Stream exists, replicas: 3, durability confirmed
# Verify BSFG authorization policy (via config or admin query)
# Check that peer is authorized for this stream per peers.yaml or equivalent
Expected result: Streams exist, durability confirmed, BSFG authorization policy allows export to peer.
Note: NATS stream configuration shows substrate durability; federation authorization is enforced by BSFG policy layer.
4.3 Denial Behavior Verification
Test that unauthorized access is rejected:
# Attempt connection from unauthorized source (or simulate wrong identity)
# Using wrong certificate or untrusted CA:
curl --cacert /opt/bsfg/certs/ca.crt \
--cert /wrong/cert.pem --key /wrong/key.pem \
https://10.3.1.10:9443/health
# Expected: TLS handshake failure or application-layer rejection
# Or attempt without client cert (if mutual TLS required):
curl --cacert /opt/bsfg/certs/ca.crt https://10.3.1.10:9443/health
# Expected: TLS handshake failure (client cert required)
Expected result: Unauthorized peers cannot establish TLS or are rejected at application layer.
5. Cursor Initialization Policy
5.1 Select Initialization Mode
| Mode | When to Use | Implication |
|---|---|---|
| bounded_backfill_24h (default) | Normal production bring-up | Replays only the configured lookback window; earlier history is not requested unless separately backfilled |
| bounded_backfill_Nh | Known recent start point | Replay N hours; operator specifies N |
| start_now | Greenfield streams, no history needed | No backfill; only facts from now forward |
| full_backfill | Disaster recovery, complete reconstruction | Replay all history; may be massive |
| explicit_timestamp | Specific recovery point | Operator provides ISO timestamp |
Default: bounded_backfill_24h unless explicitly overridden per-stream.
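To make the implication of each mode concrete, the sketch below maps a mode name to the point in time from which replay would begin. It is illustrative only and does not describe how any BSFG realization computes cursors: `replay_start` is a hypothetical helper, times are epoch seconds, `0` is used here to mean "start of retained history", and for explicit_timestamp the caller is assumed to have already converted the ISO timestamp to epoch seconds.

```shell
# replay_start MODE NOW_EPOCH [EXPLICIT_EPOCH]
# Print the epoch second from which replay would start for the given mode.
replay_start() {
    mode=$1
    now=$2
    case $mode in
        start_now)
            printf '%s\n' "$now" ;;
        full_backfill)
            printf '0\n' ;;                       # assumption: 0 = all retained history
        explicit_timestamp)
            [ -n "${3:-}" ] || return 1           # operator must supply the value
            printf '%s\n' "$3" ;;
        bounded_backfill_*h)
            hours=${mode#bounded_backfill_}
            hours=${hours%h}
            case $hours in ''|*[!0-9]*) return 1 ;; esac
            printf '%s\n' "$((now - hours * 3600))" ;;
        *)
            return 1 ;;                           # unknown mode
    esac
}
```

For instance, `replay_start bounded_backfill_24h "$(date +%s)"` prints a start point 86400 seconds in the past. Facts older than that are outside the window, which is the situation the Halt condition in 5.2 guards against.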
5.2 Configure Cursor Initialization
On receiving zone (Zone B for Zone A→B flow), install cursor initialization configuration:
# Install managed configuration for cursor initialization
# /opt/bsfg/config/cursors.yaml or equivalent managed config artifact
# Example structure (illustrative only):
# cursors:
#   - peer: enterprise
#     stream: facts.operational
#     init_mode: bounded_backfill_24h
#     # Alternative: explicit_timestamp with value
#     # init_timestamp: "2025-01-15T10:00:00Z"
# Apply using approved configuration management
Expected result: Cursor initialization policy documented, configured, and active.
Verification: Query active configuration to confirm policy applied.
Halt if: Policy undefined, contradicts business requirements (e.g., required history outside backfill window), or config management error.
5.3 Per-Stream Override Capability
Some streams may require different initialization. Document and install per-stream overrides:
# Example per-stream overrides (illustrative structure):
#
# Critical audit stream: full backfill
#   - peer: enterprise
#     stream: facts.audit
#     init_mode: full_backfill
#     justification: compliance requirement
#
# High-volume telemetry: start now
#   - peer: enterprise
#     stream: facts.telemetry
#     init_mode: start_now
#     justification: volume too high, only recent data valuable
# Install via approved configuration management with documented justification
Expected result: Per-stream overrides documented with justification and installed.
6. Initial Federation Bring-Up
6.1 Health Handshake
From Zone B (receiving), verify Zone A health with mutual TLS:
# Use approved health probe with client certificate and CA trust
curl --cacert /opt/bsfg/certs/ca.crt \
--cert /opt/bsfg/certs/server.crt --key /opt/bsfg/certs/server.key \
https://10.1.1.10:9443/health
# Expected: 200 OK, JSON with zone identity and health status
From Zone A, verify Zone B health (symmetric):
curl --cacert /opt/bsfg/certs/ca.crt \
--cert /opt/bsfg/certs/server.crt --key /opt/bsfg/certs/server.key \
https://10.3.1.10:9443/health
Expected result: Both zones healthy, identities confirmed, TLS trust verified (not skipped).
Halt if: Health check fails, identity mismatch, TLS trust failure, or authentication error.
6.2 Authorization Verification
Test that authorized streams are accessible:
# Zone B queries Zone A for available streams (if query primitive available)
# Or: attempt first fetch and verify authorization succeeds
Expected result: Authorization allows configured streams, denies others.
6.3 First Fetch
Initiate first cursor-based fetch from Zone B to Zone A using the approved BSFG operator interface:
# Using the approved BSFG operator interface (CLI or API), initiate first fetch.
# The exact command depends on your deployed BSFG realization.
#
# Example (illustrative only):
# bsfg fetch --peer enterprise --stream facts.operational --cursor-init bounded_backfill_24h
#
# Or via API if available:
# curl --cacert /opt/bsfg/certs/ca.crt --cert /opt/bsfg/certs/server.crt --key /opt/bsfg/certs/server.key \
# -X POST https://10.1.1.10:9443/v1/fetch \
# -H "Content-Type: application/json" \
# -d '{"stream":"facts.operational","cursor_policy":"bounded_backfill_24h"}'
Expected result: Fetch succeeds, facts returned, no authorization error.
Note: Use the interface (CLI, API, or control plane) defined by your BSFG realization. The examples above are illustrative.
Halt if: Authorization denied, stream not found, or cursor initialization rejected.
6.4 First Durable Append
Verify fetched facts are durably appended to Zone B's local inbound durable store (inward-facing boundary role):
# Check local durable store for new facts
# Example using NATS JetStream (if that's your substrate realization):
nats stream info facts.operational --server nats://localhost:4222
# Expected: Messages count increased, LastSeq advanced
# Verify cursor position advanced using approved interface
# Example (illustrative): bsfg cursor query --peer enterprise --stream facts.operational
# Expected: Cursor position > initial, matches last durable append
Expected result: Facts durably appended to configured inbound realization, cursor advanced, monotonic progress confirmed.
Note: IFB (inward-facing boundary) is a logical role; your substrate realization may use different concrete names.
6.5 Cursor Advancement Confirmation
Verify cursor semantics:
# Cursor represents durable local append, not just fetch
# Re-query cursor
bsfg cursor query --peer enterprise --stream facts.operational
# Verify matches JetStream state
nats consumer info facts.operational enterprise-from-plant-a --server nats://localhost:4222
# Expected: Delivered matches cursor, AckFloor matches or lags (processing separate)
Expected result: Cursor monotonically advanced, durable position confirmed.
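The two properties checked in this step (the cursor never moves backward, and it never runs ahead of what is durably appended) can be expressed as a small comparison. The sketch below is a hypothetical helper, `cursor_check`, that takes three numbers the operator has already read from the interfaces above. It assumes the local cursor and the local store's last sequence share one number space, which depends on your BSFG realization.

```shell
# cursor_check PREVIOUS_CURSOR CURRENT_CURSOR LAST_DURABLE_SEQ
# Print a verdict and return non-zero if cursor semantics are violated.
cursor_check() {
    prev=$1
    cur=$2
    last=$3
    if [ "$cur" -lt "$prev" ]; then
        echo "regressed"            # cursor moved backward: not monotonic
        return 1
    fi
    if [ "$cur" -gt "$last" ]; then
        echo "ahead-of-durable"     # cursor claims more than was durably appended
        return 1
    fi
    echo "ok"
}
```

Record the cursor before the first fetch, then call `cursor_check` with that value, the re-queried cursor, and the LastSeq reported by the local durable store. Any verdict other than `ok` should be treated as a failed step.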
6.6 Advisory Notification Test (Optional)
If using push notifications for latency optimization:
# Zone A sends advisory (simulated or actual)
# Zone B receives notification and initiates early fetch
# Verify notification received (if monitoring available)
grep -r "notify_available" /var/log/bsfg/ # or metric
# Verify fetch initiated promptly after notification
Expected result: Notification received, fetch initiated, but correctness does not depend on notification (polling would also work).
7. Artifact Retrieval Validation
7.1 Fact References Artifact
Identify a fact that references an artifact:
# Inspect fetched facts for artifact references
nats consumer next facts.operational enterprise-from-plant-a --server nats://localhost:4222
# Look for artifact_uri field in fact body
Expected result: Fact JSON contains artifact_uri or equivalent reference.
7.2 Artifact Fetch
Retrieve referenced artifact from peer zone:
# Fetch artifact (via BSFG or direct object store if redirected)
bsfg artifact fetch --uri s3://enterprise-bsfg-artifacts/batch-files/2025/001/batch-123.json
# Or via API:
curl --cacert /opt/bsfg/certs/ca.crt --cert /opt/bsfg/certs/server.crt --key /opt/bsfg/certs/server.key \
-X GET "https://10.1.1.10:9443/v1/artifacts?uri=s3://enterprise-bsfg-artifacts/..."
Expected result: Artifact retrieved, content matches reference, integrity verified (content-addressed or checksum).
7.3 Integrity and Identity Policy Validation
Verify artifact integrity:
# If content-addressed: verify hash matches
sha256sum downloaded-file # Compare to reference in fact
# If redirected to object store: verify signature/checksum
Expected result: Artifact integrity confirmed, policy enforced.
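For content-addressed artifacts, the comparison can be scripted so that a mismatch produces a non-zero exit status instead of relying on visual inspection. A minimal sketch, assuming the fact carries a lowercase hex SHA-256 digest; `verify_sha256` is a hypothetical helper name, and how the digest is extracted from the fact is left to your message catalog.

```shell
# verify_sha256 FILE EXPECTED_HEX_DIGEST
# Return 0 on match, 1 on mismatch, 2 if the file cannot be read.
verify_sha256() {
    file=$1
    expected=$2
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha256sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}
```

Usage: `verify_sha256 downloaded-file "$digest_from_fact" || echo "integrity check FAILED"`.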
7.4 Missing Artifact Behavior Test
Test handling of missing artifact:
# Request non-existent artifact
bsfg artifact fetch --uri s3://enterprise-bsfg-artifacts/batch-files/invalid/nonexistent.json
# Expected: 404 or equivalent, retry scheduled, alert generated
Expected result: Graceful degradation, retry with backoff, operational alert.
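"Retry with backoff" here means the interval between attempts grows after each failure, up to a bounded number of attempts. The sketch below illustrates that behavior with a generic wrapper; it is not how BSFG schedules retries internally, and `retry_with_backoff` is a hypothetical helper that may be useful when scripting operator-side probes.

```shell
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it succeeds, doubling the delay after each failure.
retry_with_backoff() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))          # exponential backoff
        attempt=$((attempt + 1))
    done
}
```

With `retry_with_backoff 4 5 <command>`, a persistently failing command is attempted four times with waits of 5, 10, and 20 seconds between attempts before the wrapper gives up. That final failure is the point at which an operational alert is expected.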
8. Partition and Recovery Drill (Controlled / Maintenance-Window Only)
Warning: This drill simulates a network partition using firewall rules. Execute only in one of the following:
- Lab/test environments
- Scheduled maintenance windows with explicit change control
In either case, have a rollback plan documented and ready before starting.
8.1 Simulate Peer Unreachability
Block connectivity from Zone B to Zone A using approved network administration procedures:
# Example: block outbound to Zone A VIP (illustrative)
# Use your approved network administration interface:
# - iptables (example shown, use with caution)
# - Network ACLs
# - Administrative partition command if available
# Example (illustrative only):
# iptables -A OUTPUT -d 10.1.1.10 -j DROP
# Or if your BSFG realization provides partition simulation:
# bsfg admin partition --peer enterprise --reason "drill"
# Always have rollback ready:
# iptables -D OUTPUT -d 10.1.1.10 -j DROP # to restore
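One way to keep the rollback "ready" is to pair the block and its removal in a single wrapper, so the rule is removed whenever the drill command returns, whether it succeeded or failed. This is a sketch under the same iptables assumption as the example above: `with_partition` is a hypothetical helper, the `FW` variable exists only so the firewall command can be substituted, and the wrapper does not cover interruption by a signal. Keep the manual rollback command at hand regardless.

```shell
# with_partition PEER_IP COMMAND [ARGS...]
# Block outbound traffic to PEER_IP, run COMMAND, then always remove the block.
with_partition() {
    peer_ip=$1
    shift
    # Install the block; abort the drill if it cannot be installed.
    "${FW:-iptables}" -A OUTPUT -d "$peer_ip" -j DROP || return 1
    "$@"
    rc=$?
    # Roll back unconditionally; report loudly if the rollback itself fails.
    "${FW:-iptables}" -D OUTPUT -d "$peer_ip" -j DROP ||
        echo "ROLLBACK FAILED for $peer_ip: remove the rule manually" >&2
    return "$rc"
}
```

For example, `with_partition 10.1.1.10 ./run-autonomy-checks.sh` would hold the partition only for the duration of the checks in 8.2 and 8.3; the script name is a placeholder for whatever drives those steps in your environment.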
8.2 Verify Local Autonomy
On Zone B, verify local durable work continues:
# Producer append to local ESB must succeed
nats pub facts.operational.test "{\"test\": \"partition-drill\"}" --server nats://localhost:4222
# Expected: OK
# Local consumer from IFB must continue
nats consumer next facts.operational local-consumer --server nats://localhost:4222
# Expected: Facts available (may be stale if no local production)
Expected result: Local autonomy preserved, no blocking on remote unavailability.
8.3 Verify Backlog Accumulation
Monitor outbound buffer (ESB) growth:
nats stream info facts.operational --server nats://localhost:4222
# Expected: Messages count increasing (if producers active)
# Or: specific ESB/EFB metrics showing accumulation
Expected result: Backlog accumulates for affected peer relationship, no data loss.
8.4 Restore Connectivity
Remove block or end administrative partition:
iptables -D OUTPUT -d 10.1.1.10 -j DROP
# Or: bsfg admin reconcile --peer enterprise
8.5 Verify Cursor Reconciliation
Monitor automatic recovery:
# Watch logs for reconciliation
journalctl -u bsfg-controller -f | grep -E "(reconcile|cursor|replay)"
# Expected sequence:
# - "peer enterprise reachable"
# - "cursor comparison: local=X, peer=Y"
# - "backfill required: Y-X facts"
# - "replay initiated"
# - "cursor advanced to Y"
Expected result: Automatic cursor comparison, gap detection, backfill initiation.
8.6 Verify Replay/Backfill
Confirm facts replayed successfully:
# Check that missing facts were backfilled
nats stream info facts.operational --server nats://localhost:4222
# Expected: Messages count includes backfilled facts, no duplicates (idempotent)
# Verify cursor recovery (use approved interface)
# Example (illustrative): bsfg cursor query --peer enterprise --stream facts.operational
# Expected: Cursor has advanced monotonically from pre-recovery value;
# backlog cleared or decreasing; no duplicate side effects observed
Expected result: Backfill complete, cursor monotonically advanced from pre-recovery position, no destructive re-init required.
Note: Cursor values may not be directly comparable across zones; verify monotonic recovery and completeness, not equality.
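The "no duplicates" expectation can be spot-checked if the facts carry a unique identifier. The sketch below assumes such identifiers have already been extracted from the local store into one ID per line; the extraction step and the field name depend on your message catalog and are not defined by this runbook. `find_duplicate_ids` is a hypothetical helper.

```shell
# Read fact IDs from stdin, one per line.
# Print each ID that occurs more than once; return non-zero if any do.
find_duplicate_ids() {
    dups=$(sort | uniq -d)
    if [ -z "$dups" ]; then
        return 0
    fi
    printf '%s\n' "$dups"
    return 1
}
```

An empty result after the backfill supports the idempotency claim; any printed ID identifies a fact that was appended more than once and should be investigated before handoff.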
8.7 Verify No Destructive Re-Init
Confirm local state preserved using continuity-based checks:
# Verify stream/store identity persisted (not replaced)
nats stream info facts.operational --server nats://localhost:4222
# Expected: Stream identity stable, FirstSeq <= previous FirstSeq (no reset)
# LastSeq > previous LastSeq, sequences monotonic
# Verify no operator-triggered reset occurred
# Check configuration management logs for unapproved changes
# Logs may provide supporting evidence but are not primary proof:
# grep -ri "wipe\|reset\|re-initialize" /var/log/bsfg/ # Supporting only
Expected result: Normal reconciliation, state continuity preserved, no state destruction.
9. Handoff Criteria
The federation relationship is accepted only when all criteria pass:
| Criterion | Verification | Required | Status |
|---|---|---|---|
| Authentication works | mTLS handshake succeeds, peer identity verified per policy | Yes | [ ] |
| Authorization works | Allowed streams fetch successfully, denied streams rejected | Yes | [ ] |
| Replay works | Fetch returns facts, cursor advances | Yes | [ ] |
| Cursor advances correctly | Monotonic, durable, matches local append | Yes | [ ] |
| Duplicate replay harmless | Idempotent append confirmed (re-fetch same cursor, no duplicates) | Yes | [ ] |
| Partition recovery works | Simulated partition, autonomous operation, clean reconciliation | Yes | [ ] |
| Artifact retrieval works | Referenced artifacts fetchable, integrity verified | Where enabled by policy | [ ] |
| Alerts/metrics visible | Replication lag, backlog, auth failures visible in monitoring | Yes | [ ] |
| Cursor initialization policy documented | Per-stream init mode recorded with justifications | Yes | [ ] |
| Authorization matrix documented | Peer allow-list, stream permissions, artifact access recorded | Yes | [ ] |
| Duplicate replay drill performed | Explicit re-fetch test confirms idempotency | Where exercised here; may be deferred to validation checklist | [ ] |
Sign-off:
| Role | Name | Date | Signature |
|---|---|---|---|
| Federation Engineer | |||
| Zone A Platform Lead | |||
| Zone B Platform Lead | |||
| Security/Compliance (if required) |
10. Post-Bring-Up Reference
| Document | Purpose |
|---|---|
| Checklist: Cross-Zone Federation Validation | Formal acceptance and audit |
| Runbook: Triad-HA Zone Deployment | Add more zones to federation |
| Reference Interaction Pattern: Cross-Zone BSFG Federation | Architecture reference |
| Reference Deployment Pattern: Triad-HA | Intra-zone substrate reference |