Reference Interaction Pattern: Cross-Zone BSFG Federation
Pattern Name
Cross-Zone BSFG Federation — Reference interaction pattern for asynchronous, authenticated exchange among autonomous BSFG zones.
Classification
| Attribute | Value |
|---|---|
| Layer | Inter-zone interaction |
| Kind | Reference interaction pattern |
| Scope | Two or more autonomous BSFG zones |
| Consistency model | Eventual, asynchronous, cursor-driven |
| Availability model | No zone depends on remote zone reachability to accept local durable work |
| Security model | mTLS-authenticated peer federation with explicit authorization |
| Recovery model | Replay and reconciliation from durable checkpoints |
Intent
Defines the architectural contract by which autonomous BSFG zones exchange durable state, recover from partitions, and re-establish monotonic progress without introducing cross-zone availability dependencies or shared control-plane assumptions.
This pattern complements the Triad-HA deployment pattern, which specifies how one zone survives its own failures. Cross-Zone BSFG Federation specifies how autonomous zones interact when they cannot rely on each other being reachable.
Key Distinctions
| Distinction | Meaning |
|---|---|
| Intra-zone quorum ≠ cross-zone consistency | RAFT-backed durability within a zone does not imply synchronous consistency across zones |
| Notification ≠ durable acceptance | A peer's advisory signal does not constitute confirmed durable receipt |
| Connectivity ≠ authorization | Network reachability does not imply permission to exchange data |
| Replay ≠ conflict-free merge | Recovery replays from checkpoint; it does not reconcile divergent mutable state |
| Artifact recovery ≠ fact replication | Large binary artifacts use distinct fetch semantics from message stream replay |
| Peer federation ≠ cluster formation | Zones cooperate; they do not form a single distributed system with shared control plane |
| Durable append ≠ downstream processing completion | A fact may be durably replicated into a zone before local consumers have processed it |
Applicability
Use this pattern when:
- Zones must remain operationally autonomous during network partitions
- Cross-zone exchange must not block local durable work
- Peers may be slow, intermittently reachable, or offline for extended periods
- Recovery from outage must not require destructive resynchronization
- Hub-and-spoke, chain, or selective mesh federation variants are required
- Enterprise/IDMZ/Plant boundaries must be traversed without shared middleware
Non-Goals
This pattern explicitly does not provide:
- Intra-zone failover mechanics (see Triad-HA)
- Local host sizing or JetStream clustering details
- Kubernetes, service meshes, or distributed orchestration primitives
- Cross-zone synchronous commit or two-phase transaction semantics
- Globally consistent total ordering across all zones
- Cross-zone consensus, leader election, or shared control plane
- Automatic conflict resolution for divergent mutable state
- Guaranteed artifact availability during initial cross-zone handshake
Invariants
The following invariants must hold in all deployments using this pattern:
Local autonomy under partition — A zone accepts locally durable work without requiring remote zone availability.
Asynchronous exchange — Cross-zone propagation is explicitly non-blocking; no zone waits for peer acknowledgment before local commit.
Idempotent replay — Replayed or re-delivered facts must be harmless to idempotent consumers.
Cursor-driven recovery — Post-partition reconciliation advances from last durable checkpoint, not from arbitrary state comparison.
No global ordering — Zones do not assume or enforce total order across zone boundaries; ordering is scoped to an exported stream.
Artifact/fact separation — Binary artifacts and fact messages use distinct durability and retrieval semantics.
Autonomous mode persistence — Partitioned zones continue local operation using ISB/IFB/ESB/EFB without cross-zone coordination.
Non-destructive reconnection — Rejoining after partition must not require wiping local state or full re-initialization.
Durable receipt precedes progress publication — A zone must not advance or advertise cursor progress beyond what is durably appended locally.
Interaction Model
Cursor Semantics
A cursor is a monotonically advancing, durable checkpoint representing the highest cross-zone fact position that a receiving zone has durably appended for a specific exported stream from a specific peer.
A cursor is:
- scoped to one receiver, one sender, and one exported stream
- advanced only after durable local append
- not equivalent to consumer processing completion
- not a global sequence number across zones
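The scoping and advancement rules above can be sketched as a small data structure. The names here (ZoneCursor, advance) are illustrative, not part of any BSFG API; the point is that a cursor is keyed by receiver, sender, and exported stream, and only ever moves forward.

```python
from dataclasses import dataclass

@dataclass
class ZoneCursor:
    """Durable checkpoint scoped to one receiver, one sender, one exported stream."""
    receiver_zone: str
    sender_zone: str
    exported_stream: str
    position: int = 0  # highest cross-zone fact position durably appended locally

    def advance(self, new_position: int) -> None:
        # Advance only after durable local append; regression violates the contract.
        if new_position < self.position:
            raise ValueError(f"cursor regression: {new_position} < {self.position}")
        self.position = new_position

# Example: cursor for facts exported by plant-a, received by enterprise
cursor = ZoneCursor("enterprise", "plant-a", "production-events")
cursor.advance(42)  # called only after facts up to position 42 are durably appended
```

Note that this position says nothing about local consumer processing; it records durable append only.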
Cursor Initialization
Each peer/exported-stream relationship must define an explicit cursor initialization policy at federation bring-up time.
Permitted initialization forms include:
- start-now
- bounded historical backfill
- full backfill
- operator-approved seed position or timestamp
This pattern requires initialization to be explicit. Operational selection, defaulting, and justification belong in federation bring-up procedures rather than this architectural reference.
Zone Identity
Each zone possesses a stable, cryptographically bound identity:
- Zone name — Deployment-scoped identifier (e.g., enterprise, plant-a)
- Certificate identity — Peer certificate subject or SAN must match configured zone identity according to local policy
- Identity scope — Authorization policy determines which peer zones may connect
Peer Relationship Model
| Aspect | Semantics |
|---|---|
| Relationship | Explicitly configured, not auto-discovered |
| Directionality | Bidirectional capability; unidirectional data flow per exported stream |
| Cardinality | One-to-one, one-to-many, or many-to-many as configured |
| Lifecycle | Long-lived; reconnection resumes from checkpoint |
Sender vs Receiver Roles
| Role | Responsibility |
|---|---|
| Sender (originating zone) | Appends facts to local outbound boundary roles; makes them available for fetch; does not await remote confirmation |
| Receiver (target zone) | Polls or accepts advisory notification; fetches via cursor; durably appends before advancing progress |
Default Exchange Mode: Receiver-Driven Cursor-Based Fetch
The canonical cross-zone interaction is receiver-driven:
- Receiving zone maintains durable cursor position per peer and exported stream
- Receiving zone periodically initiates fetch against peer endpoint, supplying last known durable cursor
- Peer responds with facts from that position, bounded by batch constraints
- Receiving zone durably appends to local inbound boundary roles, then advances cursor
- Optional: receiving zone may emit advisory progress confirmation as a non-authoritative optimization
Advisory notification is permitted as a latency optimization:
- Sender may notify that new facts are available
- Receiver must not treat notification as durable acceptance
- Receiver still performs explicit fetch and local durable append before progress advancement
No correctness property depends on advisory notification or advisory confirmation paths.
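One receiver-driven fetch cycle can be sketched as follows. The peer and store objects here are in-memory stand-ins (FakePeer, InMemoryStore are hypothetical names), but the ordering is the contract: fetch from the last durable cursor, durably append, and only then advance the cursor. On transport failure the next cycle simply retries with the same cursor.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    message_id: str
    payload: bytes

class InMemoryStore:
    """Stand-in for the zone's durable inbound boundary role."""
    def __init__(self):
        self.facts = {}
    def put_if_absent(self, message_id, fact):
        # Idempotent append: replayed duplicates are discarded.
        self.facts.setdefault(message_id, fact)

class FakePeer:
    """Stand-in for a sender zone's fetch endpoint."""
    def __init__(self, stream_facts):
        self.stream_facts = stream_facts  # facts in exported-stream order
    def fetch_facts(self, from_position, batch):
        chunk = self.stream_facts[from_position:from_position + batch]
        return chunk, from_position + len(chunk)  # facts, next cursor

def fetch_cycle(peer, store, cursor, batch_size=100):
    """One receiver-driven cycle: fetch, durably append, then advance."""
    facts, next_cursor = peer.fetch_facts(cursor["position"], batch_size)
    for fact in facts:
        store.put_if_absent(fact.message_id, fact)  # durable append first
    cursor["position"] = next_cursor                # cursor advances afterwards
    return len(facts)

peer = FakePeer([Fact(f"evt-{i}", b"...") for i in range(5)])
store, cursor = InMemoryStore(), {"position": 0}
while fetch_cycle(peer, store, cursor, batch_size=2):
    pass
```

Because the cursor is advanced only after the append, a crash between the two steps causes re-fetch of the same batch, which the idempotent store absorbs.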
Acknowledgment Semantics
| Signal | Meaning | Reliability |
|---|---|---|
| Fact append (local) | Durable in local JetStream | Guaranteed by RAFT quorum |
| Fetch response | Facts transmitted over mTLS | At-least-once transport |
| Remote durable append | Facts durably appended in receiving zone | Reflected by cursor advancement |
| Advisory notification | Hint only; no durability claim | May be lost or reordered |
Exchange Primitives
This document defines architectural primitives and semantics, not a normative wire format, CLI shape, or endpoint schema. Concrete operator interfaces belong in runbooks and implementation documentation.
| Primitive | Direction | Durable? | Purpose |
|---|---|---|---|
| FetchFacts | Receiver → Sender | No (transport only) | Replay facts from supplied cursor |
| NotifyAvailable | Sender → Receiver | No | Advisory latency optimization |
| QueryCursor | Either → Either | No | Discover peer-reported durable cursor position |
| FetchArtifact | Receiver → Sender | No (fetch) | Retrieve binary payload by reference |
| HealthCheck | Either → Either | No | Verify reachability and identity |
| BackfillRange | Receiver → Sender | No | Replay bounded historical range or equivalent bounded recovery request on gap detection |
Primitive Semantics
FetchFacts (canonical)
- Request includes: receiver's last durable cursor and fetch bounds such as batch size and wait limit
- Response includes: facts from that position, next cursor, and end-of-stream indicator
- Retry: On transport failure, receiver retries with the same cursor
NotifyAvailable (advisory)
- Sender indicates approximate availability of new facts
- Receiver may choose to fetch immediately or continue its polling schedule
- Lost notifications are harmless; polling remains the correctness baseline
FetchArtifact
- Request uses an artifact reference from a fact, such as URI or content address
- Response returns binary payload or redirect to zone-local object store
- Content-addressed verification is recommended where applicable
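Content-addressed verification can be sketched as below. The `sha256:<hex>` reference format is an assumption for illustration; the pattern only requires that the receiver can check a fetched payload against the reference carried in the fact.

```python
import hashlib

def verify_artifact(payload: bytes, content_address: str) -> bool:
    """Check a fetched binary payload against its content address.
    Assumes addresses of the form 'sha256:<hex>'; the scheme is illustrative."""
    algo, _, expected = content_address.partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported content-address scheme: {algo}")
    return hashlib.sha256(payload).hexdigest() == expected

payload = b"binary artifact bytes"
ref = "sha256:" + hashlib.sha256(payload).hexdigest()  # reference carried in the fact
assert verify_artifact(payload, ref)         # intact payload verifies
assert not verify_artifact(b"tampered", ref) # corruption or substitution is detected
```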
Consistency and Ordering Semantics
What Is Guaranteed
| Property | Scope | Mechanism |
|---|---|---|
| Local durability | Within zone | JetStream RAFT, configured sync policy |
| Monotonic cursor | Per peer/exported-stream relationship | Durable checkpoint in receiving zone |
| Idempotent append | Cross-zone | putIfAbsent or equivalent at storage interface |
| Per-stream ordering | Within one exported stream | Stream semantics of originating zone |
What Is Not Guaranteed
| Property | Why Absent |
|---|---|
| Global total order | No cross-zone clock synchronization or sequencing service |
| Synchronous replication | Design explicitly rejects blocking on remote durability |
| Cross-zone linearizability | Zones observe each other via replay, not shared memory |
| Immediate artifact availability | Artifacts may require separate fetch; not inlined in fact stream |
Duplicate Handling
- Facts carry a stable message_id derived from the business event
- Receiving zone storage interface enforces putIfAbsent
- Replayed duplicates are discarded at the storage layer
- Consumers must also be idempotent as defense in depth
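A sketch of the two halves of this defense, assuming a JSON-serializable business event (the field names and id-derivation scheme are illustrative): derive the message_id deterministically from the event so replay across zones reproduces the same id, then let the storage interface reject duplicates.

```python
import hashlib
import json

def message_id(event: dict) -> str:
    """Derive a stable message_id from the business event itself, so replay
    across zones reproduces the same id. Derivation scheme is illustrative."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:32]

store = {}

def put_if_absent(mid: str, event: dict) -> bool:
    """Idempotent append: returns True only for first delivery."""
    first = mid not in store
    store.setdefault(mid, event)
    return first

evt = {"order": "A-17", "qty": 3}
assert put_if_absent(message_id(evt), evt) is True   # first delivery stored
assert put_if_absent(message_id(evt), evt) is False  # replayed duplicate discarded
```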
Cursor Advancement
- Cursor represents durable local append position, not processed position
- Consumer processing lag is separate from replication cursor
- Cursor advancement is irreversible within a zone
- Cross-zone progress is monotonic per peer/exported-stream relationship, not globally synchronized
Failure Model
| Failure Class | Scenario | Required Behavior |
|---|---|---|
| Peer unreachable | Complete network partition to a specific peer | Continue local acceptance; accumulate backlog for that peer relationship; retry with backoff |
| Asymmetric reachability | A→B reachable, B→A not | Receiver on blocked side cannot fetch; sender may notify into void; no blocking; eventual retry when path restores |
| Stale/invalid certificates | mTLS handshake fails | Reject connection; alert operator; do not bypass or degrade |
| High latency / intermittent | >5s round-trip, packet loss | Exponential backoff; batch size adaptation; alert on threshold breach |
| Long partition | Hours to days of isolation from a peer | Autonomous mode persists for that peer relationship; buffers accumulate; operator alert on threshold |
| Complete peer zone loss | Peer permanently destroyed or decommissioned | Local zone continues; treat peer as unavailable until explicit replacement or re-authorization |
| Local zone survives, peers lost | Network or peer outage | Full local autonomy; no local durability degradation; outbound backlog growth monitored |
| Peer returns with stale cursor | Peer restored from older backup | Cursor comparison detects lag; automatic backfill or operator intervention |
| Peer returns with incompatible history | Non-prefix history or corrupted cursor | Halt automatic reconciliation; require operator investigation |
Failure Outcome Summary
| Outcome | Condition |
|---|---|
| Continue | Local zone always continues accepting work |
| Queue | Outbound facts accumulate for the affected peer relationship |
| Deny | Peer exchange stops; no remote blocking |
| Retry | Automatic with exponential backoff |
| Replay | On reconnection, resume from durable cursor |
| Operator | Only when invariants cannot be re-established automatically |
Partition Behavior
Entry into Autonomous Mode
Triggered per peer relationship by:
- Peer unreachable after retry threshold
- mTLS authentication failure
- Explicit administrative partition command
Local behavior:
- Local producer append continues
- Local consumer processing continues against already durable local state
- Fetch to the affected peer stops
- Cursor position for the affected peer/exported-stream relationship freezes
- Alert generated: partition_detected
During Partition
- Producers: Non-blocking append to local outbound boundary roles
- Consumers: Continue from local inbound boundary roles and may become stale relative to remote peers
- Buffers: Accumulate for the affected peer relationship
- Artifacts: References remain valid locally; fetch from unreachable peer fails
Exit from Autonomous Mode
Triggered per peer relationship by:
- Peer reachable and authenticated
- Health check passes
- Optional: explicit operator reconcile command
Recovery sequence:
- Health handshake verifies peer identity and liveness
- Cursor query compares positions
- If gap detected: backfill from lower cursor
- If incompatible history detected: operator intervention
- Normal fetch resumes
- Alert cleared: partition_resolved
Reconciliation and Recovery
Post-Partition Reconciliation
| Step | Action | Actor |
|---|---|---|
| 1 | Verify peer identity and health | Both zones |
| 2 | Query peer-reported cursor position | Receiving zone |
| 3 | Compare with local durable cursor | Receiving zone |
| 4a | If local receiver is behind: fetch from peer cursor | Receiving zone |
| 4b | If peer receiver is behind: peer fetches from local | Peer zone |
| 4c | If bounded gap detected: request backfill range | Receiving zone |
| 5 | Resume normal fetch | Both zones |
Gap Handling
- Small gap (< batch size): Automatic backfill via extended fetch
- Large gap: Explicit bounded range request or operator decision
- Cursor invalid: Operator resets cursor or performs full re-initialization; destructive actions require explicit justification
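The gap-handling rules above can be sketched as a classification step run after cursor comparison. The action names and the interpretation of a peer cursor behind the local durable cursor (treated here as a non-prefix history requiring investigation) are illustrative.

```python
def classify_gap(local_cursor: int, peer_cursor: int, batch_size: int) -> str:
    """Decide the recovery action after cursor comparison.
    Thresholds mirror the gap-handling rules; action names are illustrative."""
    if peer_cursor < local_cursor:
        # Peer reports less history than we durably appended from it:
        # non-prefix or corrupted history requires operator investigation.
        return "operator-investigation"
    gap = peer_cursor - local_cursor
    if gap == 0:
        return "in-sync"
    if gap < batch_size:
        return "auto-backfill"                  # small gap: extended fetch
    return "bounded-backfill-or-operator"       # large gap: explicit decision

assert classify_gap(100, 100, 50) == "in-sync"
assert classify_gap(100, 120, 50) == "auto-backfill"
```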
Artifact Rehydration
- Facts replicate via cursor-based replay
- Artifacts may be missing on receiving zone after failover or long partition
- Artifact fetch is on-demand or background, not inline with fact replication
- Missing artifact on fetch triggers retry with backoff, alerting, and optional background rehydrate job
- Artifact exchange may be enabled or disabled per peer relationship and policy scope; fact replay does not imply unrestricted artifact access
When Replay Is Insufficient
| Scenario | Action |
|---|---|
| Peer zone replaced after loss | Restore from backup or initialize replacement zone; require explicit authorization |
| Complete logical corruption | Operator wipe and re-initialize; replay from surviving peers or backup |
| Invariant violation detected | Halt cross-zone exchange; operator investigation |
Backpressure and Buffer Semantics
The four buffer names below refer to logical boundary roles. Implementations may realize them as one or more physical streams, stores, or queue views.
| Buffer | Direction | Cross-Zone Role | Backpressure Trigger |
|---|---|---|---|
| ISB | Inbound | Logical ingress role for accepted peer facts | Ingress fill > threshold |
| IFB | Inbound | Logical handoff role for local consumers | Consumer lag > threshold |
| ESB | Outbound | Logical egress staging for facts made available to peers | Egress fill > threshold |
| EFB | Outbound | Logical delivery-facing role used during peer transfer | Delivery-facing fill > threshold |
Buffer Thresholds and Policy
| Condition | Threshold | Policy | Alert |
|---|---|---|---|
| Ingress fill high | >80% | Reject or defer additional peer intake to preserve local capacity | Tier 1 |
| Outbound fill high | >80% | Apply producer backpressure, rejection, or defer policy per stream class | Tier 1 |
| Delivery-facing fill high | >80% | Continue fetch where possible; escalate producer backpressure if sustained | Tier 1 |
| Consumer lag high | >10,000 facts | Scale consumers or investigate downstream slowness; alert if sustained | Tier 2 |
| Cross-zone replication lag high | >60 seconds | Investigate network, peer health, or policy mismatch | Tier 1 |
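The threshold table can be sketched as a single evaluation function. The threshold values are the reference defaults from the table and should be tuned per deployment; the tier labels follow the Alert column.

```python
def evaluate_buffers(ingress_fill: float, outbound_fill: float,
                     consumer_lag: int, replication_lag_s: float):
    """Map buffer conditions to alert tiers per the reference thresholds.
    Fill values are fractions (0.0-1.0); thresholds are deployment-tunable."""
    alerts = []
    if ingress_fill > 0.80:
        alerts.append(("tier1", "ingress-fill-high"))
    if outbound_fill > 0.80:
        alerts.append(("tier1", "outbound-fill-high"))
    if consumer_lag > 10_000:
        alerts.append(("tier2", "consumer-lag-high"))
    if replication_lag_s > 60:
        alerts.append(("tier1", "replication-lag-high"))
    return alerts

assert evaluate_buffers(0.5, 0.5, 100, 5) == []  # healthy: no alerts
```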
Retry and Backoff
| Operation | Initial Interval | Backoff | Max Interval | Circuit Breaker |
|---|---|---|---|---|
| FetchFacts (normal) | 1 second | Exponential 2x | 30 seconds | After 5 consecutive failures |
| FetchFacts (partitioned) | 5 seconds | Exponential 2x | 5 minutes | Manual or health-check reset |
| Artifact fetch | 1 second | Linear + jitter | 60 seconds | Per-artifact failure tracking |
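The exponential rows above can be sketched as follows; parameters mirror the FetchFacts rows, and the CircuitBreaker class name is illustrative.

```python
def backoff_intervals(initial: float, factor: float,
                      max_interval: float, attempts: int):
    """Exponential backoff capped at max_interval."""
    interval, out = initial, []
    for _ in range(attempts):
        out.append(min(interval, max_interval))
        interval *= factor
    return out

# FetchFacts (normal): 1s initial, 2x backoff, 30s cap
assert backoff_intervals(1, 2, 30, 7) == [1, 2, 4, 8, 16, 30, 30]

class CircuitBreaker:
    """Opens after N consecutive failures; reset is manual or via health check."""
    def __init__(self, threshold: int = 5):
        self.threshold, self.failures = threshold, 0
    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

cb = CircuitBreaker()
for _ in range(5):
    cb.record(ok=False)
assert cb.open  # circuit opens after 5 consecutive failures
```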
Security Contract
Authentication
| Layer | Mechanism | Verification |
|---|---|---|
| Transport | mTLS 1.2+ | Certificate chain to shared or cross-signed CA |
| Identity | Certificate subject or SAN | Must match configured zone identity per local policy |
| Authorization | Explicit allow-list | Zone A explicitly authorized to exchange with Zone B |
Authorization Scope
- Peer matrix: Configuration specifies which zones may connect
- Stream scoping: Authorization may restrict which exported streams are visible per peer
- Artifact scoping: Artifact references may be filtered or redirected based on peer authorization
Certificate Lifecycle
| Event | Action |
|---|---|
| Rotation (planned) | Rolling restart across zone nodes; peer reconnection with new certificate |
| Expiry approaching (<30 days) | Alert Tier 2; schedule rotation |
| Expiry imminent (<7 days) | Alert Tier 1; prepare partition if rotation fails |
| Post-expiry connection attempt | Reject; alert; require operator intervention |
Trust Failure Behavior
| Scenario | Response |
|---|---|
| Unknown CA | Reject; log; alert |
| Mismatched identity | Reject; log; alert |
| Revoked certificate | Reject; log; alert; check CRL/OCSP if configured |
| Clock skew (TLS validity window) | Reject; alert; investigate NTP |
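The identity and authorization layers can be sketched as a two-step check applied after the mTLS handshake has already validated the certificate chain. The function and input names are illustrative; the essential point is that a valid certificate identity alone does not grant exchange rights.

```python
def authorize_peer(cert_sans: set, configured_zone: str, allow_list: set):
    """Two-step check: certificate identity must match the configured zone
    name, and that zone must appear in the explicit allow-list.
    Assumes chain validation already succeeded at the mTLS layer."""
    if configured_zone not in cert_sans:
        return (False, "mismatched-identity")  # reject; log; alert
    if configured_zone not in allow_list:
        return (False, "unauthorized-zone")    # connectivity != authorization
    return (True, "authorized")

allow = {"plant-a", "plant-b"}
assert authorize_peer({"plant-a"}, "plant-a", allow) == (True, "authorized")
assert authorize_peer({"rogue"}, "plant-a", allow) == (False, "mismatched-identity")
```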
Federation Variants
Chain: Enterprise ↔ IDMZ ↔ Plant
[Enterprise] ←→ [IDMZ] ←→ [Plant A]
                        ←→ [Plant B]
- Purpose: Mediated boundary with inspection or mapping zone
- Preserves: IDMZ as non-transparent relay; no direct Enterprise–Plant connectivity
- Complicates: Latency, additional cursor hop, IDMZ bottleneck risk
Hub-and-Spoke: Enterprise Center
            [Plant A]
                ↑
[Plant B] ← [Enterprise] → [Plant C]
                ↓
            [Plant D]
- Purpose: Central aggregation and control-point federation
- Preserves: Simple peer matrix; Enterprise as integration hub
- Complicates: Central load concentration; broader blast radius of Enterprise partition
Selective Mesh: Plant-to-Plant
[Plant A] ←→ [Plant B]
    ↕            ↕
[Plant C] ←→ [Enterprise]
- Purpose: Direct peer coordination where justified
- Preserves: Autonomy without mandatory hub
- Complicates: O(N²) peer matrix and multiplied partition paths
Bilateral: Two-Zone Partnership
[Zone A] ←→ [Zone B]
- Purpose: Simplest direct federation
- Preserves: All invariants with minimal complexity
- Complicates: No structural indirection or traffic isolation layer
Assisted Transfer: Intermittently Connected
[Plant A] ←→ [Satellite Link] ←→ [Enterprise]
- Purpose: High-latency or intermittently connected environments
- Preserves: Local autonomy with large backlog tolerance
- Complicates: Extended autonomous periods and large replay windows
Operational Procedures
Planned Maintenance Partition
- Verify available buffer headroom before partition
- Suspend fetch to the affected peer relationship while ensuring local autonomy remains intact
- Require authenticated health handshake and cursor reconciliation before normal flow resumes
Unplanned Partition Recovery
- Detect via health-check failure and raise alert
- Verify local autonomous operation continues
- Attempt reconnection with exponential backoff
- On reachability restore: verify identity and compare cursors
- Perform automatic backfill for small gaps; require operator decision for large gaps or incompatible history
- Resume normal fetch and clear alert
Peer Certificate Rollover
- Generate new certificate with overlapping validity
- Deploy to zone nodes via rolling restart
- Verify peer accepts new certificate
- Monitor for rejection errors
Zone Rejoin After Outage
| Scenario | Procedure |
|---|---|
| Zone restored from backup | Verify cursor position; backfill from peers; explicit re-authorization if identity changed |
| Zone rebuilt as new identity | Treat as new peer; require explicit authorization; no automatic trust |
| Zone returns with incompatible history | Operator investigation; possible wipe and re-initialize |
Backlog Drain
- Monitor drain rate after partition
- Investigate sustained high lag: network capacity, peer capacity, or policy mismatch
- Check for persistent partition or peer rejection if backlog does not drain
Recovery Validation
| Check | Method |
|---|---|
| Peer connectivity | Health check |
| Authentication | mTLS handshake success |
| Authorization | Authorization policy permits fetch for configured exported streams |
| Cursor monotonicity | Query returns expected durable cursor progression |
| Replication flow | Lag metric decreases toward zero |
| Idempotency | Duplicate replay test, optional |
Validation Note
Validation of incompatible-history handling, large-gap recovery, and partition behavior may require controlled drills. Such drills are not part of ordinary steady-state operation.
Relationship to Triad-HA
Triad-HA specifies the recommended intra-zone substrate:
- Three-node JetStream quorum
- Keepalived-based controller failover
- Host-level deployment with no Kubernetes
Cross-Zone BSFG Federation assumes each zone implements Triad-HA, or an equivalent autonomy-preserving substrate, and specifies:
- How autonomous zones interact
- What cross-zone contracts must hold
- How partition and recovery behave
Critical separation:
- Cross-zone correctness must not depend on the internal failover details of any peer
- A zone's published interaction contract — endpoints, certificates, cursor behavior, and authorization policy — is the only assumption permitted
- Peers treat each other as black boxes that honor the federation contract
References
- BSFG Architecture Map: Three-layer ontology (principle, logical, substrate)
- ADR-0001: Boundary Must Contain No Durable Middleware
- ADR-0002: Four-Buffer Topology Is the Minimal Partition-Tolerant Boundary
- ADR-0006: Boundary Communication Is Asynchronous Replay
- ADR-0011: Boundary Identity Uses Mutual TLS
- ADR-0029: Cross-Zone Synchronization Uses BSFG Peer Protocol, Not Native Stream Mirroring
- ADR-0032: Cross-Zone Transfer Is Pull-Driven by the Receiving Zone
- ADR-0042: Four-Buffer Entities Are Boundary Roles Implemented by BSFG Nodes
- Triad-HA Deployment Pattern: Intra-zone substrate realization
- NATS JetStream clustering and replication documentation