Audience: Platform engineers, implementers. Use: Understand cross-zone replication flow, cursors, confirmations, and recovery behavior.
Context: Multi-Zone Deployments
Industrial deployments typically span multiple zones separated by trust, network, and operational boundaries:
Enterprise Zone         IDMZ                   Plant OT Zone
(IT Systems)            (Mediation Layer)      (Operational Equip)
     │                       │                      │
     ├── BSFG Node           ├── BSFG Node          ├── BSFG Node
     ├── Durable             ├── Durable            ├── Durable
     │   Fact Store          │   Fact Store         │   Fact Store
     └── Apps                └── Gateway            └── PLC/Historian
Each zone runs a local BSFG node with its own durable substrate implementing the BSFG storage ports. Zones are operationally independent: each can survive the failure or disconnection of any other zone.
Yet facts must flow between zones reliably. The mechanism for this flow is the peer replication protocol — a receiver-driven replay mechanism where BSFG nodes pull facts from each other asynchronously, confirm receipt, and advance shared cursors.
Protocol Overview
The peer replication protocol is fundamentally pull-driven. The receiving zone initiates fetch requests; the sending zone returns facts without pushing.
Basic Replication Relationship
┌────────────────────┐
│ Sending Zone │ (holds authoritative facts)
│ BSFG Node │
│ │
│ ISB (store) ─┐ │
│ │ │
│ IFB ──────────┼─→ Cursor Tracker (frontier)
└────────────────┼──┘
│
FetchFacts(consumer_name, limit)
│
▼
┌────────────────────┐
│ Receiving Zone │ (pulls facts)
│ BSFG Node │
│ │
│ ESB (store) ───┐ │
│ │ │
│ EFB ────────── ├──→ Cursor Tracker
└────────────────┼───┘
│
ConfirmReceipt(consumer_name, offset)
│
▼
Cursor advances
(frontier moves)
Key Characteristics
- Receiver-driven (Pull): The receiving zone calls FetchFacts when ready. The sending zone does not push.
- Passive Sender: The sending zone holds facts durably and responds to fetch requests. No state change until confirmation arrives.
- Asynchronous: Sending zone does not wait for consumer processing. Receiving zone can batch, throttle, or pause without blocking the sender.
- Replay-Based: Replication is a replay of appended facts, not a state transfer. Facts are immutable; replay is deterministic.
Replication Loop: Step by Step
The replication loop runs continuously or on-demand. Each iteration moves facts from sender to receiver.
Step 1: Receiver Requests New Facts
Receiver calls:
FetchFacts(
    consumer_name: "plant-a-receiver",
    limit: 100
)
Parameters:
consumer_name — identifies the replication relationship
limit — batch size (e.g., 100 facts per fetch)
The consumer name is durable. It tells the sender which cursor position to use. The sender looks up the last confirmed offset for this consumer and returns facts from last_confirmed_offset + 1 onward.
Step 2: Sender Returns Facts
Sender responds:
batch = [
    {offset: 100, fact: {...}, envelope: {...}},
    {offset: 101, fact: {...}, envelope: {...}},
    ...
    {offset: 199, fact: {...}, envelope: {...}}
]
Response includes:
offsets — immutable, assigned by sender's store buffer
facts — the appended messages
envelopes — metadata, from_zone, message_id, etc.
The sender returns up to limit facts. If fewer than limit facts are available, it returns all available facts. An empty batch signals "no new facts at this moment."
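The sender's side of Steps 1 and 2 can be sketched as follows. This is a minimal in-memory model, not the BSFG API: the class name `SenderStore` and its method names are illustrative, and `-1` stands for "nothing confirmed yet".

```python
class SenderStore:
    """Minimal model of the sending zone's store buffer and cursor tracker."""

    def __init__(self):
        self.log = []       # append-only log; index == offset
        self.cursors = {}   # consumer_name -> last confirmed offset

    def append(self, envelope, fact):
        self.log.append({"envelope": envelope, "fact": fact})
        return len(self.log) - 1    # offset assigned by the store buffer

    def fetch_facts(self, consumer_name, limit):
        """Return up to `limit` facts from last_confirmed_offset + 1 onward.

        An unknown consumer starts at offset 0. An empty list signals
        "no new facts at this moment".
        """
        start = self.cursors.get(consumer_name, -1) + 1
        end = min(start + limit, len(self.log))
        return [{"offset": o, **self.log[o]} for o in range(start, end)]
```

Because the sender keys everything on the durable consumer name, a fetch is a pure read: no sender state changes until a confirmation arrives.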
Step 3: Receiver Appends Facts Locally
For each fact in batch:
AppendFact(
    envelope: {...},
    fact: {...}
) → local_offset
Receiver's local append:
- facts go into receiver's store buffer (ESB/ISB)
- offsets are assigned locally (independent of sender's offsets)
- facts are durably persisted
The receiver's local offsets are independent of the sender's offsets. The mapping between sender offset and receiver offset is tracked internally (or not at all — only the sender's offset matters for cursor advancement).
Step 4: Receiver Confirms Receipt
After all facts in batch are durably appended:
ConfirmReceipt(
    consumer_name: "plant-a-receiver",
    offset: 199    // highest offset from the batch
) → {cursor_advanced_to: 199}
Confirmation includes:
consumer_name — identifies the relationship
offset — highest offset successfully appended locally
Confirmation is the critical signal. It tells the sender: "I have received and durably stored facts up to offset 199."
Step 5: Sender Advances Frontier
Sender updates cursor tracker:
Cursor(consumer_name="plant-a-receiver") = 199
Frontier semantics:
highest_contiguous_committed_offset = 199
Sender can now:
- truncate facts 0-199 from its store buffer (if TTL allows)
- reduce storage pressure
- know which facts have been safely replicated
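When several receiving zones pull from the same sender, truncation must respect the slowest one. A minimal sketch of that decision, assuming one confirmed-offset cursor per consumer as described above (the function name is illustrative):

```python
def truncation_frontier(cursors):
    """Highest offset that every registered consumer has confirmed.

    Facts at offsets <= this value may be truncated, subject to the
    sender's TTL policy. With no registered consumers, nothing is
    known to be replicated, so nothing is safely truncatable (-1).
    """
    if not cursors:
        return -1
    return min(cursors.values())
```

For example, with one receiver confirmed up to 199 and another only up to 150, the sender may truncate offsets 0 through 150.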
Step 6: Loop Continues
The receiver calls FetchFacts again. The sender returns facts from offset 200 onward. The loop repeats until there are no more facts to replicate.
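Putting Steps 1 through 6 together, one iteration of the receiver-driven loop might look like the sketch below. The stub classes stand in for the transport calls and local store ports; none of the names are the BSFG API.

```python
class StubSender:
    """Tiny stand-in for the sending zone (illustrative only)."""
    def __init__(self, facts):
        self.log = facts        # append-only; index == offset
        self.cursor = -1        # last confirmed offset for our one consumer
    def fetch_facts(self, consumer_name, limit):
        start = self.cursor + 1
        return [{"offset": o, "fact": self.log[o]}
                for o in range(start, min(start + limit, len(self.log)))]
    def confirm_receipt(self, consumer_name, offset):
        self.cursor = max(self.cursor, offset)

class StubReceiver:
    def __init__(self):
        self.store = []         # local store buffer; local offsets are independent
    def append_fact(self, fact):
        self.store.append(fact)
        return len(self.store) - 1

def replicate_once(sender, receiver, consumer_name, limit=100):
    """One fetch → append → confirm iteration; returns facts moved."""
    batch = sender.fetch_facts(consumer_name, limit)
    if not batch:
        return 0                # empty batch: no new facts at this moment
    for item in batch:
        receiver.append_fact(item["fact"])
    # Confirm only after the whole batch is durably appended locally.
    sender.confirm_receipt(consumer_name, batch[-1]["offset"])
    return len(batch)
```

Running `replicate_once` until it returns 0 drains the sender's backlog; because confirmation always trails the local append, a crash between the two steps only ever causes a re-fetch, never a loss.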
Cursor Semantics and Frontier Rules
Cursors are the mechanism that enables safe, deterministic replication. Each replication relationship maintains two related positions:
Cursor Position (Receiving Zone View)
Receiver tracks:
durable_consumer_position = last_confirmed_offset
After confirming offset 199:
receiver next calls: FetchFacts(limit=100)
sender will return: facts from offset 200 onward
Frontier Position (Sending Zone View)
Sender tracks (per receiving zone):
highest_contiguous_committed_offset = 199
This frontier is used for:
- truncation decisions (can delete offsets 0-199)
- replication progress monitoring
- recovery checkpoint
Contiguous Frontier Rule (ADR-0004)
The frontier advances only over contiguous confirmed offsets. If the receiver confirms offset 100 but then crashes before confirming 101, the frontier holds at 100 (even if 102 and 103 are later confirmed) until 101 is confirmed.
Sender store buffer: [0, 1, 2, ..., 100, 101, 102, 103]
Confirmations: [Y, Y, Y, ..., Y, N, Y, Y ] (101 missing)
Frontier: 100 (stops at first gap)
After receiver retries and confirms 101:
Frontier: 103 (advances to contiguous prefix 0-103)
This rule ensures that truncation is safe: we only delete facts that have been confirmed by all consumers, in order.
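The contiguous-prefix rule can be expressed as a small function over the set of per-offset confirmations. This is a sketch of the rule itself, not the BSFG implementation:

```python
def contiguous_frontier(confirmed, start=0):
    """Highest offset N such that every offset in [start, N] is confirmed.

    Returns start - 1 if the very first offset is unconfirmed, i.e.
    the frontier has not moved at all.
    """
    frontier = start - 1
    while frontier + 1 in confirmed:
        frontier += 1           # advance only across the contiguous prefix
    return frontier
```

This reproduces the example above: with 101 missing the frontier stops at 100, and once 101 is confirmed it jumps to 103.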
Failure Recovery
The peer replication protocol is designed to tolerate failures. Recovery is deterministic and requires no external orchestration.
Network Partition
Scenario: Receiving zone loses connectivity to sending zone
Behavior:
- FetchFacts calls timeout
- Receiver stops pulling facts
- Sending zone buffer accumulates facts
- Receiver continues locally (no blocking)
Duration: Minutes to hours
Recovery:
1. Connectivity restored
2. Receiver retries FetchFacts
3. Sender returns facts from (last_confirmed_offset + 1) onward
4. Receiver re-appends facts (idempotent)
5. Confirmations resume
6. Frontier advances
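The recovery sequence above needs no special reconciliation logic on the receiver: because the cursor advances only on confirmation, retrying a failed iteration simply re-reads from `last_confirmed_offset + 1`. A hedged sketch of the retry wrapper, assuming the transport raises `ConnectionError` while the peer is unreachable (the function names are illustrative):

```python
import time

def fetch_with_retry(fetch_once, attempts=5, base_delay=0.01, max_delay=1.0):
    """Retry one fetch/append/confirm iteration across a partition.

    `fetch_once` performs a single replication iteration and raises
    ConnectionError on timeout. Between attempts we back off
    exponentially, capped at `max_delay` seconds.
    """
    delay = base_delay
    for attempt in range(attempts):
        try:
            return fetch_once()
        except ConnectionError:
            if attempt == attempts - 1:
                raise           # give up; the sender keeps the backlog
            time.sleep(delay)   # partition still open: wait, then retry
            delay = min(delay * 2, max_delay)
```

In a real deployment the loop would run indefinitely rather than for a fixed attempt count; the bound here just keeps the sketch testable.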
Receiver (BSFG Node) Crash
Scenario: Receiving zone BSFG node crashes during a fetch/confirm cycle
Behavior:
- Receiver restarts
- Last confirmed offset is recovered from durable consumer
- Receiver resumes fetching from (last_confirmed_offset + 1)
- Some facts in new batch may have been partially appended
- Re-append them (idempotent append prevents duplicates)
- Confirm again
Sender (BSFG Node) Crash
Scenario: Sending zone BSFG node crashes with unconfirmed facts in buffer
Behavior:
- Sender restarts
- Store buffer is recovered from durable storage
- Cursor tracker is recovered
- Receiver calls FetchFacts
- Sender returns unconfirmed facts (from last confirmed offset + 1 onward)
- Receiver appends, confirms
- Frontier advances normally
Consumer Backlog Accumulation
Scenario: Receiving zone consumer processes facts very slowly
Behavior:
- Sender accumulates unconfirmed facts
- Buffer fill percentage increases
- Replication lag grows
- Receiving zone continues operating independently
Duration: Hours to days (slow consumer)
Recovery:
- Consumer accelerates
- Confirmations resume
- Frontier advances
- Sending zone buffer drains
Delivery Guarantees
The peer replication protocol provides clear, testable guarantees about how facts move between zones.
At-Least-Once Delivery
Every fact successfully appended to the sender's store buffer will be delivered at least once to the receiver. Facts are never lost due to network partition or sender/receiver restart.
- Sender retains facts until confirmed (TTL enforced, typically 7 days)
- Receiver recovers cursor on restart, can retry FetchFacts
- Network partition does not lose facts; replication resumes on reconnect
Idempotent Append
If the same fact is sent twice (due to receiver retry, sender recovery), the receiver's append is idempotent: the fact is stored once.
Sender offset 100 contains fact F with message_id "evt-123"
Scenario A: Normal flow
Receiver fetches offset 100 → appends F locally with message_id "evt-123"
Receiver confirms offset 100
Scenario B: Receiver crashes and retries
Receiver recovers from durable consumer → position is offset 99
Receiver fetches from offset 100 again → gets F again
Receiver tries to append F with message_id "evt-123"
Forward buffer's putIfAbsent rejects it (already exists)
Receiver confirms offset 100 again
Result: ONE copy of F in receiver's buffer, not two
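The putIfAbsent behavior described in Scenario B can be sketched as a buffer keyed by message_id. This is an illustrative in-memory model, not the forward buffer's actual implementation:

```python
class ForwardBuffer:
    """Sketch of an idempotent append keyed by envelope message_id."""

    def __init__(self):
        self.by_message_id = {}     # message_id -> local offset
        self.log = []               # local store; index == local offset

    def put_if_absent(self, envelope, fact):
        """Append once per message_id.

        A duplicate returns the existing local offset and False;
        a fresh append returns the new offset and True.
        """
        mid = envelope["message_id"]
        if mid in self.by_message_id:
            return self.by_message_id[mid], False   # already exists: reject
        self.log.append({"envelope": envelope, "fact": fact})
        offset = len(self.log) - 1
        self.by_message_id[mid] = offset
        return offset, True
```

A retried fetch after a crash therefore re-appends harmlessly: the second put for "evt-123" is rejected and the buffer still holds one copy.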
Eventual Consistency
All replication relationships converge to the same state given enough time and connectivity.
- Sender and receiver maintain durable cursors
- Cursors track replication progress
- Facts are immutable; replay is deterministic
- No distributed transaction or voting needed; causality is enforced by cursor order
No Exactly-Once
BSFG does not guarantee exactly-once delivery. Receivers must tolerate and handle duplicate delivery by implementing idempotent processing (deduplication by message_id or natural key).
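One way a consumer can meet this obligation is a dedup wrapper over its processing function. A minimal sketch, deduplicating by envelope message_id; a real consumer would persist the seen-set durably alongside its business state (all names here are hypothetical):

```python
def make_idempotent(process, seen=None):
    """Wrap `process` so duplicate deliveries become no-ops.

    Returns a handler taking (envelope, fact); it runs `process(fact)`
    only the first time a given message_id is seen.
    """
    seen = set() if seen is None else seen

    def handle(envelope, fact):
        mid = envelope["message_id"]
        if mid in seen:
            return False        # duplicate delivery: skip processing
        process(fact)
        seen.add(mid)           # record only after successful processing
        return True

    return handle
```

Deduplication by a natural business key (order number, batch ID) works the same way when message_id is not stable across producers.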
No Synchronous Confirmation
FetchFacts and ConfirmReceipt are asynchronous. Confirmation does not mean the receiving zone's consumer has processed the facts, only that they are durably received.
Multi-Zone Topology
In a three-zone deployment, replication relationships form a network:
Enterprise ↔ IDMZ ↔ Plant A
     ↑                  │
     └──── optional ────┘
Replication relationships:
- Enterprise ↔ IDMZ (enterprise pulls from IDMZ, or vice versa)
- IDMZ ↔ Plant A
- (optional) Enterprise ↔ Plant A (direct, bypassing IDMZ)
Each relationship is independent:
- separate cursors
- separate confirmation state
- can fail/recover separately
Performance Characteristics
Under healthy, normal operation:
- FetchFacts latency: 1–50ms (depends on batch size, network, sender load)
- Replication lag: < 100ms (time from append at sender to confirm at receiver)
- Throughput: Sender-limited by storage write speed; typically 10k–100k facts/sec per node
- Cursor advancement: Immediate (once ConfirmReceipt is received and durable)
Under partition or overload:
- Replication lag: Minutes to hours (depends on partition duration or backlog size)
- Throughput: Receiver-limited by processing speed and backpressure policy
- Zone autonomy: Maintained; zones can operate independently until reconnect
Replication vs. Consumption
It's important to distinguish between replication (zone-to-zone) and consumption (application processing):
- Replication: BSFG node fetches facts from peer zone, appends to local store, confirms. The receiving zone now has a durable copy.
- Consumption: An application process reads facts from the local forward buffer via FetchFacts, processes them idempotently, and confirms. Only at this stage does the application's business logic execute.
Replication and consumption are decoupled. A fact may be replicated to the receiving zone long before any consumer processes it. This allows producers to complete without waiting for consumer processing.
Related Concepts
The peer replication protocol builds on several foundational BSFG concepts:
- Replay Model — Explains the four-step handoff and asynchronous replay semantics
- Boundary Roles — Describes ISB, IFB, ESB, EFB and how they are implemented
- Message Model — Details envelope and fact structure used in replication
- Integration Contract — Defines producer/consumer obligations in multi-zone systems
- ADR-0032: Pull-Driven Transfer — Normative decision for receiver-driven replication
- ADR-0033: At-Least-Once Delivery — Normative decision for delivery semantics
- ADR-0004: Contiguous Frontier — Normative decision for safe truncation