Audience: Platform engineers, implementers. Use: Understand cross-zone replication flow, cursors, confirmations, and recovery behavior.
Context: Multi-Zone Deployments
Industrial deployments typically span multiple zones separated by trust, network, and operational boundaries:
Enterprise Zone         IDMZ                   Plant OT Zone
(IT Systems)            (Mediation Layer)      (Operational Equip)
     │                       │                      │
     ├── BSFG Node           ├── BSFG Node          ├── BSFG Node
     ├── Durable             ├── Durable            ├── Durable
     │   Fact Store          │   Fact Store         │   Fact Store
     └── Apps                └── Gateway            └── PLC/Historian
Each zone runs a local BSFG node with its own durable substrate implementing the BSFG storage ports. Zones are operationally independent: each can survive the failure or disconnection of any other zone.
Yet facts must flow between zones reliably. The mechanism for this flow is the peer replication protocol — a receiver-driven replay mechanism where BSFG nodes pull facts from each other asynchronously, confirm receipt, and advance shared cursors.
Protocol Overview
The peer replication protocol is fundamentally pull-driven. The receiving zone initiates fetch requests; the sending zone returns facts without pushing.
Basic Replication Relationship
┌────────────────────┐
│ Sending Zone │ (holds authoritative facts)
│ BSFG Node │
│ │
│ ISB (store) ─┐ │
│ │ │
│ IFB ──────────┼─→ Cursor Tracker (frontier)
└────────────────┼──┘
│
FetchFacts(consumer_name, limit)
│
▼
┌────────────────────┐
│ Receiving Zone │ (pulls facts)
│ BSFG Node │
│ │
│ ESB (store) ───┐ │
│ │ │
│ EFB ────────── ├──→ Cursor Tracker
└────────────────┼───┘
│
ConfirmReceipt(consumer_name, offset)
│
▼
Cursor advances
(frontier moves)
Key Characteristics
- Receiver-driven (Pull): The receiving zone calls FetchFacts when ready. The sending zone does not push.
- Passive Sender: The sending zone holds facts durably and responds to fetch requests. No state change until confirmation arrives.
- Asynchronous: Sending zone does not wait for consumer processing. Receiving zone can batch, throttle, or pause without blocking the sender.
- Replay-Based: Replication is a replay of appended facts, not a state transfer. Facts are immutable; replay is deterministic.
Replication Loop: Step by Step
The replication loop runs continuously or on-demand. Each iteration moves facts from sender to receiver.
Step 1: Receiver Requests New Facts
Receiver calls:
FetchFacts(
    consumer_name: "plant-a-receiver",
    limit: 100
)
Parameters:
consumer_name — identifies the replication relationship
limit — batch size (e.g., 100 facts per fetch)
The consumer name is durable. It tells the sender which cursor position to use. The sender looks up the last confirmed offset for this consumer and returns facts from last_confirmed_offset + 1 onward.
Step 2: Sender Returns Facts
Sender responds:
batch = [
    {offset: 100, fact: {...}, envelope: {...}},
    {offset: 101, fact: {...}, envelope: {...}},
    ...
    {offset: 199, fact: {...}, envelope: {...}}
]
Response includes:
offsets — immutable, assigned by sender's store buffer
facts — the appended messages
envelopes — metadata, from_zone, message_id, etc.
The sender returns up to limit facts. If fewer than limit facts are available, it returns all available facts. An empty batch signals "no new facts at this moment."
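The sender's side of Steps 1 and 2 can be sketched as follows. This is a minimal in-memory model, not the BSFG API: the class name `SenderStore` and its method names are illustrative, and `-1` stands for "nothing confirmed yet".

```python
class SenderStore:
    """Minimal model of the sending zone's store buffer and cursor tracker."""

    def __init__(self):
        self.log = []       # append-only log; index == offset
        self.cursors = {}   # consumer_name -> last confirmed offset

    def append(self, envelope, fact):
        self.log.append({"envelope": envelope, "fact": fact})
        return len(self.log) - 1    # offset assigned by the store buffer

    def fetch_facts(self, consumer_name, limit):
        """Return up to `limit` facts from last_confirmed_offset + 1 onward.

        An unknown consumer starts at offset 0. An empty list signals
        "no new facts at this moment".
        """
        start = self.cursors.get(consumer_name, -1) + 1
        end = min(start + limit, len(self.log))
        return [{"offset": o, **self.log[o]} for o in range(start, end)]
```

Because the sender keys everything on the durable consumer name, a fetch is a pure read: no sender state changes until a confirmation arrives.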
Step 3: Receiver Appends Facts Locally
For each fact in batch:
AppendFact(
    envelope: {...},
    fact: {...}
) → local_offset
Receiver's local append:
- facts go into receiver's store buffer (ESB/ISB)
- offsets are assigned locally (independent of sender's offsets)
- facts are durably persisted
The receiver's local offsets are independent of the sender's offsets. The mapping between sender offset and receiver offset is tracked internally (or not at all — only the sender's offset matters for cursor advancement).
Step 4: Receiver Confirms Receipt
After all facts in batch are durably appended:
ConfirmReceipt(
    consumer_name: "plant-a-receiver",
    offset: 199    // highest offset from the batch
) → {cursor_advanced_to: 199}
Confirmation includes:
consumer_name — identifies the relationship
offset — highest offset successfully appended locally
Confirmation is the critical signal. It tells the sender: "I have received and durably stored facts up to offset 199."
Step 5: Sender Advances Frontier
Sender updates cursor tracker:
Cursor(consumer_name="plant-a-receiver") = 199
Frontier semantics:
highest_contiguous_committed_offset = 199
Sender can now:
- truncate facts 0-199 from its store buffer (if TTL allows)
- reduce storage pressure
- know which facts have been safely replicated
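When several receiving zones pull from the same sender, truncation must respect the slowest one. A minimal sketch of that decision, assuming one confirmed-offset cursor per consumer as described above (the function name is illustrative):

```python
def truncation_frontier(cursors):
    """Highest offset that every registered consumer has confirmed.

    Facts at offsets <= this value may be truncated, subject to the
    sender's TTL policy. With no registered consumers, nothing is
    known to be replicated, so nothing is safely truncatable (-1).
    """
    if not cursors:
        return -1
    return min(cursors.values())
```

For example, with one receiver confirmed up to 199 and another only up to 150, the sender may truncate offsets 0 through 150.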
Step 6: Loop Continues
The receiver calls FetchFacts again. The sender returns facts from offset 200 onward. The loop repeats until there are no more facts to replicate.
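Putting Steps 1 through 6 together, one iteration of the receiver-driven loop might look like the sketch below. The stub classes stand in for the transport calls and local store ports; none of the names are the BSFG API.

```python
class StubSender:
    """Tiny stand-in for the sending zone (illustrative only)."""
    def __init__(self, facts):
        self.log = facts        # append-only; index == offset
        self.cursor = -1        # last confirmed offset for our one consumer
    def fetch_facts(self, consumer_name, limit):
        start = self.cursor + 1
        return [{"offset": o, "fact": self.log[o]}
                for o in range(start, min(start + limit, len(self.log)))]
    def confirm_receipt(self, consumer_name, offset):
        self.cursor = max(self.cursor, offset)

class StubReceiver:
    def __init__(self):
        self.store = []         # local store buffer; local offsets are independent
    def append_fact(self, fact):
        self.store.append(fact)
        return len(self.store) - 1

def replicate_once(sender, receiver, consumer_name, limit=100):
    """One fetch → append → confirm iteration; returns facts moved."""
    batch = sender.fetch_facts(consumer_name, limit)
    if not batch:
        return 0                # empty batch: no new facts at this moment
    for item in batch:
        receiver.append_fact(item["fact"])
    # Confirm only after the whole batch is durably appended locally.
    sender.confirm_receipt(consumer_name, batch[-1]["offset"])
    return len(batch)
```

Running `replicate_once` until it returns 0 drains the sender's backlog; because confirmation always trails the local append, a crash between the two steps only ever causes a re-fetch, never a loss.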
Cursor Semantics and Frontier Rules
Cursors are the mechanism that enables safe, deterministic replication. Each replication relationship maintains two related positions:
Cursor Position (Receiving Zone View)
Receiver tracks:
durable_consumer_position = last_confirmed_offset
After confirming offset 199:
receiver next calls: FetchFacts(limit=100)
sender will return: facts from offset 200 onward
Frontier Position (Sending Zone View)
Sender tracks (per receiving zone):
highest_contiguous_committed_offset = 199
This frontier is used for:
- truncation decisions (can delete offsets 0-199)
- replication progress monitoring
- recovery checkpoint
Contiguous Frontier Rule (ADR-0004)
The frontier advances only over contiguous confirmed offsets. If the receiver confirms offset 100 but then crashes before confirming 101, the frontier holds at 100 (even if 102 and 103 are later confirmed) until 101 is confirmed.
Sender store buffer: [0, 1, 2, ..., 100, 101, 102, 103]
Confirmations: [Y, Y, Y, ..., Y, N, Y, Y ] (101 missing)
Frontier: 100 (stops at first gap)
After receiver retries and confirms 101:
Frontier: 103 (advances to contiguous prefix 0-103)
This rule ensures that truncation is safe: we only delete facts that have been confirmed by all consumers, in order.
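The contiguous-prefix rule can be expressed as a small function over the set of per-offset confirmations. This is a sketch of the rule itself, not the BSFG implementation:

```python
def contiguous_frontier(confirmed, start=0):
    """Highest offset N such that every offset in [start, N] is confirmed.

    Returns start - 1 if the very first offset is unconfirmed, i.e.
    the frontier has not moved at all.
    """
    frontier = start - 1
    while frontier + 1 in confirmed:
        frontier += 1           # advance only across the contiguous prefix
    return frontier
```

This reproduces the example above: with 101 missing the frontier stops at 100, and once 101 is confirmed it jumps to 103.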
Failure Recovery
The peer replication protocol is designed to tolerate failures. Recovery is deterministic and requires no external orchestration.
Network Partition
Scenario: Receiving zone loses connectivity to sending zone
Behavior:
- FetchFacts calls timeout
- Receiver stops pulling facts
- Sending zone buffer accumulates facts
- Receiver continues locally (no blocking)
Duration: Minutes to hours
Recovery:
1. Connectivity restored
2. Receiver retries FetchFacts
3. Sender returns facts from (last_confirmed_offset + 1) onward
4. Receiver re-appends facts (idempotent)
5. Confirmations resume
6. Frontier advances
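The recovery sequence above needs no special reconciliation logic on the receiver: because the cursor advances only on confirmation, retrying a failed iteration simply re-reads from `last_confirmed_offset + 1`. A hedged sketch of the retry wrapper, assuming the transport raises `ConnectionError` while the peer is unreachable (the function names are illustrative):

```python
import time

def fetch_with_retry(fetch_once, attempts=5, base_delay=0.01, max_delay=1.0):
    """Retry one fetch/append/confirm iteration across a partition.

    `fetch_once` performs a single replication iteration and raises
    ConnectionError on timeout. Between attempts we back off
    exponentially, capped at `max_delay` seconds.
    """
    delay = base_delay
    for attempt in range(attempts):
        try:
            return fetch_once()
        except ConnectionError:
            if attempt == attempts - 1:
                raise           # give up; the sender keeps the backlog
            time.sleep(delay)   # partition still open: wait, then retry
            delay = min(delay * 2, max_delay)
```

In a real deployment the loop would run indefinitely rather than for a fixed attempt count; the bound here just keeps the sketch testable.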
Receiver (BSFG Node) Crash
Scenario: Receiving zone BSFG node crashes during a fetch/confirm cycle
Behavior:
- Receiver restarts
- Last confirmed offset is recovered from durable consumer
- Receiver resumes fetching from (last_confirmed_offset + 1)
- Some facts in new batch may have been partially appended
- Re-append them (idempotent append prevents duplicates)
- Confirm again
Sender (BSFG Node) Crash
Scenario: Sending zone BSFG node crashes with unconfirmed facts in buffer
Behavior:
- Sender restarts
- Store buffer is recovered from durable storage
- Cursor tracker is recovered
- Receiver calls FetchFacts
- Sender returns unconfirmed facts (from last confirmed offset + 1 onward)
- Receiver appends, confirms
- Frontier advances normally
Consumer Backlog Accumulation
Scenario: Receiving zone consumer processes facts very slowly
Behavior:
- Sender accumulates unconfirmed facts
- Buffer fill percentage increases
- Replication lag grows
- Receiving zone continues operating independently
Duration: Hours to days (slow consumer)
Recovery:
- Consumer accelerates
- Confirmations resume
- Frontier advances
- Sending zone buffer drains
Delivery Guarantees
The peer replication protocol provides clear, testable guarantees about how facts move between zones.
At-Least-Once Delivery
Every fact successfully appended to the sender's store buffer will be delivered at least once to the receiver. Facts are never lost due to network partition or sender/receiver restart.
- Sender retains facts until confirmed (TTL enforced, typically 7 days)
- Receiver recovers cursor on restart, can retry FetchFacts
- Network partition does not lose facts; replication resumes on reconnect
Idempotent Append
If the same fact is sent twice (due to receiver retry, sender recovery), the receiver's append is idempotent: the fact is stored once.
Sender offset 100 contains fact F with message_id "evt-123"
Scenario A: Normal flow
Receiver fetches offset 100 → appends F locally with message_id "evt-123"
Receiver confirms offset 100
Scenario B: Receiver crashes and retries
Receiver recovers from durable consumer → position is offset 99
Receiver fetches from offset 100 again → gets F again
Receiver tries to append F with message_id "evt-123"
Forward buffer's putIfAbsent rejects it (already exists)
Receiver confirms offset 100 again
Result: ONE copy of F in receiver's buffer, not two
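The putIfAbsent behavior described in Scenario B can be sketched as a buffer keyed by message_id. This is an illustrative in-memory model, not the forward buffer's actual implementation:

```python
class ForwardBuffer:
    """Sketch of an idempotent append keyed by envelope message_id."""

    def __init__(self):
        self.by_message_id = {}     # message_id -> local offset
        self.log = []               # local store; index == local offset

    def put_if_absent(self, envelope, fact):
        """Append once per message_id.

        A duplicate returns the existing local offset and False;
        a fresh append returns the new offset and True.
        """
        mid = envelope["message_id"]
        if mid in self.by_message_id:
            return self.by_message_id[mid], False   # already exists: reject
        self.log.append({"envelope": envelope, "fact": fact})
        offset = len(self.log) - 1
        self.by_message_id[mid] = offset
        return offset, True
```

A retried fetch after a crash therefore re-appends harmlessly: the second put for "evt-123" is rejected and the buffer still holds one copy.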
Eventual Consistency
All replication relationships converge to the same state given enough time and connectivity.
- Sender and receiver maintain durable cursors
- Cursors track replication progress
- Facts are immutable; replay is deterministic
- No distributed transaction or voting needed; causality is enforced by cursor order
No Exactly-Once
BSFG does not guarantee exactly-once delivery. Receivers must tolerate and handle duplicate delivery by implementing idempotent processing (deduplication by message_id or natural key).
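One way a consumer can meet this obligation is a dedup wrapper over its processing function. A minimal sketch, deduplicating by envelope message_id; a real consumer would persist the seen-set durably alongside its business state (all names here are hypothetical):

```python
def make_idempotent(process, seen=None):
    """Wrap `process` so duplicate deliveries become no-ops.

    Returns a handler taking (envelope, fact); it runs `process(fact)`
    only the first time a given message_id is seen.
    """
    seen = set() if seen is None else seen

    def handle(envelope, fact):
        mid = envelope["message_id"]
        if mid in seen:
            return False        # duplicate delivery: skip processing
        process(fact)
        seen.add(mid)           # record only after successful processing
        return True

    return handle
```

Deduplication by a natural business key (order number, batch ID) works the same way when message_id is not stable across producers.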
No Synchronous Confirmation
FetchFacts and ConfirmReceipt are asynchronous. Confirmation does not mean the receiving zone's consumer has processed the facts, only that they are durably received.
Multi-Zone Topology
In a three-zone deployment, replication relationships form a network:
Enterprise ↔ IDMZ ↔ Plant A
     ↑                  │
     └──── optional ────┘
Replication relationships:
- Enterprise ↔ IDMZ (enterprise pulls from IDMZ, or vice versa)
- IDMZ ↔ Plant A
- (optional) Enterprise ↔ Plant A (direct, bypassing IDMZ)
Each relationship is independent:
- separate cursors
- separate confirmation state
- can fail/recover separately
Performance Characteristics
Under healthy, normal operation:
- FetchFacts latency: 1–50ms (depends on batch size, network, sender load)
- Replication lag: < 100ms (time from append at sender to confirm at receiver)
- Throughput: Sender-limited by storage write speed; typically 10k–100k facts/sec per node
- Cursor advancement: Immediate (once ConfirmReceipt is received and durable)
Under partition or overload:
- Replication lag: Minutes to hours (depends on partition duration or backlog size)
- Throughput: Receiver-limited by processing speed and backpressure policy
- Zone autonomy: Maintained; zones can operate independently until reconnect
Replication vs. Consumption
It's important to distinguish between replication (zone-to-zone) and consumption (application processing):
- Replication: BSFG node fetches facts from peer zone, appends to local store, confirms. The receiving zone now has a durable copy.
- Consumption: An application process reads facts from the local forward buffer via FetchFacts, processes them idempotently, and confirms. Only at this stage does the application's business logic execute.
Replication and consumption are decoupled. A fact may be replicated to the receiving zone long before any consumer processes it. This allows producers to complete without waiting for consumer processing.
Related Concepts
The peer replication protocol builds on several foundational BSFG concepts:
- Replay Model — Explains the four-step handoff and asynchronous replay semantics
- Boundary Roles — Describes ISB, IFB, ESB, EFB and how they are implemented
- Message Model — Details envelope and fact structure used in replication
- Integration Contract — Defines producer/consumer obligations in multi-zone systems
- ADR-0032: Pull-Driven Transfer — Normative decision for receiver-driven replication
- ADR-0033: At-Least-Once Delivery — Normative decision for delivery semantics
- ADR-0004: Contiguous Frontier — Normative decision for safe truncation