Runbook: Triad-HA Zone Deployment

Purpose

Operationalize the deployment of one BSFG zone using the Triad-HA deployment pattern.

This runbook translates the reference architecture into ordered operator actions for preparing, installing, bootstrapping, and validating a single autonomous zone.

Scope

This runbook covers:

  • One BSFG zone using Alpha/Beta/Gamma node roles
  • JetStream cluster bring-up
  • Keepalived VIP management setup
  • BSFG controller promotion and demotion behavior
  • Initial validation and handoff

This runbook does not cover:

  • Cross-zone federation establishment
  • Multi-zone operational procedures
  • Ongoing operational recovery beyond initial deployment and validation

Reference

This runbook operationalizes the Reference Deployment Pattern: Triad-HA with Keepalived Failover.


1. Preconditions

Verify all before proceeding. Halt if any precondition fails.

| Check | Method | Expected Result |
| --- | --- | --- |
| Host inventory approved | CMDB, asset register, or change ticket | Three hosts assigned to Alpha, Beta, Gamma roles |
| IP assignments fixed | Network allocation sheet | Alpha, Beta, Gamma IPs reserved; VIP reserved |
| Storage devices present | Hardware inventory or lsblk | Required NVMe and RAID-backed disks physically present |
| Certificates issued | openssl x509 -in /path/to/cert -noout -dates -subject | Valid, not expired, identity matches zone policy |
| Time sync healthy | timedatectl status or equivalent | NTP synchronized, no material clock drift |
| Firewall rules in place | Network policy ticket or host firewall verification | Required ports configured per Triad-HA pattern |
| Secrets and configs rendered | File presence and checksum verification | /opt/bsfg/ contains rendered configs and secrets |
| Operator access established | SSH verification | Login succeeds to all three nodes |

Halt if any precondition is not satisfied.
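
The checks in the table can be driven from a small script so that a failed check stops the run. The sketch below is illustrative only: run_check is a hypothetical helper, not part of any BSFG tooling, and the commands passed to it stand in for the methods listed in the table.

```shell
# Hypothetical helper: run one precondition command and report PASS or FAIL.
# Returns non-zero on failure so the caller can halt.
run_check() {
  rc_name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS  %s\n' "$rc_name"
  else
    printf 'FAIL  %s\n' "$rc_name"
    return 1
  fi
}
```

Possible usage: run_check "rendered configs" test -d /opt/bsfg || exit 1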


2. Inputs Required

| Input | Description | Example |
| --- | --- | --- |
| ZONE_NAME | Zone identifier | enterprise, plant-a |
| ALPHA_HOSTNAME | Primary service-bearing node | enterprise-alpha |
| BETA_HOSTNAME | Secondary service-bearing node | enterprise-beta |
| GAMMA_HOSTNAME | Lightweight persistence node | enterprise-gamma |
| VIP | Floating virtual IP | 10.1.1.10 |
| PEER_ENDPOINTS | Other zone endpoints for later federation | 10.2.1.10:9443,10.3.1.10:9443 |
| CERT_PATH | mTLS certificate directory | /opt/bsfg/certs/ |
| JETSTREAM_CONFIG_PATH | NATS config path | /opt/bsfg/jetstream.conf |
| ARTIFACT_MOUNT | Artifact storage mount path | /artifacts |
| ALERT_ENDPOINT | Monitoring or alerting destination | alertmanager:9093 |
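
If the inputs are exported as shell variables under the names in the table, a guard such as the following can confirm that none is empty before any step runs. This is a sketch under that assumption; require_inputs is a hypothetical helper name.

```shell
# Hypothetical helper: fail if any named variable is unset or empty.
# The body runs in a subshell so its working variables do not leak.
require_inputs() (
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"      # indirect lookup of the variable called $name
    if [ -z "$value" ]; then
      printf 'missing input: %s\n' "$name" >&2
      missing=1
    fi
  done
  exit "$missing"
)
```

Possible usage: require_inputs ZONE_NAME ALPHA_HOSTNAME BETA_HOSTNAME GAMMA_HOSTNAME VIP || exit 1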

3. Host Preparation

Execute on all three nodes unless otherwise noted.

3.1 OS Baseline

Verify the operating system is within the supported matrix and fully patched.

Example verification:

source /etc/os-release
echo "$PRETTY_NAME"

Apply current security and package updates using the approved package manager for the host OS.

Expected result: Supported OS version, current security baseline applied, reboot completed if required.

Halt if: OS is outside supported matrix or required reboot is deferred.

3.2 Package Installation

Install required host packages and verify service availability.

Required capabilities include:

  • container runtime and Compose support
  • Keepalived
  • TLS tooling
  • monitoring/logging agent prerequisites
  • disk and RAID tooling

Example verification:

docker --version
docker compose version
keepalived --version
openssl version

Expected result: Required packages installed; Docker daemon running; Compose available.

Halt if: Container runtime fails to start, Compose unavailable, or package versions fall outside supported baseline.
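
A quick presence check for the required binaries might look like the sketch below. need_cmds is a hypothetical helper. It only confirms that each command is on PATH, so the version checks above (including docker compose version, since Compose is a Docker subcommand) are still needed.

```shell
# Hypothetical helper: report every command in the argument list that is
# not found on PATH, and fail if any is missing.
need_cmds() (
  status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf 'missing command: %s\n' "$cmd" >&2
      status=1
    fi
  done
  exit "$status"
)
```

Possible usage: need_cmds docker keepalived openssl || exit 1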

3.3 Filesystem Creation and Mount

Prepare the dedicated JetStream disk on each node.

Required outcomes:

  • dedicated filesystem created on intended JetStream device
  • mounted at /data/jetstream
  • mount options aligned with policy
  • device identity recorded

Example verification:

lsblk -f
mountpoint /data/jetstream
findmnt /data/jetstream

Expected result: Dedicated JetStream volume mounted at /data/jetstream, formatted per policy, and verified.

Halt if: Wrong device selected, mount fails, or filesystem integrity is in doubt.

3.4 RAID Verification (Alpha and Beta)

Verify RAID-backed volumes for system and artifact storage are present and healthy.

Example verification:

cat /proc/mdstat
mountpoint /artifacts

Expected result: RAID arrays healthy, no degraded state, artifact mount present on Alpha/Beta.

Halt if: RAID degraded, missing, or rebuild in progress without explicit approval.
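
For Linux software RAID (md), the degraded and rebuilding conditions can be detected from the text of /proc/mdstat. The following is a minimal sketch that assumes md arrays; hardware RAID controllers need their vendor tooling instead. mdstat_healthy is a hypothetical helper name.

```shell
# Reads /proc/mdstat-style text on stdin.
# Fails if no md array is listed, if any array shows a missing member
# ("_" inside the [UU] status map), or if a recovery, resync, or reshape
# is reported.
mdstat_healthy() (
  text=$(cat)
  printf '%s\n' "$text" | grep -q '^md' || exit 1
  if printf '%s\n' "$text" | grep -Eq '\[[U_]*_[U_]*\]|recovery|resync|reshape'; then
    exit 1
  fi
  exit 0
)
```

Possible usage: mdstat_healthy < /proc/mdstat || echo "RAID not healthy, halt"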

3.5 Mount Persistence

Ensure all required mounts survive reboot and reload cleanly through persistent mount configuration.

Example verification:

findmnt --verify
mount -a
mountpoint /data/jetstream

Expected result: Required mounts restore successfully from persistent configuration.

Halt if: Persistent mount configuration fails validation.
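
Where persistence is implemented through /etc/fstab (an assumption; systemd mount units are another option), the presence of an entry for each required mount point can be checked as sketched below. fstab_has_mount is a hypothetical helper name.

```shell
# Succeeds if the fstab-style file FSTAB has a non-comment entry whose
# mount point (second field) equals MOUNTPOINT.
# Usage: fstab_has_mount FSTAB MOUNTPOINT
fstab_has_mount() {
  awk -v mp="$2" '
    $1 !~ /^#/ && $2 == mp { found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

Possible usage: fstab_has_mount /etc/fstab /data/jetstream || echo "no persistent entry, halt"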

3.6 Log Agent Installation

Install and enable the approved node-level log shipping agent.

Expected result: Log agent installed, enabled, and configured to forward to the approved aggregation target.

Halt if: Agent cannot start, cannot reach aggregation target, or configuration is missing.

3.7 Time Sync Verification

Verify clock synchronization and acceptable offset on all nodes.

Example verification:

timedatectl status
chronyc tracking

Expected result: NTP synchronized; clock offset within approved threshold.

Halt if: Clock skew exceeds policy threshold or time sync is disabled.
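
On hosts that use chrony, the offset can be compared against the policy threshold mechanically. The sketch below parses the "System time" line of chronyc tracking output; chrony_offset_within is a hypothetical helper, and the threshold value is a placeholder for whatever the site policy specifies.

```shell
# Reads `chronyc tracking` output on stdin. Succeeds if the "System time"
# offset (reported by chronyc as a positive number of seconds, fast or slow)
# is at most MAX_SECONDS. Fails if the line is absent.
# Usage: chronyc tracking | chrony_offset_within MAX_SECONDS
chrony_offset_within() {
  awk -v max="$1" '
    /^System time/ { found = 1; if ($4 + 0 > max + 0) bad = 1 }
    END { exit ((found && !bad) ? 0 : 1) }
  '
}
```

Possible usage: chronyc tracking | chrony_offset_within 0.05 || echo "clock offset too large, halt"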


4. Service Installation

4.1 JetStream Configuration Placement

On all nodes, place the rendered JetStream configuration in the approved location and verify:

  • correct zone or cluster identity
  • full route list for Alpha, Beta, Gamma
  • correct monitoring and cluster ports
  • correct data directory mapping

Example verification:

grep -n "cluster" /opt/bsfg/jetstream.conf
grep -n "routes" /opt/bsfg/jetstream.conf

Expected result: Configuration present, zone identity correct, all three node routes present.

Halt if: Cluster identity mismatched or route list incomplete.
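
The route-list check can be made explicit per host. The following sketch assumes routes are written as URLs that contain the hostname followed by a port (for example nats://enterprise-alpha:6222, where 6222 is only an illustrative port). It searches the whole file rather than only the routes block, so treat a pass as a hint and not as proof. routes_complete is a hypothetical helper name.

```shell
# Succeeds if CONF mentions every HOST as a URL endpoint, that is, as
# "/host:" or "@host:" (the second form covers URLs with credentials).
# Usage: routes_complete CONF HOST...
routes_complete() (
  conf=$1
  shift
  status=0
  for host in "$@"; do
    if ! grep -qF -e "/$host:" -e "@$host:" "$conf"; then
      printf 'route missing for %s\n' "$host" >&2
      status=1
    fi
  done
  exit "$status"
)
```

Possible usage: routes_complete /opt/bsfg/jetstream.conf "$ALPHA_HOSTNAME" "$BETA_HOSTNAME" "$GAMMA_HOSTNAME"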

4.2 Keepalived Configuration (Alpha and Beta)

Install the rendered Keepalived configuration on Alpha and Beta.

Verify:

  • Alpha and Beta share the same VRRP instance identity
  • Alpha has higher priority
  • unicast peers are correct
  • notify scripts and health-tracking scripts are present
  • VIP configured correctly

Example verification:

grep -n "virtual_router_id" /etc/keepalived/keepalived.conf
grep -n "priority" /etc/keepalived/keepalived.conf
grep -n "unicast_peer" -A 2 /etc/keepalived/keepalived.conf

Expected result: Keepalived configuration present and consistent across Alpha/Beta with correct priority asymmetry.

Halt if: Peer IPs, VRRP identity, notify hooks, or VIP settings are incorrect.
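
Priority asymmetry can be confirmed by comparing the two rendered files side by side. The sketch below assumes each file contains a single VRRP instance and that a copy of Beta's file is available locally; both helper names are hypothetical.

```shell
# Print the first "priority N" value found in a keepalived.conf-style file.
vrrp_priority() {
  awk '$1 == "priority" { print $2; exit }' "$1"
}

# Succeeds if the first file's priority is strictly higher than the second's.
# Fails (with a shell error message) if either file has no priority line.
priority_higher() {
  [ "$(vrrp_priority "$1")" -gt "$(vrrp_priority "$2")" ]
}
```

Possible usage: priority_higher alpha-keepalived.conf beta-keepalived.conf || echo "Alpha does not outrank Beta, halt"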

4.3 BSFG Controller Deployment Artifact Placement

On Alpha and Beta:

  • place controller deployment manifest
  • place certificates and CA material
  • verify file ownership and permissions
  • verify image/tag selection
  • verify bind address configured to VIP, not wildcard address

Example verification:

ls -l /opt/bsfg/certs
grep -n "image:" /opt/bsfg/docker-compose.yml
grep -n "BIND_ADDRESS" /opt/bsfg/docker-compose.yml

Expected result: Certificates present with restricted permissions; deployment manifest references approved image; controller configured to bind to VIP.

Halt if: Private key permissions are too broad, image reference missing, or bind address is not VIP-specific.
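
"Too broad" can be tested concretely by rejecting any private key that grants access to group or others. This is a minimal sketch; key_perms_ok is a hypothetical helper, and the key file name in the usage line is a placeholder because the runbook does not name the files under /opt/bsfg/certs.

```shell
# Succeeds if FILE is a regular file that grants no permissions to group
# or others (for example mode 0600 or 0400). Symlinks are rejected.
key_perms_ok() {
  case $(ls -ld "$1" 2>/dev/null) in
    -???------*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Possible usage: key_perms_ok /opt/bsfg/certs/example.key || echo "private key permissions too broad, halt"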

4.4 Systemd Unit and Slice Installation

Install systemd slices and services required for:

  • JetStream
  • BSFG controller
  • optional monitoring helpers

Reload systemd and enable only the services that should start independently.

Critical rule: the BSFG controller must not be allowed to auto-start independently of promotion flow.

Required state:

  • JetStream enabled
  • Keepalived enabled on Alpha/Beta
  • BSFG controller installed but not enabled for independent boot start
  • BSFG controller started only via promotion scripts or controlled orchestration

Example verification:

systemctl daemon-reload
systemctl is-enabled jetstream.service
systemctl is-enabled keepalived
systemctl is-enabled bsfg-controller.service

Expected result: JetStream enabled; Keepalived enabled where applicable; BSFG controller installed but not independently enabled.

Halt if: BSFG controller is configured to auto-start outside promotion flow.
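
The state printed by systemctl is-enabled can be classified against the rule above. In this sketch only "disabled" passes: "enabled" would start the controller at boot, and "masked" would stop the promotion scripts from starting it at all. Other states are flagged for manual review. The helper name and the review policy are assumptions, not part of the reference pattern.

```shell
# Classify the state string printed by `systemctl is-enabled UNIT`.
# Returns 0 only for "disabled".
controller_boot_state() {
  case $1 in
    disabled)
      echo "ok: not started at boot, startable by promotion flow" ;;
    masked)
      echo "review: unit cannot be started, promotion would fail"; return 1 ;;
    enabled|enabled-runtime)
      echo "halt: unit starts independently at boot"; return 1 ;;
    *)
      echo "review: unexpected state '$1'"; return 1 ;;
  esac
}
```

Because systemctl is-enabled exits non-zero for a disabled unit, capture its output rather than its status, for example: controller_boot_state "$(systemctl is-enabled bsfg-controller.service 2>/dev/null)"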


5. Bootstrap Sequence

5.1 Start JetStream on All Nodes

Start JetStream on all three nodes. Gamma-first is conventional, but any order is acceptable if cluster formation is verified afterward.

Example:

docker compose -f /opt/bsfg/docker-compose.yml up -d jetstream

Expected result: JetStream container or service running on all three nodes.

Example verification:

docker ps | grep jetstream

5.2 Verify Cluster Formation

After startup, verify:

  • all three nodes visible
  • leader elected
  • cluster identity correct
  • no route or quorum errors

Example verification:

nats server report jetstream --server nats://localhost:4222

Expected result: Three nodes visible; leader elected; no split or quorum failure indicators.

Halt if: Fewer than three nodes visible, leader absent, or cluster state unhealthy.

5.3 Start Keepalived on Alpha and Beta

Start Keepalived on Alpha and Beta after JetStream cluster health is confirmed.

Example:

systemctl start keepalived
systemctl status keepalived

Expected result: Keepalived active on Alpha and Beta.

5.4 Verify VIP Acquisition

Verify exactly one node holds the VIP.

Example verification:

ip addr show

Expected result: Alpha holds VIP under normal initial conditions; Beta does not.

Halt if: VIP absent from both nodes or present on both nodes.
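
Scanning full ip addr show output by eye is error-prone, so the check can be narrowed to an exact address match. The sketch below parses the one-line-per-address format of ip -o -4 addr show; vip_present is a hypothetical helper name.

```shell
# Reads `ip -o -4 addr show` output on stdin; succeeds if ADDRESS is
# assigned to any interface (prefix length is ignored).
# Usage: ip -o -4 addr show | vip_present ADDRESS
vip_present() {
  awk -v vip="$1" '
    $3 == "inet" { split($4, a, "/"); if (a[1] == vip) found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```

Run it on Alpha and on Beta, for example: ip -o -4 addr show | vip_present "$VIP" && echo "holds VIP". Exactly one node should print the message.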

5.5 Verify Controller Starts Only on Active Node

Verify promotion behavior after VIP acquisition:

  • controller active on VIP holder
  • controller inactive on non-holder
  • no controller on Gamma
  • health endpoint reachable via VIP

Example verification:

systemctl status bsfg-controller
curl -k https://$VIP:9443/health
docker ps | grep bsfg

Expected result: Controller active only on VIP holder; health endpoint succeeds; Gamma has no controller.

Halt if: Controller active on more than one node, inactive on VIP holder, or health endpoint does not respond via VIP.


6. Promotion and Demotion Validation

6.1 Verify Active Node Holds VIP

Confirm the active node currently owns the VIP and that the backup node does not.

Expected result: Single VIP owner.

6.2 Verify Backup Node Does Not Run Controller

Confirm Beta does not run the controller while not holding the VIP.

Expected result: Backup node controller inactive.

6.3 Test Demotion Stops Controller

Trigger controlled demotion by stopping Keepalived on the active node or using the approved administrative failover method.

Verify:

  • VIP removed from former active node
  • former active controller stopped
  • Beta acquires VIP within accepted failover window
  • Beta starts controller only after health gates pass

Example verification:

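systemctl stop keepalived             # on the active node only
ip addr show                          # on both nodes: VIP gone from former active node, present on Beta
systemctl is-active bsfg-controller   # on both nodes: inactive on former active node, active on Beta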

Expected result: No dual-active condition; Beta becomes active within accepted failover window.

Halt if: Dual-active occurs, VIP is orphaned, or controller fails to promote on healthy Beta.
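
To compare the observed failover time with the accepted window, the wait for VIP acquisition on Beta can be timed. The following is a rough sketch with one-second resolution; seconds_until is a hypothetical helper, and the 30-second limit in the usage line is a placeholder for the accepted failover window.

```shell
# Poll CMD once per second until it succeeds or TIMEOUT seconds have passed.
# Prints the elapsed whole seconds on success; returns 1 on timeout.
# Usage: seconds_until TIMEOUT CMD [ARG...]
seconds_until() (
  timeout=$1
  shift
  start=$(date +%s)
  while :; do
    if "$@" >/dev/null 2>&1; then
      echo $(( $(date +%s) - start ))
      exit 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      exit 1
    fi
    sleep 1
  done
)
```

Possible usage on Beta, started just before Keepalived is stopped on the active node and with VIP exported: seconds_until 30 sh -c 'ip -o -4 addr show | grep -qF " $VIP/"'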

6.4 Verify Bind-to-VIP Behavior

On the promoted node, verify the service binds specifically to the VIP and not wildcard interfaces.

Example verification:

ss -tlnp | grep 9443

Expected result: Controller bound to VIP-specific address only.
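
The grep above shows the listener but leaves the judgment to the operator. A stricter check, sketched here under the assumption that the controller listens on TCP 9443 as in the health checks, fails on any listener for that port whose local address is not the VIP (for example 0.0.0.0, * or [::]). bound_only_to is a hypothetical helper name.

```shell
# Reads `ss -tln` output on stdin. Succeeds if at least one LISTEN socket
# exists on PORT and every such socket is bound to ADDRESS.
# Usage: ss -tln | bound_only_to ADDRESS PORT
bound_only_to() {
  awk -v vip="$1" -v port="$2" '
    $1 == "LISTEN" {
      n = length($4) - length(port) - 1        # length of the address part
      if (n > 0 && substr($4, n + 1) == ":" port) {
        seen = 1
        if (substr($4, 1, n) != vip) bad = 1   # wildcard or other address
      }
    }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}
```

Possible usage: ss -tln | bound_only_to "$VIP" 9443 || echo "controller not bound to VIP only"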

6.5 Restore Original Role Layout

Restore the original topology if desired and verify the previously active node rejoins as standby without causing flapping or dual-active behavior.

Expected result: Stable active/backup state restored; no controller on standby node.


7. Backup Setup

7.1 Snapshot Backup Job Placement

Install the approved JetStream snapshot backup script and schedule on the designated node or role.

Expected result: Backup job present, executable, and scheduled.

7.2 Remote Backup Target Configuration

Configure the approved remote target and verify:

  • connectivity
  • authentication
  • write permission
  • retention location correctness

Example verification:

rclone ls remote:bsfg-backup/${ZONE_NAME}/

Expected result: Remote target reachable and writable.

Halt if: Remote authentication fails or target unreachable.
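
Listing the remote confirms connectivity and authentication but not write permission. One way to probe writability, sketched below, is to create and then remove a small marker object with rclone touch and rclone deletefile. The helper name and the probe file name are hypothetical, and the delete step assumes the backup credentials are allowed to remove objects, which may not hold for append-only targets.

```shell
# Create, then remove, an empty probe object under REMOTE_PATH.
# Succeeds only if both the write and the delete succeed.
# Usage: remote_writable REMOTE_PATH
remote_writable() {
  probe="${1%/}/.bsfg-write-probe"   # hypothetical probe name
  rclone touch "$probe" && rclone deletefile "$probe"
}
```

Possible usage: remote_writable "remote:bsfg-backup/${ZONE_NAME}/" || echo "remote target not writable, halt"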

7.3 Schedule Verification

Verify scheduled execution time, identity, and script path.

Expected result: Backup schedule present and correctly targeted.

7.4 Restore Procedure Reference

Place or reference the approved restore procedure document in the zone operations directory.

Expected result: Operators have clear disaster-recovery escalation path and restore reference.


8. Monitoring Bring-Up

8.1 Health Endpoints

Verify JetStream and BSFG health endpoints respond successfully.

Example verification:

curl http://localhost:8222/healthz
curl -k https://$VIP:9443/health

Expected result: Health endpoints return success.

8.2 Keepalived Role Monitoring

Verify MASTER/BACKUP role state is visible to monitoring.

Expected result: Keepalived role transitions observable via logs, exporter, or monitoring script.

8.3 JetStream Quorum Monitoring

Verify monitoring exposes:

  • cluster size
  • leader
  • replica state
  • quorum status

Example verification:

curl http://localhost:8222/jsz

Expected result: Cluster state visible to monitoring.

8.4 Disk and RAID Checks

Verify RAID state and device health are visible through monitoring and host tooling.

Example verification:

cat /proc/mdstat
smartctl -H /dev/nvme0n1

Expected result: Healthy RAID and acceptable device health.

8.5 Certificate Expiry Checks

Verify certificate validity monitoring is installed and alert threshold configured.

Expected result: Expiry horizon visible; alerting threshold enforced.
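
A basic expiry probe can be built on the -checkend option of openssl x509, which succeeds only if the certificate is still valid the given number of seconds from now. The sketch below is one possible shape for such a check; the helper names, the 30-day horizon, and the certificate file name in the usage line are placeholders.

```shell
# Convert whole days to seconds.
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

# Succeeds if CERT_FILE will still be valid DAYS days from now.
# Usage: cert_valid_for DAYS CERT_FILE
cert_valid_for() {
  openssl x509 -in "$2" -noout -checkend "$(days_to_seconds "$1")" >/dev/null
}
```

Possible usage: cert_valid_for 30 /opt/bsfg/certs/example.crt || echo "certificate expires within 30 days"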

8.6 Backup Freshness Checks

Verify a backup freshness signal exists and can alert on stale backup state.

Expected result: Freshness metric or check available to monitoring.
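
If snapshots are staged in a local directory before upload (an assumption; this runbook does not say where they land), freshness can be approximated from file modification times. backup_fresh is a hypothetical helper, and the directory and age limit in the usage line are placeholders.

```shell
# Succeeds if DIR contains at least one regular file modified within the
# last MAX_MINUTES minutes.
# Usage: backup_fresh DIR MAX_MINUTES
backup_fresh() {
  [ -n "$(find "$1" -type f -mmin "-$2" 2>/dev/null | head -n 1)" ]
}
```

Possible usage: backup_fresh /var/backups/jetstream 1500 || echo "backup is stale"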


9. Handoff Criteria

The zone is considered deployed and ready for federation bring-up only when all criteria pass.

| Criterion | Verification | Status |
| --- | --- | --- |
| JetStream quorum healthy | nats server report jetstream shows 3 nodes and elected leader | [ ] |
| VIP stable | Exactly one service-bearing node holds VIP | [ ] |
| Controller failover tested | Controlled failover succeeded without dual-active | [ ] |
| Monitoring live | Health, metrics, and alerts visible | [ ] |
| Backup verified | Backup job installed and remote target validated | [ ] |
| Certificate validity acceptable | Validity horizon meets policy | [ ] |
| No unresolved critical alerts | Zone monitoring clear of critical unresolved conditions | [ ] |
| Artifact mount verified | Artifact storage mounted and available on active node | [ ] |
| Log shipping active | Logs visible in approved aggregation target | [ ] |

Sign-off

| Role | Name | Date | Signature |
| --- | --- | --- | --- |
| Deploying Engineer | | | |
| Platform Lead | | | |

10. Post-Deployment References

| Document | Purpose |
| --- | --- |
| Checklist: Triad-HA Commissioning | Acceptance and drill validation |
| Runbook: Cross-Zone Federation Bring-Up | Establish federation with peer zones |
| Checklist: Cross-Zone Federation Validation | Verify federation guarantees |
| Reference Deployment Pattern: Triad-HA with Keepalived Failover | Architecture reference |
| Reference Interaction Pattern: Cross-Zone BSFG Federation | Cross-zone architecture reference |