Runbook: Triad-HA Zone Deployment
Purpose
Operationalize the deployment of one BSFG zone using the Triad-HA deployment pattern.
This runbook translates the reference architecture into ordered operator actions for preparing, installing, bootstrapping, and validating a single autonomous zone.
Scope
This runbook covers:
- One BSFG zone using Alpha/Beta/Gamma node roles
- JetStream cluster bring-up
- Keepalived VIP management setup
- BSFG controller promotion and demotion behavior
- Initial validation and handoff
This runbook does not cover:
- Cross-zone federation establishment
- Multi-zone operational procedures
- Ongoing operational recovery beyond initial deployment and validation
Reference
This runbook operationalizes the Reference Deployment Pattern: Triad-HA with Keepalived Failover.
1. Preconditions
Verify all before proceeding. Halt if any precondition fails.
| Check | Method | Expected Result |
|---|---|---|
| Host inventory approved | CMDB, asset register, or change ticket | Three hosts assigned to Alpha, Beta, Gamma roles |
| IP assignments fixed | Network allocation sheet | Alpha, Beta, Gamma IPs reserved; VIP reserved |
| Storage devices present | Hardware inventory or lsblk | Required NVMe and RAID-backed disks physically present |
| Certificates issued | openssl x509 -in /path/to/cert -noout -dates -subject | Valid, not expired, identity matches zone policy |
| Time sync healthy | timedatectl status or equivalent | NTP synchronized, no material clock drift |
| Firewall rules in place | Network policy ticket or host firewall verification | Required ports configured per Triad-HA pattern |
| Secrets and configs rendered | File presence and checksum verification | /opt/bsfg/ contains rendered configs and secrets |
| Operator access established | SSH verification | Login succeeds to all three nodes |
Halt if any precondition is not satisfied.
2. Inputs Required
| Input | Description | Example |
|---|---|---|
| ZONE_NAME | Zone identifier | enterprise, plant-a |
| ALPHA_HOSTNAME | Primary service-bearing node | enterprise-alpha |
| BETA_HOSTNAME | Secondary service-bearing node | enterprise-beta |
| GAMMA_HOSTNAME | Lightweight persistence node | enterprise-gamma |
| VIP | Floating virtual IP | 10.1.1.10 |
| PEER_ENDPOINTS | Other zone endpoints for later federation | 10.2.1.10:9443,10.3.1.10:9443 |
| CERT_PATH | mTLS certificate directory | /opt/bsfg/certs/ |
| JETSTREAM_CONFIG_PATH | NATS config path | /opt/bsfg/jetstream.conf |
| ARTIFACT_MOUNT | Artifact storage mount path | /artifacts |
| ALERT_ENDPOINT | Monitoring or alerting destination | alertmanager:9093 |
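Later steps reference several of these inputs as shell variables. One way to carry them through a session is to export them up front; the values below are the illustrative examples from the table, not real allocations.

```shell
# Illustrative values from the inputs table; substitute the zone's real allocations.
export ZONE_NAME="enterprise"
export ALPHA_HOSTNAME="enterprise-alpha"
export BETA_HOSTNAME="enterprise-beta"
export GAMMA_HOSTNAME="enterprise-gamma"
export VIP="10.1.1.10"
export PEER_ENDPOINTS="10.2.1.10:9443,10.3.1.10:9443"
export CERT_PATH="/opt/bsfg/certs/"
export JETSTREAM_CONFIG_PATH="/opt/bsfg/jetstream.conf"
export ARTIFACT_MOUNT="/artifacts"
export ALERT_ENDPOINT="alertmanager:9093"

echo "Deploying zone ${ZONE_NAME} with VIP ${VIP}"
```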
3. Host Preparation
Execute on all three nodes unless otherwise noted.
3.1 OS Baseline
Verify the operating system is within the supported matrix and fully patched.
Example verification:
source /etc/os-release
echo "$PRETTY_NAME"
Apply current security and package updates using the approved package manager for the host OS.
Expected result: Supported OS version, current security baseline applied, reboot completed if required.
Halt if: OS is outside supported matrix or required reboot is deferred.
3.2 Package Installation
Install required host packages and verify service availability.
Required capabilities include:
- container runtime and Compose support
- Keepalived
- TLS tooling
- monitoring/logging agent prerequisites
- disk and RAID tooling
Example verification:
docker --version
docker compose version
keepalived --version
openssl version
Expected result: Required packages installed; Docker daemon running; Compose available.
Halt if: Container runtime fails to start, Compose unavailable, or package versions fall outside supported baseline.
3.3 Filesystem Creation and Mount
Prepare the dedicated JetStream disk on each node.
Required outcomes:
- dedicated filesystem created on intended JetStream device
- mounted at /data/jetstream
- mount options aligned with policy
- device identity recorded
Example verification:
lsblk -f
mountpoint /data/jetstream
findmnt /data/jetstream
Expected result: Dedicated JetStream volume mounted at /data/jetstream, formatted per policy, and verified.
Halt if: Wrong device selected, mount fails, or filesystem integrity is in doubt.
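The creation steps can be sketched as below. Because they are destructive on the named device, this sketch only prints the commands unless APPLY=1 is set; the device path and ext4 filesystem are assumptions, and the real device and filesystem type must come from the allocation sheet and policy.

```shell
#!/bin/sh
# DESTRUCTIVE on ${DEVICE} when APPLY=1; defaults to printing the commands.
# Device path and ext4 are illustrative assumptions.
DEVICE="/dev/nvme1n1"

run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

run mkfs.ext4 -L jetstream "${DEVICE}"
run mkdir -p /data/jetstream
run mount -o defaults,noatime "${DEVICE}" /data/jetstream
run blkid "${DEVICE}"    # record the device identity (UUID) for the persistent mount
```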
3.4 RAID Verification (Alpha and Beta)
Verify RAID-backed volumes for system and artifact storage are present and healthy.
Example verification:
cat /proc/mdstat
mountpoint /artifacts
Expected result: RAID arrays healthy, no degraded state, artifact mount present on Alpha/Beta.
Halt if: RAID degraded, missing, or rebuild in progress without explicit approval.
3.5 Mount Persistence
Ensure all required mounts survive reboot and reload cleanly through persistent mount configuration.
Example verification:
mount -a
mountpoint /data/jetstream
Expected result: Required mounts restore successfully from persistent configuration.
Halt if: Persistent mount configuration fails validation.
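The persistent mount is typically an /etc/fstab entry of the following shape. The UUID is a placeholder (obtain the real one with blkid), and the filesystem type and mount options shown are assumptions that must follow zone policy.

```
# /etc/fstab — UUID is a placeholder; use `blkid` on the JetStream device
UUID=<jetstream-device-uuid>  /data/jetstream  ext4  defaults,noatime  0  2
```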
3.6 Log Agent Installation
Install and enable the approved node-level log shipping agent.
Expected result: Log agent installed, enabled, and configured to forward to the approved aggregation target.
Halt if: Agent cannot start, cannot reach aggregation target, or configuration is missing.
3.7 Time Sync Verification
Verify clock synchronization and acceptable offset on all nodes.
Example verification:
timedatectl status
chronyc tracking
Expected result: NTP synchronized; clock offset within approved threshold.
Halt if: Clock skew exceeds policy threshold or time sync is disabled.
4. Service Installation
4.1 JetStream Configuration Placement
On all nodes, place the rendered JetStream configuration in the approved location and verify:
- correct zone or cluster identity
- full route list for Alpha, Beta, Gamma
- correct monitoring and cluster ports
- correct data directory mapping
Example verification:
grep -n "cluster" /opt/bsfg/jetstream.conf
grep -n "routes" /opt/bsfg/jetstream.conf
Expected result: Configuration present, zone identity correct, all three node routes present.
Halt if: Cluster identity mismatched or route list incomplete.
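The expected shape of the configuration can be sketched as follows, assuming the example hostnames from the inputs table and default NATS ports; the rendered file placed by the config pipeline is authoritative.

```
server_name: enterprise-alpha          # unique per node
http: 8222                             # monitoring endpoint

jetstream {
  store_dir: /data/jetstream
}

cluster {
  name: enterprise                     # must match the zone's cluster identity
  listen: 0.0.0.0:6222
  routes [
    nats-route://enterprise-alpha:6222
    nats-route://enterprise-beta:6222
    nats-route://enterprise-gamma:6222
  ]
}
```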
4.2 Keepalived Configuration (Alpha and Beta)
Install the rendered Keepalived configuration on Alpha and Beta.
Verify:
- Alpha and Beta share the same VRRP instance identity
- Alpha has higher priority
- unicast peers are correct
- notify scripts and health-tracking scripts are present
- VIP configured correctly
Example verification:
grep -n "virtual_router_id" /etc/keepalived/keepalived.conf
grep -n "priority" /etc/keepalived/keepalived.conf
grep -n "unicast_peer" -A 2 /etc/keepalived/keepalived.conf
Expected result: Keepalived configuration present and consistent across Alpha/Beta with correct priority asymmetry.
Halt if: Peer IPs, VRRP identity, notify hooks, or VIP settings are incorrect.
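A minimal sketch of the expected shape is shown below, assuming VRRP ID 51, priority 150 on Alpha versus 100 on Beta, illustrative node IPs, and hypothetical notify script paths; the rendered configuration is authoritative.

```
vrrp_instance BSFG_VIP {
    state MASTER                  # BACKUP on Beta
    interface eth0                # substitute the real interface
    virtual_router_id 51          # identical on Alpha and Beta
    priority 150                  # lower (e.g. 100) on Beta
    unicast_src_ip 10.1.1.2       # this node's IP (illustrative)
    unicast_peer {
        10.1.1.3                  # the other service-bearing node (illustrative)
    }
    virtual_ipaddress {
        10.1.1.10                 # the VIP
    }
    notify_master /opt/bsfg/bin/promote.sh   # hypothetical hook paths
    notify_backup /opt/bsfg/bin/demote.sh
    notify_fault  /opt/bsfg/bin/demote.sh
}
```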
4.3 BSFG Controller Deployment Artifact Placement
On Alpha and Beta:
- place controller deployment manifest
- place certificates and CA material
- verify file ownership and permissions
- verify image/tag selection
- verify bind address configured to VIP, not wildcard address
Example verification:
ls -l /opt/bsfg/certs
grep -n "image:" /opt/bsfg/docker-compose.yml
grep -n "BIND_ADDRESS" /opt/bsfg/docker-compose.yml
Expected result: Certificates present with restricted permissions; deployment manifest references approved image; controller configured to bind to VIP.
Halt if: Private key permissions are too broad, image reference missing, or bind address is not VIP-specific.
4.4 Systemd Unit and Slice Installation
Install systemd slices and services required for:
- JetStream
- BSFG controller
- optional monitoring helpers
Reload systemd and enable only the services that should start independently.
Critical rule: the BSFG controller must not be allowed to auto-start independently of promotion flow.
Required state:
- JetStream enabled
- Keepalived enabled on Alpha/Beta
- BSFG controller installed but not enabled for independent boot start
- BSFG controller started only via promotion scripts or controlled orchestration
Example verification:
systemctl daemon-reload
systemctl is-enabled jetstream.service
systemctl is-enabled keepalived
systemctl is-enabled bsfg-controller.service
Expected result: JetStream enabled; Keepalived enabled where applicable; BSFG controller installed but not independently enabled.
Halt if: BSFG controller is configured to auto-start outside promotion flow.
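One way to satisfy the critical rule is a controller unit with no [Install] section, so `systemctl enable` cannot make it boot-start and the Keepalived notify scripts remain the only start path. The sketch below assumes a Compose-based controller and hypothetical paths.

```
# /etc/systemd/system/bsfg-controller.service (sketch; paths are illustrative)
[Unit]
Description=BSFG controller (started only via promotion scripts)
After=network-online.target docker.service
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/docker compose -f /opt/bsfg/docker-compose.yml up -d bsfg-controller
ExecStop=/usr/bin/docker compose -f /opt/bsfg/docker-compose.yml stop bsfg-controller

# Intentionally no [Install] section: enabling for boot start fails,
# which keeps the controller inside the promotion flow.
```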
5. Bootstrap Sequence
5.1 Start JetStream on All Nodes
Start JetStream on all three nodes. Gamma-first is conventional, but any order is acceptable if cluster formation is verified afterward.
Example:
docker compose -f /opt/bsfg/docker-compose.yml up -d jetstream
Expected result: JetStream container or service running on all three nodes.
Example verification:
docker ps | grep jetstream
5.2 Verify Cluster Formation
After startup, verify:
- all three nodes visible
- leader elected
- cluster identity correct
- no route or quorum errors
Example verification:
nats server report jetstream --server nats://localhost:4222
Expected result: Three nodes visible; leader elected; no split or quorum failure indicators.
Halt if: Fewer than three nodes visible, leader absent, or cluster state unhealthy.
5.3 Start Keepalived on Alpha and Beta
Start Keepalived on Alpha and Beta after JetStream cluster health is confirmed.
Example:
systemctl start keepalived
systemctl status keepalived
Expected result: Keepalived active on Alpha and Beta.
5.4 Verify VIP Acquisition
Verify exactly one node holds the VIP.
Example verification:
ip addr show
Expected result: Alpha holds VIP under normal initial conditions; Beta does not.
Halt if: VIP absent from both nodes or present on both nodes.
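The single-owner check can be scripted by counting VIP occurrences across the nodes' address listings. The sketch below simulates the concatenated `ip -o addr show` output from Alpha and Beta; in the drill, substitute the captured output from both nodes.

```shell
#!/bin/sh
# Hedged sketch: confirm exactly one node holds the VIP.
# The sample lines simulate per-node `ip -o addr show` output
# (Alpha holds the VIP, Beta does not).
VIP="10.1.1.10"

count=$(printf '%s\n' \
  "alpha: 2: eth0    inet 10.1.1.10/24 scope global eth0" \
  "beta:  2: eth0    inet 10.1.1.3/24 scope global eth0" \
  | grep -c "inet ${VIP}/")

if [ "${count}" -eq 1 ]; then
  echo "OK: single VIP owner"
else
  echo "HALT: VIP owner count is ${count}"
fi
```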
5.5 Verify Controller Starts Only on Active Node
Verify promotion behavior after VIP acquisition:
- controller active on VIP holder
- controller inactive on non-holder
- no controller on Gamma
- health endpoint reachable via VIP
Example verification:
systemctl status bsfg-controller
curl -k https://$VIP:9443/health
docker ps | grep bsfg
Expected result: Controller active only on VIP holder; health endpoint succeeds; Gamma has no controller.
Halt if: Controller is active on more than one node, inactive on the VIP holder, or the health endpoint is unreachable via the VIP.
6. Promotion and Demotion Validation
6.1 Verify Active Node Holds VIP
Confirm the active node currently owns the VIP and that the backup node does not.
Expected result: Single VIP owner.
6.2 Verify Backup Node Does Not Run Controller
Confirm Beta does not run the controller while not holding the VIP.
Expected result: Backup node controller inactive.
6.3 Test Demotion Stops Controller
Trigger controlled demotion by stopping Keepalived on the active node or using the approved administrative failover method.
Verify:
- VIP removed from former active node
- former active controller stopped
- Beta acquires VIP within accepted failover window
- Beta starts controller only after health gates pass
Example verification:
systemctl stop keepalived
ip addr show
systemctl is-active bsfg-controller
Expected result: No dual-active condition; Beta becomes active within accepted failover window.
Halt if: Dual-active occurs, VIP is orphaned, or controller fails to promote on healthy Beta.
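Bounding the accepted failover window can be done with a simple polling helper. `wait_for` below is a hypothetical utility; in the drill, the check command would be something like `ip -o addr show | grep -q "inet ${VIP}/"` run on Beta, with the timeout set to the accepted window.

```shell
#!/bin/sh
# Hedged sketch: poll a check command until it succeeds or a timeout expires.
wait_for() {
  # $1 = timeout in seconds; remaining args = check command
  timeout=$1; shift
  elapsed=0
  while ! "$@"; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Demonstration with a trivially true check; substitute the VIP grep in the drill.
if wait_for 5 true; then
  echo "check passed within window"
else
  echo "HALT: failover window exceeded"
fi
```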
6.4 Verify Bind-to-VIP Behavior
On the promoted node, verify the service binds specifically to the VIP and not wildcard interfaces.
Example verification:
ss -tlnp | grep 9443
Expected result: Controller bound to VIP-specific address only.
6.5 Restore Original Role Layout
Restore the original topology if desired and verify the previously active node rejoins as standby without causing flapping or dual-active behavior.
Expected result: Stable active/backup state restored; no controller on standby node.
7. Backup Setup
7.1 Snapshot Backup Job Placement
Install the approved JetStream snapshot backup script and schedule on the designated node or role.
Expected result: Backup job present, executable, and scheduled.
7.2 Remote Backup Target Configuration
Configure the approved remote target and verify:
- connectivity
- authentication
- write permission
- retention location correctness
Example verification:
rclone ls remote:bsfg-backup/${ZONE_NAME}/
Expected result: Remote target reachable and writable.
Halt if: Remote authentication fails or target unreachable.
7.3 Schedule Verification
Verify scheduled execution time, identity, and script path.
Expected result: Backup schedule present and correctly targeted.
7.4 Restore Procedure Reference
Place or reference the approved restore procedure document in the zone operations directory.
Expected result: Operators have clear disaster-recovery escalation path and restore reference.
8. Monitoring Bring-Up
8.1 Health Endpoints
Verify JetStream and BSFG health endpoints respond successfully.
Example verification:
curl http://localhost:8222/healthz
curl -k https://$VIP:9443/health
Expected result: Health endpoints return success.
8.2 Keepalived Role Monitoring
Verify MASTER/BACKUP role state is visible to monitoring.
Expected result: Keepalived role transitions observable via logs, exporter, or monitoring script.
8.3 JetStream Quorum Monitoring
Verify monitoring exposes:
- cluster size
- leader
- replica state
- quorum status
Example verification:
curl http://localhost:8222/jsz
Expected result: Cluster state visible to monitoring.
8.4 Disk and RAID Checks
Verify RAID state and device health are visible through monitoring and host tooling.
Example verification:
cat /proc/mdstat
smartctl -H /dev/nvme0n1
Expected result: Healthy RAID and acceptable device health.
8.5 Certificate Expiry Checks
Verify certificate validity monitoring is installed and alert threshold configured.
Expected result: Expiry horizon visible; alerting threshold enforced.
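A host-level check can use `openssl x509 -checkend`, which succeeds only while the certificate remains valid beyond the given number of seconds. The sketch below generates a throwaway self-signed certificate so it runs standalone; in the zone, point CERT at the real leaf under /opt/bsfg/certs/ and wire the ALERT branch to the approved alerting path.

```shell
#!/bin/sh
# Hedged sketch: alert when a certificate expires within THRESHOLD_DAYS.
set -eu
THRESHOLD_DAYS=30
workdir=$(mktemp -d)
CERT="${workdir}/cert.pem"

# Stand-in certificate valid for 90 days (demo only).
openssl req -x509 -newkey rsa:2048 -nodes -days 90 -subj "/CN=demo" \
  -keyout "${workdir}/key.pem" -out "${CERT}" 2>/dev/null

if openssl x509 -in "${CERT}" -noout -checkend $((THRESHOLD_DAYS * 86400)) >/dev/null; then
  result="OK"
  echo "OK: certificate valid beyond ${THRESHOLD_DAYS} days"
else
  result="ALERT"
  echo "ALERT: certificate expires within ${THRESHOLD_DAYS} days"
fi
rm -rf "${workdir}"
```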
8.6 Backup Freshness Checks
Verify a backup freshness signal exists and can alert on stale backup state.
Expected result: Freshness metric or check available to monitoring.
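A minimal freshness signal is the age of the newest file under the backup target. The sketch below simulates the target with a temporary directory so it runs standalone; in the zone, point BACKUP_DIR at the real snapshot location and wire the STALE branch to alerting.

```shell
#!/bin/sh
# Hedged sketch: alert when no backup file is newer than MAX_AGE_HOURS.
set -eu
MAX_AGE_HOURS=24
BACKUP_DIR=$(mktemp -d)                   # stand-in for the real snapshot target
touch "${BACKUP_DIR}/snapshot-demo.tar"   # simulate a just-written snapshot

# Any file modified within the window counts as fresh.
fresh=$(find "${BACKUP_DIR}" -type f -mmin "-$((MAX_AGE_HOURS * 60))" | head -n 1)
if [ -n "${fresh}" ]; then
  result="OK"
  echo "OK: fresh backup present"
else
  result="STALE"
  echo "ALERT: no backup newer than ${MAX_AGE_HOURS}h"
fi
rm -rf "${BACKUP_DIR}"
```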
9. Handoff Criteria
The zone is considered deployed and ready for federation bring-up only when all criteria pass.
| Criterion | Verification | Status |
|---|---|---|
| JetStream quorum healthy | nats server report jetstream shows 3 nodes and elected leader | [ ] |
| VIP stable | Exactly one service-bearing node holds VIP | [ ] |
| Controller failover tested | Controlled failover succeeded without dual-active | [ ] |
| Monitoring live | Health, metrics, and alerts visible | [ ] |
| Backup verified | Backup job installed and remote target validated | [ ] |
| Certificate validity acceptable | Validity horizon meets policy | [ ] |
| No unresolved critical alerts | Zone monitoring clear of critical unresolved conditions | [ ] |
| Artifact mount verified | Artifact storage mounted and available on active node | [ ] |
| Log shipping active | Logs visible in approved aggregation target | [ ] |
Sign-off
| Role | Name | Date | Signature |
|---|---|---|---|
| Deploying Engineer | | | |
| Platform Lead | | | |
10. Post-Deployment References
| Document | Purpose |
|---|---|
| Checklist: Triad-HA Commissioning | Acceptance and drill validation |
| Runbook: Cross-Zone Federation Bring-Up | Establish federation with peer zones |
| Checklist: Cross-Zone Federation Validation | Verify federation guarantees |
| Reference Deployment Pattern: Triad-HA with Keepalived Failover | Architecture reference |
| Reference Interaction Pattern: Cross-Zone BSFG Federation | Cross-zone architecture reference |