Checklist: Triad-HA Commissioning

Purpose

Provide a concise acceptance and commissioning checklist for one deployed Triad-HA zone.

This checklist validates that a zone has been deployed correctly and that its architectural guarantees hold under test.

Scope

This checklist validates:

  • A single Triad-HA zone deployment
  • Deployment correctness against specification
  • Failover behavior under simulated failure
  • Local durability posture
  • Monitoring and readiness posture

This checklist does not validate:

  • Cross-zone federation (see Checklist: Cross-Zone Federation Validation)
  • Application-level functionality

Reference

This checklist validates conformance to the Reference Deployment Pattern: Triad-HA with Keepalived Failover.


1. Preflight Checks

| Check | Method | Expected Result | Status | Notes |
|---|---|---|---|---|
| Hostnames correct | `hostname` on each node | Alpha, Beta, Gamma match naming convention | [ ] | |
| VIP configured in network | `ip addr` on Alpha | VIP present on expected interface | [ ] | |
| Storage mounted | `mountpoint` on each node | JetStream disk at /data/jetstream, artifacts at /artifacts (Alpha/Beta) | [ ] | |
| Certificates present and valid | `openssl x509 -in /opt/bsfg/certs/server.crt -noout -dates -subject` | Not expired, certificate identity (subject/SAN) matches zone identity policy | [ ] | |
| JetStream configs aligned | `diff` or checksum across nodes | Consistent cluster identity and complete route set with correct node-specific fields | [ ] | |
| Keepalived configs aligned | `grep virtual_router_id` on Alpha and Beta | Same VRRP instance identity, correct peer IPs, intentional priority asymmetry | [ ] | |
| Clocks synchronized | `timedatectl status` | NTP synchronized, offset < 100 ms across nodes | [ ] | |
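
The Keepalived alignment check can be scripted rather than compared by eye. The following is a minimal sketch, assuming both nodes' configs have been copied to one host; the helper names `vrid_of` and `same_vrid` are illustrative and not part of any BSFG tooling. It compares only `virtual_router_id`; peer IPs and priorities still need a manual review.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for comparing Keepalived configs from Alpha and Beta.

# Print the first virtual_router_id value found in a Keepalived config file.
vrid_of() {
  awk '$1 == "virtual_router_id" { print $2; exit }' "$1"
}

# Succeed only if both files declare the same, non-empty virtual_router_id.
same_vrid() {
  local a b
  a="$(vrid_of "$1")"
  b="$(vrid_of "$2")"
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Example use, with local copies named `alpha.conf` and `beta.conf` (assumed file names): `same_vrid alpha.conf beta.conf && echo "VRRP instance identity aligned"`.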

2. Steady-State Validation

| Check | Method | Expected Result | Status | Notes |
|---|---|---|---|---|
| Exactly one service-bearing node holds VIP | `ip addr \| grep $VIP` on Alpha and Beta | VIP present on exactly one node | [ ] | |
| Alpha runs controller | `systemctl is-active bsfg-controller` on Alpha | active | [ ] | |
| Beta does not run controller | `systemctl is-active bsfg-controller` on Beta | inactive | [ ] | |
| Durability node runs JetStream only | `docker ps` or `systemctl` on Gamma | JetStream service only, no controller service | [ ] | |
| Cluster quorum healthy | `nats server report jetstream` | 3 nodes, leader elected, no errors | [ ] | |
| Controller bound only to VIP | `ss -tlnp \| grep 9443` on Alpha | Local Address: VIP:9443, not 0.0.0.0 | [ ] | |
| Artifact mount available on active | `mountpoint /artifacts` on Alpha | /artifacts is a mountpoint | [ ] | |
| Monitoring signals present | Check dashboards | JetStream metrics, Keepalived state, controller health visible | [ ] | |
| Log shipping active | Query log aggregator | Zone logs arriving with correct hostname tags | [ ] | |
| RAID status healthy | `cat /proc/mdstat` on Alpha/Beta | [UU] for all arrays, no degraded arrays | [ ] | |
| Certificate expiry acceptable | `openssl x509 -in /opt/bsfg/certs/server.crt -noout -checkend 2592000` | Exit 0 (30+ days remaining) | [ ] | |
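
The "bound only to VIP" check is easy to misread when several listeners are present. As a rough sketch, the hypothetical helper below (`bound_only_to_vip` is not an existing tool) reads `ss -H -tln` output on stdin and fails if any listener on the given port uses an address other than the VIP, or if no listener exists at all. It assumes the IPv4 column layout where the fourth field is the local `address:port`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: validate listener bindings from `ss -H -tln` output.
# Usage: ss -H -tln | bound_only_to_vip VIP PORT
bound_only_to_vip() {
  local vip="$1" port="$2"
  awk -v want="${vip}:${port}" -v suffix=":${port}" '
    {
      addr = $4                      # Local Address:Port column (assumed layout)
      n = length(suffix)
      # Consider only sockets whose local address ends in ":PORT".
      if (length(addr) >= n && substr(addr, length(addr) - n + 1) == suffix) {
        seen++
        if (addr != want) bad++      # e.g. 0.0.0.0:9443 or *:9443
      }
    }
    END {
      if (seen > 0 && bad == 0) exit 0
      exit 1
    }
  '
}
```

On Alpha this could be run as `ss -H -tln | bound_only_to_vip "$VIP" 9443`; a non-zero exit means the controller is either not listening or is bound more widely than the VIP.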

3. Failover Drill

Execute controlled failover and verify behavior.

| Step | Action | Verification | Expected Result | Status | Notes |
|---|---|---|---|---|---|
| 3.1 | Record pre-failover state | Document active node, VIP holder, controller location | Alpha active, VIP on Alpha | [ ] | |
| 3.2 | Force Alpha demotion | `systemctl stop keepalived` on Alpha | Alpha releases VIP | [ ] | |
| 3.3 | Verify VIP moves | `ip addr \| grep $VIP` on Beta within configured window | VIP present on Beta | [ ] | |
| 3.4 | Verify Beta controller starts | `systemctl is-active bsfg-controller` on Beta within SLA | active | [ ] | |
| 3.5 | Verify health gates passed | `journalctl -u bsfg-controller` on Beta | No "promotion denied" messages | [ ] | |
| 3.6 | Verify old controller stopped | `systemctl is-active bsfg-controller` on Alpha | inactive | [ ] | |
| 3.7 | Verify no dual-active | `ss -tlnp \| grep 9443` on both Alpha and Beta | Only Beta shows bound socket | [ ] | |
| 3.8 | Verify service recovery | `curl -k https://$VIP:9443/health` | 200 OK, JSON response | [ ] | |
| 3.9 | Record failover time | Document duration from step 3.2 to 3.8 | Within configured SLA window | [ ] | |
| 3.10 | Restore Alpha | `systemctl start keepalived` on Alpha | Alpha rejoins as BACKUP | [ ] | |
| 3.11 | Verify no flapping | Wait 60s, check VIP location | VIP remains on Beta | [ ] | |
| 3.12 | Verify failback ready | Alpha status BACKUP, ready to promote if needed | Beta remains MASTER | [ ] | |

Drill Sign-off: Failover behavior acceptable: [ ] Yes [ ] No (requires remediation)
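
Step 3.9 asks for the duration between forcing demotion and service recovery. One way to capture it with one-second resolution is sketched below, assuming the health endpoint from step 3.8; `wait_until` and `health_ok` are illustrative names, and the timeout passed in should be replaced with the zone's configured SLA window.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for timing failover recovery.

# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Retry COMMAND about once per second until it succeeds, then print the
# elapsed whole seconds. Return 1 if TIMEOUT_SECONDS passes first.
wait_until() {
  local timeout="$1"; shift
  local start=$SECONDS
  until "$@"; do
    if (( SECONDS - start >= timeout )); then
      return 1
    fi
    sleep 1
  done
  echo $(( SECONDS - start ))
}

# Succeed when the controller health endpoint behind the VIP returns HTTP 200.
# -k skips certificate verification, matching the manual check in step 3.8.
health_ok() {
  [ "$(curl -ks -o /dev/null -w '%{http_code}' --max-time 2 "https://$VIP:9443/health")" = "200" ]
}
```

Run `wait_until 30 health_ok` from an observer host immediately after stopping Keepalived on Alpha; the printed value is the recovery time to record. The 30-second ceiling is a placeholder, not a documented SLA.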


4. Degraded-State Drill (Advanced / Lab-Only)

Warning: These scenarios simulate destructive failure modes. Execute only in lab environments or during scheduled maintenance windows with explicit change control.

Verify behavior under partial failure.

| Scenario | Action | Verification | Expected Result | Status | Notes |
|---|---|---|---|---|---|
| Gamma loss | Stop JetStream on Gamma | `nats server report jetstream` from Alpha | 2 nodes, quorum preserved, no interruption | [ ] | |
| Missing artifact mount | Unmount /artifacts on active, attempt failover | Promote to Beta (no artifact mount) | Promotion denied, alert generated | [ ] | |
| Trust/identity failure | Use certificate with wrong SAN or untrusted CA on Beta, attempt failover | Promotion with identity mismatch | mTLS handshake failure, promotion denied | [ ] | |
| Near-expiry certificate per policy | Set clock forward to within policy threshold (test only) | Certificate check behavior | Matches explicit policy: promotion blocked OR proceeds with warning per documented threshold | [ ] | |
| JetStream unhealthy | Stop JetStream on active node, attempt failover | `systemctl stop jetstream`, then failover | Health gate fails, promotion denied | [ ] | |
| Network partition (simulated) | Block VRRP traffic between Alpha/Beta | Keepalived state and controller binding | VIP remains on one node only; no dual-active service binding; health gates prevent promotion to poisoned node | [ ] | |

Degraded-State Sign-off: Zone handles partial failures correctly: [ ] Yes [ ] No (requires remediation)
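
The Gamma-loss scenario expects quorum to survive with two of three nodes. The arithmetic behind that expectation is the usual majority rule for Raft-style clusters, sketched here with two illustrative helpers (`quorum_size` and `has_quorum` are not part of the NATS CLI).

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating majority quorum for an N-node cluster.

# quorum_size CLUSTER_SIZE: print the smallest majority, floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum CLUSTER_SIZE HEALTHY_NODES: succeed if the healthy nodes form a majority.
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}
```

For the three-node triad, `quorum_size 3` gives 2, so losing Gamma alone leaves quorum intact, while losing any second node does not.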


5. Durability and Backup Checks

| Check | Method | Expected Result | Status | Notes |
|---|---|---|---|---|
| Snapshot backup completed | `ls -la /backup/` or remote target | Backup present, timestamp recent | [ ] | |
| Remote sync verified | `rclone ls remote:bsfg-backup/$ZONE_NAME/` | Backup files present on remote | [ ] | |
| Restore procedure reference exists | `ls /opt/bsfg/docs/restore-procedure.md` | Document present | [ ] | |
| RAID degradation alert tested | `mdadm --fail` (test disk, then re-add) | Alert fires, RAID recovers | [ ] | |
| Disk health visible | SMART data in monitoring | NVMe health metrics present | [ ] | |
| Backup freshness monitoring active | Check monitoring dashboard or Alertmanager | Freshness alert configured; fires when backup age exceeds policy threshold | [ ] | |
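
"Timestamp recent" can be made concrete with a small age check. The sketch below assumes backups land as regular files under a directory such as /backup; `backup_is_fresh` is a hypothetical helper, and the age limit passed to it must come from the zone's backup policy rather than from this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: check that a backup directory holds a recent file.

# backup_is_fresh DIR MAX_AGE_MINUTES
# Succeed if DIR contains at least one regular file modified less than
# MAX_AGE_MINUTES minutes ago.
backup_is_fresh() {
  local dir="$1" max_age_min="$2"
  [ -n "$(find "$dir" -type f -mmin "-${max_age_min}" -print 2>/dev/null | head -n 1)" ]
}
```

For example, `backup_is_fresh /backup 1440 || echo "backup older than 24h"` flags a directory with no file newer than one day; the 1440-minute value is only an illustration.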

6. Acceptance Gate

6.1 Summary

| Category | Checks Passed | Checks Failed | Waived |
|---|---|---|---|
| Preflight | | | |
| Steady-State | | | |
| Failover Drill | | | |
| Degraded-State | | | |
| Durability/Backup | | | |

6.2 Overall Status

Select one:

  • Passed — All critical checks passed, zone ready for federation bring-up
  • Passed with Exception — Minor issues documented and waived by Platform Lead
  • Failed — Critical checks failed, requires remediation before handoff
  • Requires Escalation — Uncertainty or blocker requiring architecture review

6.3 Exceptions and Waivers

| Check | Exception | Justification | Approved By | Date |
|---|---|---|---|---|
| | | | | |

6.4 Sign-off

| Role | Name | Date | Signature |
|---|---|---|---|
| Commissioning Engineer | | | |
| Platform Lead | | | |
| Security/Compliance (if required) | | | |

7. Post-Commissioning Reference

| Document | Purpose |
|---|---|
| Runbook: Cross-Zone Federation Bring-Up | Next step: establish federation |
| Checklist: Cross-Zone Federation Validation | Verify federation guarantees |
| Reference Deployment Pattern: Triad-HA | Architecture reference |
| Reference Interaction Pattern: Cross-Zone BSFG Federation | Cross-zone architecture reference |