Checklist: Triad-HA Commissioning
Purpose
Provide a concise acceptance and commissioning checklist for one deployed Triad-HA zone.
This checklist validates that a zone has been deployed correctly and that its architectural guarantees hold under test.
Scope
This checklist validates:
- A single Triad-HA zone deployment
- Deployment correctness against specification
- Failover behavior under simulated failure
- Local durability posture
- Monitoring and readiness posture
This checklist does not validate:
- Cross-zone federation (see Checklist: Cross-Zone Federation Validation)
- Application-level functionality
Reference
This checklist validates conformance to the Reference Deployment Pattern: Triad-HA with Keepalived Failover.
1. Preflight Checks
| Check | Method | Expected Result | Status | Notes |
|---|---|---|---|---|
| Hostnames correct | `hostname` on each node | Alpha, Beta, Gamma match naming convention | [ ] | |
| VIP configured in network | `ip addr` on Alpha | VIP present on expected interface | [ ] | |
| Storage mounted | `mountpoint` on each node | JetStream disk at /data/jetstream, artifacts at /artifacts (Alpha/Beta) | [ ] | |
| Certificates present and valid | `openssl x509 -in /opt/bsfg/certs/server.crt -noout -dates -subject` | Not expired, certificate identity (subject/SAN) matches zone identity policy | [ ] | |
| JetStream configs aligned | `diff` or checksum across nodes | Consistent cluster identity and complete route set with correct node-specific fields | [ ] | |
| Keepalived configs aligned | `grep virtual_router_id` on Alpha and Beta | Same VRRP instance identity, correct peer IPs, intentional priority asymmetry | [ ] | |
| Clocks synchronized | `timedatectl status` | NTP synchronized, offset < 100 ms across nodes | [ ] | |
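
The Keepalived alignment check can be scripted rather than compared by eye. The sketch below assumes each node's keepalived.conf has been copied to a workstation; the file names in the usage comment are hypothetical, and the function names `vrids_of` and `vrids_match` are illustrative, not part of any BSFG tooling.

```shell
# Print the virtual_router_id values declared in a keepalived.conf-style
# file, one per line, sorted and de-duplicated. Commented-out lines are
# skipped because their first field is "#", not the keyword.
vrids_of() {
    awk '$1 == "virtual_router_id" { print $2 }' "$1" | sort -u
}

# Succeed only if both files declare the same, non-empty set of VRIDs.
vrids_match() {
    a=$(vrids_of "$1")
    b=$(vrids_of "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Hypothetical usage:
#   vrids_match alpha-keepalived.conf beta-keepalived.conf && echo "VRID aligned"
```

This covers only the VRRP instance identity. Peer IPs and the priority asymmetry still need a manual look, since those fields are expected to differ between Alpha and Beta.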
2. Steady-State Validation
| Check | Method | Expected Result | Status | Notes |
|---|---|---|---|---|
| Exactly one service-bearing node holds VIP | `ip addr \| grep $VIP` on Alpha and Beta | VIP present on exactly one node | [ ] | |
| Alpha runs controller | `systemctl is-active bsfg-controller` on Alpha | active | [ ] | |
| Beta does not run controller | `systemctl is-active bsfg-controller` on Beta | inactive | [ ] | |
| Durability node runs JetStream only | `docker ps` or `systemctl` on Gamma | JetStream service only, no controller service | [ ] | |
| Cluster quorum healthy | `nats server report jetstream` | 3 nodes, leader elected, no errors | [ ] | |
| Controller bound only to VIP | `ss -tlnp \| grep 9443` on Alpha | Local Address: VIP:9443, not 0.0.0.0 | [ ] | |
| Artifact mount available on active | `mountpoint /artifacts` on Alpha | /artifacts is a mountpoint | [ ] | |
| Monitoring signals present | Check dashboards | JetStream metrics, Keepalived state, controller health visible | [ ] | |
| Log shipping active | Query log aggregator | Zone logs arriving with correct hostname tags | [ ] | |
| RAID status healthy | `cat /proc/mdstat` on Alpha/Beta | [UU] for all arrays, no degraded | [ ] | |
| Certificate expiry acceptable | `openssl x509 -checkend 2592000` | Exit 0 (30+ days remaining) | [ ] | |
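
A bare `grep $VIP` matches substrings, so a VIP of 10.0.0.1 would also match 10.0.0.10. The following sketch shows one way to make the VIP and socket-binding checks exact. The function names `holds_vip` and `listens_on` are illustrative, and the input is assumed to be the output of `ip -o -4 addr show` and `ss -tln` respectively.

```shell
# Succeed if the `ip -o -4 addr show` output on stdin contains exactly the
# given IPv4 address. The address is compared after stripping the /prefix,
# so 10.0.0.1 does not match 10.0.0.10.
holds_vip() {
    awk -v want="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i == "inet") {
                    split($(i + 1), a, "/")
                    if (a[1] == want) found = 1
                }
        }
        END { exit found ? 0 : 1 }'
}

# Succeed if the `ss -tln` output on stdin shows a LISTEN socket whose local
# address is exactly <addr>:<port>. A wildcard bind such as 0.0.0.0:<port>
# does not match.
listens_on() {
    awk -v want="$1:$2" '
        $1 == "LISTEN" && $4 == want { found = 1 }
        END { exit found ? 0 : 1 }'
}

# Usage on a node, assuming $VIP is set in the environment:
#   ip -o -4 addr show | holds_vip "$VIP"
#   ss -tln | listens_on "$VIP" 9443
```

Running `holds_vip` on both Alpha and Beta and confirming that exactly one succeeds corresponds to the first row of the table; `listens_on` corresponds to the "Controller bound only to VIP" row.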
3. Failover Drill
Execute controlled failover and verify behavior.
| Step | Action | Verification | Expected Result | Status | Notes |
|---|---|---|---|---|---|
| 3.1 | Record pre-failover state | Document active node, VIP holder, controller location | Alpha active, VIP on Alpha | [ ] | |
| 3.2 | Force Alpha demotion | `systemctl stop keepalived` on Alpha | Alpha releases VIP | [ ] | |
| 3.3 | Verify VIP moves | `ip addr \| grep $VIP` on Beta within configured window | VIP present on Beta | [ ] | |
| 3.4 | Verify Beta controller starts | `systemctl is-active bsfg-controller` on Beta within SLA | active | [ ] | |
| 3.5 | Verify health gates passed | `journalctl -u bsfg-controller` on Beta | No "promotion denied" messages | [ ] | |
| 3.6 | Verify old controller stopped | `systemctl is-active bsfg-controller` on Alpha | inactive | [ ] | |
| 3.7 | Verify no dual-active | `ss -tlnp \| grep 9443` on both Alpha and Beta | Only Beta shows bound socket | [ ] | |
| 3.8 | Verify service recovery | `curl -k https://$VIP:9443/health` | 200 OK, JSON response | [ ] | |
| 3.9 | Record failover time | Document duration from step 3.2 to 3.8 | Within configured SLA window | [ ] | |
| 3.10 | Restore Alpha | `systemctl start keepalived` on Alpha | Alpha rejoins as BACKUP | [ ] | |
| 3.11 | Verify no flapping | Wait 60s, check VIP location | VIP remains on Beta | [ ] | |
| 3.12 | Verify failback ready | Alpha status BACKUP, ready to promote if needed | Beta remains MASTER | [ ] | |
Drill Sign-off: Failover behavior acceptable: [ ] Yes [ ] No (requires remediation)
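
Steps 3.3, 3.4, and 3.9 all involve waiting for a condition and recording how long it took. A small polling helper makes that repeatable. This is a minimal sketch: the name `wait_until` and the variable `SLA_SECONDS` are hypothetical, and the one-second poll interval is an arbitrary choice that bounds the timing resolution.

```shell
# Poll a command once per second until it succeeds or the timeout (in
# seconds) elapses. On success, print the elapsed whole seconds and return 0.
# On timeout, return 1 and print nothing.
wait_until() {
    timeout_s=$1
    shift
    start=$(date +%s)
    while :; do
        if "$@" >/dev/null 2>&1; then
            echo $(( $(date +%s) - start ))
            return 0
        fi
        if [ $(( $(date +%s) - start )) -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
    done
}

# Hypothetical usage right after step 3.2, assuming $VIP and $SLA_SECONDS
# are set; -f makes curl fail on HTTP error statuses:
#   wait_until "$SLA_SECONDS" curl -ksf "https://$VIP:9443/health" \
#       && echo "recovered within SLA" || echo "SLA exceeded"
```

The printed value is the time from the start of polling, so start the helper immediately after stopping Keepalived on Alpha if the figure is to be recorded for step 3.9.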
4. Degraded-State Drill (Advanced / Lab-Only)
Warning: These scenarios simulate destructive failure modes. Execute only in lab environments or during scheduled maintenance windows with explicit change control.
Verify behavior under partial failure.
| Scenario | Action | Verification | Expected Result | Status | Notes |
|---|---|---|---|---|---|
| Gamma loss | Stop JetStream on Gamma | `nats server report jetstream` from Alpha | 2 nodes, quorum preserved, no interruption | [ ] | |
| Missing artifact mount | Unmount /artifacts on active, attempt failover | Promote to Beta (no artifact mount) | Promotion denied, alert generated | [ ] | |
| Trust/identity failure | Use certificate with wrong SAN or untrusted CA on Beta, attempt failover | Promotion with identity mismatch | mTLS handshake failure, promotion denied | [ ] | |
| Near-expiry certificate per policy | Set clock forward to within policy threshold (test only) | Certificate check behavior | Matches explicit policy: promotion blocked OR proceeds with warning per documented threshold | [ ] | |
| JetStream unhealthy | Stop JetStream on active node, attempt failover | `systemctl stop jetstream`, then failover | Health gate fails, promotion denied | [ ] | |
| Network partition (simulated) | Block VRRP traffic between Alpha/Beta | Keepalived state and controller binding | VIP remains on one node only; no dual-active service binding; health gates prevent promotion to poisoned node | [ ] | |
Degraded-State Sign-off: Zone handles partial failures correctly: [ ] Yes [ ] No (requires remediation)
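
The Gamma-loss scenario expects quorum to survive because a three-node cluster needs only a strict majority of its members, that is two. The arithmetic can be sketched as follows; `has_quorum` is an illustrative helper, not a NATS command.

```shell
# Succeed if `up` of `total` cluster members form a strict majority,
# i.e. up >= floor(total / 2) + 1.
has_quorum() {
    total=$1
    up=$2
    [ "$up" -ge $(( total / 2 + 1 )) ]
}

# With three members, losing one keeps quorum; losing two does not:
#   has_quorum 3 2   -> succeeds
#   has_quorum 3 1   -> fails
```

This is why the drill stops JetStream on only one node at a time: stopping a second node in a three-node zone would be expected to halt the cluster rather than degrade it.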
5. Durability and Backup Checks
| Check | Method | Expected Result | Status | Notes |
|---|---|---|---|---|
| Snapshot backup completed | `ls -la /backup/` or remote target | Backup present, timestamp recent | [ ] | |
| Remote sync verified | `rclone ls remote:bsfg-backup/$ZONE_NAME/` | Backup files present on remote | [ ] | |
| Restore procedure reference exists | `ls /opt/bsfg/docs/restore-procedure.md` | Document present | [ ] | |
| RAID degradation alert tested | `mdadm --fail` (test disk, then re-add) | Alert fires, RAID recovers | [ ] | |
| Disk health visible | SMART data in monitoring | NVMe health metrics present | [ ] | |
| Backup freshness monitoring active | Check monitoring dashboard or alertmanager | Freshness alert configured; fires when backup age exceeds policy threshold | [ ] | |
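
"Timestamp recent" and "no degraded arrays" can both be turned into pass/fail checks. The sketch below is one way to do that, assuming a Linux host with GNU coreutils. The function names, the example backup path, and the 24-hour threshold in the usage comment are all hypothetical; the real threshold comes from the zone's backup policy.

```shell
# Succeed if the file's modification time is at most max_age_s seconds old.
# Returns 2 if the file cannot be read. Uses GNU `stat -c %Y` (mtime as
# epoch seconds), so this assumes Linux/coreutils.
backup_is_fresh() {
    mtime=$(stat -c %Y "$1") || return 2
    now=$(date +%s)
    [ $(( now - mtime )) -le "$2" ]
}

# Succeed if the mdstat text on stdin shows any array with a missing member,
# i.e. an underscore inside the member status field, as in [U_] or [_U].
mdstat_degraded() {
    grep -Eq '\[[U_]*_[U_]*\]'
}

# Hypothetical usage (path and threshold are placeholders):
#   backup_is_fresh /backup/latest.snapshot 86400 || echo "backup is stale"
#   mdstat_degraded < /proc/mdstat && echo "RAID degraded"
```

During the RAID degradation alert test, `mdstat_degraded` should succeed after the test disk is failed and stop succeeding once the re-added disk has finished resyncing.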
6. Acceptance Gate
6.1 Summary
| Category | Checks Passed | Checks Failed | Waived |
|---|---|---|---|
| Preflight | | | |
| Steady-State | | | |
| Failover Drill | | | |
| Degraded-State | | | |
| Durability/Backup | | | |
6.2 Overall Status
Select one:
- Passed — All critical checks passed, zone ready for federation bring-up
- Passed with Exception — Minor issues documented and waived by Platform Lead
- Failed — Critical checks failed, requires remediation before handoff
- Requires Escalation — Uncertainty or blocker requiring architecture review
6.3 Exceptions and Waivers
| Check | Exception | Justification | Approved By | Date |
|---|---|---|---|---|
6.4 Sign-off
| Role | Name | Date | Signature |
|---|---|---|---|
| Commissioning Engineer | | | |
| Platform Lead | | | |
| Security/Compliance (if required) | | | |
7. Post-Commissioning Reference
| Document | Purpose |
|---|---|
| Runbook: Cross-Zone Federation Bring-Up | Next step: establish federation |
| Checklist: Cross-Zone Federation Validation | Verify federation guarantees |
| Reference Deployment Pattern: Triad-HA with Keepalived Failover | Architecture reference |
| Reference Interaction Pattern: Cross-Zone BSFG Federation | Cross-zone architecture reference |