🧪 Karpenter ARM NodePool + NGINX HA Validation:
- ✅ Karpenter NodePool
- ✅ NGINX Deployment
- ✅ ExternalDNS
- ✅ Route53
- ✅ Scaling
- ✅ Node failure
- ✅ EC2 crash
- ✅ Maintenance window
- ✅ Consolidation
- ✅ Downtime measurement
- ✅ Worst-case scenarios
This is written like real production change documentation.
📋 PRODUCTION SOP
Title: Karpenter ARM NodePool + NGINX High Availability Validation
Environment: Production
Objective: Ensure ZERO or minimal downtime under all scenarios.
🧩 ARCHITECTURE OVERVIEW
User
↓
Route53
↓
ExternalDNS
↓
ELB (NGINX Controller)
↓
NGINX Ingress
↓
Service (ClusterIP)
↓
Pod (nginx-test)
↓
Karpenter ARM Node

🟢 PHASE 1 — PRE-CHECK (MANDATORY)
1.1 Confirm Ingress Controller
kubectl get svc -n ingress-nginx

Must show:
TYPE: LoadBalancer
EXTERNAL-IP: *.elb.amazonaws.com

If not → STOP.
1.2 Confirm ExternalDNS Running
kubectl get pods -n external-dns

If not running → Route53 won't auto-update.
1.3 Confirm Karpenter Healthy
kubectl get pods -n karpenter
kubectl get nodepool

1.4 Start Downtime Monitor (VERY IMPORTANT)
From bastion:
while true; do
  echo "$(date +%s) $(curl -s -o /dev/null -w '%{http_code} %{time_total}' http://nginx-test.yourdomain.com)" >> downtime.log
  sleep 1
done

This writes one record per probe (epoch timestamp, HTTP code, latency) and detects:
- 503
- 504
- Time spikes
- Any failure
This log = final truth.
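Once a scenario finishes, the failure count can be pulled straight out of this log. A minimal sketch, assuming the one-record-per-line format `epoch_seconds http_code time_total`; the sample values below are made up:

```shell
# Build a tiny sample log in the format: epoch_seconds http_code time_total
cat > downtime.log <<'EOF'
1000 200 0.031
1001 503 0.010
1002 504 1.002
1003 200 0.030
EOF

# At 1 probe per second, each non-200 line is roughly 1 second of downtime.
failed=$(awk '$2 != 200 {n++} END {print n+0}' downtime.log)
echo "failed_probes=$failed (~${failed}s downtime)"
```

Checking the status field with awk (rather than a plain `grep -v 200`) avoids false matches on latency values that happen to contain "200".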
🟢 PHASE 2 — APPLY KARPENTER NODEPOOL
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose-arm
  annotations:
    kubernetes.io/description: Graviton NodePool for ARM-based workloads
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 1h
    consolidationPolicy: WhenEmpty
  template:
    metadata:
      labels:
        role: shopify-prod
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: graviton-default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - r
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "6"
----------------
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
----------------
apiVersion: v1
kind: Service
metadata:
  name: nginx-test
spec:
  selector:
    app: nginx-test
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP
------------------
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-test
  annotations:
    kubernetes.io/ingress.class: nginx
    external-dns.alpha.kubernetes.io/hostname: nginx-test.yourdomain.com
spec:
  rules:
  - host: nginx-test.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: nginx-test
            port:
              number: 80
Apply:

kubectl apply -f karpenter-nginx-test.yaml

Verify:

kubectl describe nodepool general-purpose-arm

Expected:
- No immediate disruption
- No node termination

Because:
- the NodePool status currently reports nodes: "0" (nothing to disrupt yet)
- consolidationPolicy: WhenEmpty only removes empty nodes, and consolidateAfter: 1h delays even that
- expireAfter: 720h means no node is anywhere near expiry

If nodes restart → configuration error.
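The percentage budget also keeps disruption conservative. Karpenter converts a percentage budget into a whole number of nodes; assuming that conversion floors (verify the rounding rule in the disruption docs for your Karpenter version), `nodes: 10%` permits zero simultaneous disruptions until the NodePool owns at least 10 nodes. The integer arithmetic, as a sketch:

```shell
# floor(n * 10 / 100) simultaneous disruptions allowed under a 10% budget,
# assuming floor rounding (check your Karpenter version's docs)
for n in 1 5 9 10 25; do
  echo "nodes=$n allowed=$(( n * 10 / 100 ))"
done
```

So in a small cluster the effective allowance can be 0, which matches the "no mass eviction" behavior expected later in this SOP.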
🟢 PHASE 3 — DEPLOY NGINX APP
Apply:

kubectl apply -f nginx-deployment.yaml
kubectl apply -f nginx-service.yaml
kubectl apply -f nginx-ingress.yaml

Verify:

kubectl get pods -o wide
kubectl get ingress

Verify the Route53 record was created.
Test:

curl http://nginx-test.yourdomain.com

Expected:

Welcome to nginx!

🟢 PHASE 4 — SCALING TESTS
Scenario 1 — Scale Up

kubectl scale deployment nginx-test --replicas=5

Expected:
- Pods increase
- If no capacity → Karpenter provisions a new ARM node
- NO downtime

Check:

kubectl get nodes

Check the downtime log — it should remain all 200s.
Scenario 2 — Scale Down

kubectl scale deployment nginx-test --replicas=1

Expected:
- Pods terminate gracefully
- No downtime
- Extra nodes may remain (due to conservative consolidation)
Scenario 3 — Single Replica Risk

With replicas=1:

kubectl delete pod <pod-name>

Expected:
- 1–5 sec 503 spike
- Pod recreated

This proves:
Production MUST use replicas ≥2.
🟢 PHASE 5 — NODE FAILURE TESTS
Scenario 4 — Manual Node Drain

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Expected:
- Pods rescheduled
- No downtime (if replicas ≥2)
- Drained node may remain for up to 1h (WhenEmpty consolidation with consolidateAfter: 1h)

Check the downtime log.
Scenario 5 — EC2 Hard Kill

aws ec2 terminate-instances --instance-ids <id>

Expected:
- Node NotReady
- Karpenter provisions a new ARM node
- Pods rescheduled
- Small recovery window

Measure:
- Time to node Ready
- Time to pod Ready
- Downtime seconds

Target: < 60 sec total.
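The recovery window can be computed from the monitor log rather than eyeballed. A sketch, assuming the `epoch_seconds http_code time_total` record format; the sample values below are invented, and `000` is curl's code for a connection that never completed:

```shell
# Sample log around an instance kill: epoch_seconds http_code time_total
cat > ec2kill.log <<'EOF'
100 200 0.03
101 000 0.00
102 503 0.01
103 503 0.01
104 200 0.03
EOF

# Downtime = first failing probe until the first succeeding probe after it.
awk '$2 != 200 && !start { start = $1 }
     $2 == 200 && start   { print "downtime=" ($1 - start) "s"; exit }' ec2kill.log
```

On the sample data this reports downtime=3s; compare the real figure against the < 60 sec target.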
🟢 PHASE 6 — KARPENTER DISRUPTION TEST
Temporarily change the disruption budget:

budgets:
- nodes: "1"

Apply.
Then trigger drift by editing the NodePool template (e.g. add or change a label under spec.template.metadata.labels); Karpenter marks the existing nodes as drifted and replaces them.
Expected:
- Only 1 node replaced at a time
- No mass eviction
- No downtime

Revert the config after the test.
🟢 PHASE 7 — CONSOLIDATION TEST
Temporarily set:

consolidateAfter: 2m

Scale replicas down to 1.
Wait.
Expected:
- Empty node removed
- Running node untouched
- No downtime
🟢 PHASE 8 — LOAD TEST
Install hey (a single Go binary, not packaged in most distro repos):

go install github.com/rakyll/hey@latest

Run:

hey -n 20000 -c 500 http://nginx-test.yourdomain.com

Monitor:

kubectl top nodes
kubectl top pods

Expected:
- No 5xx
- CPU stable
- No pod crashes
🟢 PHASE 9 — INGRESS CONTROLLER FAILURE
Delete one ingress controller pod:

kubectl delete pod -n ingress-nginx <pod-name>

Expected:
- No downtime
- Because controller replicas ≥2

If downtime occurs → ingress HA issue.
🟢 PHASE 10 — DNS FAILURE TEST
Temporarily delete the Route53 record.
Check:
- How long do clients fail?
- TTL behavior

Restore the record.
This tests the real-world DNS dependency.
🛡️ MANDATORY PRODUCTION SAFETY
Add a PodDisruptionBudget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx-test

📊 WHAT YOU DOCUMENT
For each scenario record:
| Scenario | Downtime | Node Replacement Time | Pod Ready Time |
|---|---|---|---|
| Scale up | 0s | X | X |
| Scale down | 0s | - | X |
| Pod kill | 2s | - | 8s |
| Node drain | 0s | X | X |
| EC2 kill | 20s | 45s | 12s |
| Drift replace | 0s | X | X |
🎯 FINAL ACCEPTANCE CRITERIA
Production ready if:
- ✅ No downtime during scaling
- ✅ No downtime during node drain
- ✅ Max 60 sec downtime during an EC2 crash
- ✅ No mass eviction
- ✅ Only 1 node replaced during maintenance
- ✅ No surprise replacement
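These criteria can be enforced mechanically once the scenario table is filled in. A sketch gate script; the variable names and sample measurements below are hypothetical placeholders for your recorded values:

```shell
# Hypothetical measured downtime values, in seconds, from the scenario table
scale_downtime=0
drain_downtime=0
ec2_crash_recovery=20
nodes_replaced_at_once=1

pass=true
[ "$scale_downtime" -eq 0 ]         || { echo "FAIL: downtime during scaling"; pass=false; }
[ "$drain_downtime" -eq 0 ]         || { echo "FAIL: downtime during node drain"; pass=false; }
[ "$ec2_crash_recovery" -le 60 ]    || { echo "FAIL: EC2 crash recovery exceeded 60s"; pass=false; }
[ "$nodes_replaced_at_once" -le 1 ] || { echo "FAIL: more than one node replaced at once"; pass=false; }
$pass && echo "ACCEPTED: production ready"
```

Running this in the change ticket makes sign-off reproducible instead of a judgment call.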
🚨 MOST IMPORTANT THINGS
For real zero downtime:
- Ingress controller replicas ≥2
- App replicas ≥2
- PDB enabled
- Readiness probes configured
- Graceful termination period set
- Proper health checks in the ELB

Without readiness probes, downtime may happen even with 2 replicas.
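As a concrete illustration of the last few bullets, the nginx-test pod spec could be extended like this; the probe path, timings, and preStop delay are example assumptions to tune per application, not values from the original config:

```yaml
# Sketch: pod spec additions for nginx-test (values are example assumptions)
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    readinessProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 5
      failureThreshold: 2
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]  # keep serving while the ELB deregisters
```

The preStop sleep gives the load balancer time to stop sending traffic before NGINX shuts down, which is what closes the 503 window seen in the single-replica test.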
🔥 Your Setup Verdict
Your NodePool config is:
- Stability-first
- Predictable expiry (720h = 30 days, no surprise churn)
- No spot interruption (on-demand only)
- Controlled maintenance
- Production-safe