🧪 Karpenter ARM NodePool + NGINX HA Validation:
- ✅ Karpenter NodePool
- ✅ NGINX Deployment
- ✅ ExternalDNS
- ✅ Route53
- ✅ Scaling
- ✅ Node failure
- ✅ EC2 crash
- ✅ Maintenance window
- ✅ Consolidation
- ✅ Downtime measurement
- ✅ Worst-case scenarios
This is written like real production change documentation.
📋 PRODUCTION SOP
Title: Karpenter ARM NodePool + NGINX High Availability Validation
Environment: Production
Objective: Ensure ZERO or minimal downtime under all scenarios.
🧩 ARCHITECTURE OVERVIEW
User
↓
Route53
↓
ExternalDNS
↓
ELB (NGINX Controller)
↓
NGINX Ingress
↓
Service (ClusterIP)
↓
Pod (nginx-test)
↓
Karpenter ARM Node

🟢 PHASE 1 — PRE-CHECK (MANDATORY)
1.1 Confirm Ingress Controller
kubectl get svc -n ingress-nginx

Must show:
TYPE: LoadBalancer
EXTERNAL-IP: *.elb.amazonaws.com

If not → STOP.
1.2 Confirm ExternalDNS Running
kubectl get pods -n external-dns

If not running → Route53 won't auto-update.
1.3 Confirm Karpenter Healthy
kubectl get pods -n karpenter
kubectl get nodepool

1.4 Start Downtime Monitor (VERY IMPORTANT)
From bastion:
while true; do
  echo "$(date +%s) $(curl -s -o /dev/null -w '%{http_code} %{time_total}' http://nginx-test.yourdomain.com)" >> downtime.log
  sleep 1
done

This writes one record per probe (epoch timestamp, HTTP code, latency) and detects:
- 503
- 504
- Time spikes
- Any failure
This log = final truth.
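Once a scenario finishes, the failure count can be pulled straight out of this log. A minimal sketch, assuming the one-record-per-line format `epoch_seconds http_code time_total`; the sample values below are made up:

```shell
# Build a tiny sample log in the format: epoch_seconds http_code time_total
cat > downtime.log <<'EOF'
1000 200 0.031
1001 503 0.010
1002 504 1.002
1003 200 0.030
EOF

# At 1 probe per second, each non-200 line is roughly 1 second of downtime.
failed=$(awk '$2 != 200 {n++} END {print n+0}' downtime.log)
echo "failed_probes=$failed (~${failed}s downtime)"
```

Checking the status field with awk (rather than a plain `grep -v 200`) avoids false matches on latency values that happen to contain "200".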
🟢 PHASE 2 — APPLY KARPENTER NODEPOOL
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose-arm
  annotations:
    kubernetes.io/description: Graviton NodePool for ARM-based workloads
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 1h
    consolidationPolicy: WhenEmpty
  template:
    metadata:
      labels:
        role: shopify-prod
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: graviton-default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - r
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "6"
----------------
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
----------------
apiVersion: v1
kind: Service
metadata:
  name: nginx-test
spec:
  selector:
    app: nginx-test
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP
------------------
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-test
  annotations:
    kubernetes.io/ingress.class: nginx
    external-dns.alpha.kubernetes.io/hostname: nginx-test.yourdomain.com
spec:
  rules:
  - host: nginx-test.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: nginx-test
            port:
              number: 80
Apply:

kubectl apply -f karpenter-nginx-test.yaml

Verify:

kubectl describe nodepool general-purpose-arm

Expected:
- No immediate disruption
- No node termination

Because:
- the NodePool status currently reports nodes: "0" (nothing to disrupt yet)
- consolidationPolicy: WhenEmpty only removes empty nodes, and consolidateAfter: 1h delays even that
- expireAfter: 720h means no node is anywhere near expiry

If nodes restart → configuration error.
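The percentage budget also keeps disruption conservative. Karpenter converts a percentage budget into a whole number of nodes; assuming that conversion floors (verify the rounding rule in the disruption docs for your Karpenter version), `nodes: 10%` permits zero simultaneous disruptions until the NodePool owns at least 10 nodes. The integer arithmetic, as a sketch:

```shell
# floor(n * 10 / 100) simultaneous disruptions allowed under a 10% budget,
# assuming floor rounding (check your Karpenter version's docs)
for n in 1 5 9 10 25; do
  echo "nodes=$n allowed=$(( n * 10 / 100 ))"
done
```

So in a small cluster the effective allowance can be 0, which matches the "no mass eviction" behavior expected later in this SOP.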
🟢 PHASE 3 — DEPLOY NGINX APP
Apply:

kubectl apply -f nginx-deployment.yaml
kubectl apply -f nginx-service.yaml
kubectl apply -f nginx-ingress.yaml

Verify:

kubectl get pods -o wide
kubectl get ingress

Verify the Route53 record was created.
Test:

curl http://nginx-test.yourdomain.com

Expected:

Welcome to nginx!

🟢 PHASE 4 — SCALING TESTS
Scenario 1 — Scale Up

kubectl scale deployment nginx-test --replicas=5

Expected:
- Pods increase
- If no capacity → Karpenter provisions a new ARM node
- NO downtime

Check:

kubectl get nodes

Check the downtime log — it should remain all 200s.
Scenario 2 — Scale Down

kubectl scale deployment nginx-test --replicas=1

Expected:
- Pods terminate gracefully
- No downtime
- Extra nodes may remain (due to conservative consolidation)
Scenario 3 — Single Replica Risk

With replicas=1:

kubectl delete pod <pod-name>

Expected:
- 1–5 sec 503 spike
- Pod recreated

This proves:
Production MUST use replicas ≥2.
🟢 PHASE 5 — NODE FAILURE TESTS
Scenario 4 — Manual Node Drain

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Expected:
- Pods rescheduled
- No downtime (if replicas ≥2)
- Drained node may remain for up to 1h (WhenEmpty consolidation with consolidateAfter: 1h)

Check the downtime log.
Scenario 5 — EC2 Hard Kill

aws ec2 terminate-instances --instance-ids <id>

Expected:
- Node NotReady
- Karpenter provisions a new ARM node
- Pods rescheduled
- Small recovery window

Measure:
- Time to node Ready
- Time to pod Ready
- Downtime seconds

Target: < 60 sec total.
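The recovery window can be computed from the monitor log rather than eyeballed. A sketch, assuming the `epoch_seconds http_code time_total` record format; the sample values below are invented, and `000` is curl's code for a connection that never completed:

```shell
# Sample log around an instance kill: epoch_seconds http_code time_total
cat > ec2kill.log <<'EOF'
100 200 0.03
101 000 0.00
102 503 0.01
103 503 0.01
104 200 0.03
EOF

# Downtime = first failing probe until the first succeeding probe after it.
awk '$2 != 200 && !start { start = $1 }
     $2 == 200 && start   { print "downtime=" ($1 - start) "s"; exit }' ec2kill.log
```

On the sample data this reports downtime=3s; compare the real figure against the < 60 sec target.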
🟢 PHASE 6 — KARPENTER DISRUPTION TEST
Temporarily change the disruption budget:

budgets:
- nodes: "1"

Apply.
Then trigger drift by editing the NodePool template (e.g. add or change a label under spec.template.metadata.labels); Karpenter marks the existing nodes as drifted and replaces them.
Expected:
- Only 1 node replaced at a time
- No mass eviction
- No downtime

Revert the config after the test.
🟢 PHASE 7 — CONSOLIDATION TEST
Temporarily set:

consolidateAfter: 2m

Scale replicas down to 1.
Wait.
Expected:
- Empty node removed
- Running node untouched
- No downtime
🟢 PHASE 8 — LOAD TEST
Install hey (a single Go binary, not packaged in most distro repos):

go install github.com/rakyll/hey@latest

Run:

hey -n 20000 -c 500 http://nginx-test.yourdomain.com

Monitor:

kubectl top nodes
kubectl top pods

Expected:
- No 5xx
- CPU stable
- No pod crashes
🟢 PHASE 9 — INGRESS CONTROLLER FAILURE
Delete one ingress controller pod:

kubectl delete pod -n ingress-nginx <pod-name>

Expected:
- No downtime
- Because controller replicas ≥2

If downtime occurs → ingress HA issue.
🟢 PHASE 10 — DNS FAILURE TEST
Temporarily delete the Route53 record.
Check:
- How long do clients fail?
- TTL behavior

Restore the record.
This tests the real-world DNS dependency.
🛡️ MANDATORY PRODUCTION SAFETY
Add a PodDisruptionBudget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx-test

📊 WHAT YOU DOCUMENT
For each scenario record:
| Scenario | Downtime | Node Replacement Time | Pod Ready Time |
|---|---|---|---|
| Scale up | 0s | X | X |
| Scale down | 0s | - | X |
| Pod kill | 2s | - | 8s |
| Node drain | 0s | X | X |
| EC2 kill | 20s | 45s | 12s |
| Drift replace | 0s | X | X |
🎯 FINAL ACCEPTANCE CRITERIA
Production ready if:
- ✅ No downtime during scaling
- ✅ No downtime during node drain
- ✅ Max 60 sec downtime during an EC2 crash
- ✅ No mass eviction
- ✅ Only 1 node replaced during maintenance
- ✅ No surprise replacement
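These criteria can be enforced mechanically once the scenario table is filled in. A sketch gate script; the variable names and sample measurements below are hypothetical placeholders for your recorded values:

```shell
# Hypothetical measured downtime values, in seconds, from the scenario table
scale_downtime=0
drain_downtime=0
ec2_crash_recovery=20
nodes_replaced_at_once=1

pass=true
[ "$scale_downtime" -eq 0 ]         || { echo "FAIL: downtime during scaling"; pass=false; }
[ "$drain_downtime" -eq 0 ]         || { echo "FAIL: downtime during node drain"; pass=false; }
[ "$ec2_crash_recovery" -le 60 ]    || { echo "FAIL: EC2 crash recovery exceeded 60s"; pass=false; }
[ "$nodes_replaced_at_once" -le 1 ] || { echo "FAIL: more than one node replaced at once"; pass=false; }
$pass && echo "ACCEPTED: production ready"
```

Running this in the change ticket makes sign-off reproducible instead of a judgment call.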
🚨 MOST IMPORTANT THINGS
For real zero downtime:
- Ingress controller replicas ≥2
- App replicas ≥2
- PDB enabled
- Readiness probes configured
- Graceful termination period set
- Proper health checks in the ELB

Without readiness probes, downtime may happen even with 2 replicas.
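As a concrete illustration of the last few bullets, the nginx-test pod spec could be extended like this; the probe path, timings, and preStop delay are example assumptions to tune per application, not values from the original config:

```yaml
# Sketch: pod spec additions for nginx-test (values are example assumptions)
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    readinessProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 5
      failureThreshold: 2
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]  # keep serving while the ELB deregisters
```

The preStop sleep gives the load balancer time to stop sending traffic before NGINX shuts down, which is what closes the 503 window seen in the single-replica test.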
🔥 Your Setup Verdict
Your NodePool config is:
- Stability-first
- Predictable expiry (720h = 30 days, no surprise churn)
- No spot interruption (on-demand only)
- Controlled maintenance
- Production-safe