Karpenter Test

πŸ§ͺ Karpenter ARM NodePool + NGINX HA Validation:

  • βœ… Karpenter NodePool
  • βœ… NGINX Deployment
  • βœ… ExternalDNS
  • βœ… Route53
  • βœ… Scaling
  • βœ… Node failure
  • βœ… EC2 crash
  • βœ… Maintenance window
  • βœ… Consolidation
  • βœ… Downtime measurement
  • βœ… Worst-case scenarios

This is written like real production change documentation.


πŸ“˜ PRODUCTION SOP

Title: Karpenter ARM NodePool + NGINX High Availability Validation
Environment: Production
Objective: Ensure zero or minimal downtime under all scenarios.


🧩 ARCHITECTURE OVERVIEW

User
 ↓
Route53
 ↓
ExternalDNS
 ↓
ELB (NGINX Controller)
 ↓
NGINX Ingress
 ↓
Service (ClusterIP)
 ↓
Pod (nginx-test)
 ↓
Karpenter ARM Node

🟒 PHASE 1 β€” PRE-CHECK (MANDATORY)

1.1 Confirm Ingress Controller

kubectl get svc -n ingress-nginx

Must show:

TYPE: LoadBalancer
EXTERNAL-IP: *.elb.amazonaws.com

If not β†’ STOP.


1.2 Confirm ExternalDNS Running

kubectl get pods -n external-dns

If not running β†’ Route53 won’t auto update.


1.3 Confirm Karpenter Healthy

kubectl get pods -n karpenter
kubectl get nodepool

1.4 Start Downtime Monitor (VERY IMPORTANT)

From bastion:

while true; do
  date >> downtime.log
  curl -s -o /dev/null -w "%{http_code} %{time_total}\n" http://nginx-test.yourdomain.com >> downtime.log
  sleep 1
done

This detects:

  • 503
  • 504
  • Time spikes
  • Any failure

This log is the final source of truth for every downtime claim below.
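The log can be summarized after each scenario. A minimal sketch, assuming the sampler format above (a date line followed by a "<http_code> <time_total>" line; curl writes 000 when the connection fails outright):

```shell
# Count failed samples in downtime.log; at one sample per second,
# the failure count approximates total downtime in seconds.
awk '/^[0-9][0-9][0-9] / { total++; if ($1 != "200") bad++ }
     END { printf "samples=%d failures=%d (~%ds downtime)\n", total, bad+0, bad+0 }' downtime.log
```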


🟒 PHASE 2 β€” APPLY KARPENTER NODEPOOL

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose-arm
  annotations:
    kubernetes.io/description: Graviton NodePool for ARM-based workloads
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 1h
    consolidationPolicy: WhenEmpty
  template:
    metadata:
      labels:
        role: shopify-prod
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: graviton-default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - r
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "6"
 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      nodeSelector:
        role: shopify-prod        # target the Karpenter ARM NodePool label
      containers:
      - name: nginx
        image: nginx              # official image is multi-arch (arm64 included)
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 500m             # non-trivial request so scale-up can force a new node
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-test
spec:
  selector:
    app: nginx-test
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-test
  annotations:
    external-dns.alpha.kubernetes.io/hostname: nginx-test.yourdomain.com
spec:
  ingressClassName: nginx
  rules:
  - host: nginx-test.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: nginx-test
            port:
              number: 80
 

Apply:

kubectl apply -f karpenter-nginx-test.yaml

Verify:

kubectl describe nodepool general-purpose-arm

Expected:

  • No immediate disruption
  • No node termination

Because:

  • the disruption budget (nodes: 10%) limits how many nodes can be disrupted at once
  • consolidationPolicy: WhenEmpty only removes nodes with no pods, and only after consolidateAfter: 1h
  • expireAfter: 720h means no node is due to expire for 30 days

If nodes restart immediately → configuration error.


🟒 PHASE 3 β€” DEPLOY NGINX APP

Apply:

kubectl apply -f nginx-deployment.yaml
kubectl apply -f nginx-service.yaml
kubectl apply -f nginx-ingress.yaml

Verify:

kubectl get pods -o wide
kubectl get ingress

Verify Route53 record created.

Test:

curl http://nginx-test.yourdomain.com

Expected:

Welcome to nginx!

🟒 PHASE 4 β€” SCALING TESTS


Scenario 1 β€” Scale Up

kubectl scale deployment nginx-test --replicas=5

Expected:

  • Pods increase
  • If no capacity β†’ Karpenter provisions new ARM node
  • NO downtime

Check:

kubectl get nodes

Check downtime log β†’ should remain 200.


Scenario 2 β€” Scale Down

kubectl scale deployment nginx-test --replicas=1

Expected:

  • Pods terminate gracefully
  • No downtime
  • Extra nodes may remain (due to conservative consolidation)

Scenario 3 β€” Single Replica Risk

With replicas=1:

kubectl delete pod <pod-name>

Expected:

  • 1–5 sec 503 spike
  • Pod recreated

This proves:

Production MUST use replicas β‰₯2.
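Replicas ≥2 only help if the scheduler is told to spread them. A hedged sketch using the standard topologySpreadConstraints field (the label matches the nginx-test Deployment above), added under the pod template spec:

```yaml
# Spread nginx-test replicas across distinct nodes so a single
# node failure cannot take down all replicas at once.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule   # relax to ScheduleAnyway if capacity is tight
  labelSelector:
    matchLabels:
      app: nginx-test
```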


🟒 PHASE 5 β€” NODE FAILURE TESTS


Scenario 4 β€” Manual Node Drain

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Expected:

  • Pods rescheduled
  • No downtime (if replicas ≥2)
  • Drained node may linger up to 1h before removal (consolidateAfter: 1h, WhenEmpty)

Check downtime log.


Scenario 5 β€” EC2 Hard Kill

aws ec2 terminate-instances --instance-ids <id>

Expected:

  • Node NotReady
  • Karpenter provisions new ARM node
  • Pods rescheduled
  • Small recovery window

Measure:

  • Time node ready
  • Time pod ready
  • Downtime seconds

Target: < 60 sec total.
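A hedged helper to pull those numbers out of the Phase 1 log (assumes the sampler format: a date line, then a "<http_code> <time_total>" line per sample; curl writes 000 when the connection fails entirely):

```shell
# Bound the outage window: remember the timestamp line before each
# sample, and keep it for the first and last non-200 sample.
awk '
  !/^[0-9][0-9][0-9] / { ts = $0; next }          # timestamp line
  $1 != "200"          { if (!s) s = ts; e = ts } # failing sample
  END { if (s) print "outage from: " s "\noutage to:   " e
        else   print "no failures recorded" }
' downtime.log
```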


🟒 PHASE 6 β€” KARPENTER DISRUPTION TEST

Temporarily change the disruption budget so only one node may be disrupted at a time:

budgets:
- nodes: "1"

Apply.

Then trigger drift by changing the node template so existing nodes no longer match it (for example, add a label under spec.template.metadata.labels and re-apply the NodePool); Karpenter marks those nodes as Drifted and replaces them within the budget.

Expected:

  • Only 1 node replaced
  • No mass eviction
  • No downtime

Revert config after test.


🟒 PHASE 7 β€” CONSOLIDATION TEST

Temporarily set:

consolidateAfter: 2m

Scale down replicas to 1.

Wait.

Expected:

  • Empty node removed
  • Running node untouched
  • No downtime

🟒 PHASE 8 β€” LOAD TEST

Install hey:

sudo apt install hey   # if not packaged for your distro: go install github.com/rakyll/hey@latest

Run:

hey -n 20000 -c 500 http://nginx-test.yourdomain.com

Monitor:

kubectl top nodes
kubectl top pods

Expected:

  • No 5xx
  • CPU stable
  • No pod crashes

🟒 PHASE 9 β€” INGRESS CONTROLLER FAILURE

Delete one ingress controller pod:

kubectl delete pod -n ingress-nginx <pod-name>

Expected:

  • No downtime
  • Because replicas β‰₯2

If downtime occurs β†’ ingress HA issue.
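If deleting one controller pod does cause downtime, the controller is likely running a single replica on one node. A hedged values fragment for the community ingress-nginx Helm chart (value names assumed from recent chart versions):

```yaml
# values.yaml fragment: run two controller replicas on different nodes
controller:
  replicaCount: 2
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: ingress-nginx
        topologyKey: kubernetes.io/hostname
```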


🟒 PHASE 10 β€” DNS FAILURE TEST

Temporarily delete Route53 record.

Check:

  • How long clients fail?
  • TTL behavior

Restore record.

This tests real-world DNS dependency.
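The outage length is bounded mainly by the record's TTL. ExternalDNS exposes an annotation to control it; a hedged fragment for the nginx-test Ingress (annotation name is ExternalDNS's, value in seconds):

```yaml
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/ttl: "60"   # clients re-resolve within ~60s of restore
```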


πŸ›‘οΈ MANDATORY PRODUCTION SAFETY

Add PodDisruptionBudget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx-test

πŸ“Š WHAT YOU DOCUMENT

For each scenario record:

| Scenario      | Downtime | Node Replacement Time | Pod Ready Time |
|---------------|----------|-----------------------|----------------|
| Scale up      | 0s       | X                     | X              |
| Scale down    | 0s       | -                     | X              |
| Pod kill      | 2s       | -                     | 8s             |
| Node drain    | 0s       | X                     | X              |
| EC2 kill      | 20s      | 45s                   | 12s            |
| Drift replace | 0s       | X                     | X              |

🎯 FINAL ACCEPTANCE CRITERIA

Production ready if:

βœ” No downtime during scaling
βœ” No downtime during node drain
βœ” Max 60 sec during EC2 crash
βœ” No mass eviction
βœ” Only 1 node replaced during maintenance
βœ” No surprise replacement


🚨 MOST IMPORTANT THINGS

For real zero downtime:

  • Ingress controller replicas β‰₯2
  • App replicas β‰₯2
  • PDB enabled
  • Readiness probes configured
  • Graceful termination period set
  • Proper health checks in ELB

Without readiness probes, downtime may happen even with 2 replicas.
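Readiness probes and graceful termination can be sketched directly on the nginx-test container (field names are standard Kubernetes API; the probe path and timings are illustrative assumptions):

```yaml
# Additions to the nginx container in nginx-deployment.yaml
readinessProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 5
  failureThreshold: 2
lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"]   # keep serving while the ELB deregisters the pod
```

Pair this with a pod-level terminationGracePeriodSeconds longer than the preStop sleep.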


πŸ”₯ Your Setup Verdict

Your NodePool config is:

  • Stability-first (on-demand only, no spot interruption)
  • Predictable node expiry at 30 days (expireAfter: 720h), not surprise expiry
  • Conservative consolidation (WhenEmpty, after 1h)
  • Controlled maintenance (10% disruption budget)
  • Production-safe