🚨 Built a Production Kubernetes Alerting Pipeline → Google Chat in under an hour

After getting tired of missing critical alerts, I set up a complete alerting pipeline on our production EKS cluster that sends 4xx/5xx errors, latency spikes, and pod failures directly to Google Chat — zero third-party SaaS needed.

Here's the full setup 👇

The Stack:

Prometheus + kube-prometheus-stack

Custom PrometheusRule (PromQL)

Alertmanager routing

Lightweight Python/Flask webhook bridge

Google Chat webhook

Step 1 — PrometheusRule (what fires the alerts)

Fires on real signal only — 401s and 404s excluded from 4xx (they're noise):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: your-rule-name  # placeholder: choose your own
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: your-group-name  # placeholder: choose your own
      interval: 30s
      rules:
        - alert: High5xxErrorRate
          expr: |
            (
              sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application, namespace, service)
              /
              sum(rate(http_server_requests_seconds_count[5m])) by (application, namespace, service)
            ) > 0.10
          for: 5m
          labels:
            severity: critical
        - alert: High4xxErrorRate
          expr: |
            (
              sum(rate(http_server_requests_seconds_count{status=~"4..", status!~"401|404"}[5m])) by (application, namespace, service)
              /
              sum(rate(http_server_requests_seconds_count[5m])) by (application, namespace, service)
            ) > 0.20
          for: 5m
          labels:
            severity: warning
        # Per-URI rules: clamp_min(..., 1) floors the denominator at 1 req/s,
        # which keeps very low-traffic URIs from tripping the ratio.
        - alert: High5xxErrorRatioPerURI
          expr: |
            (
              sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application, namespace, service, uri)
              /
              clamp_min(sum(rate(http_server_requests_seconds_count[5m])) by (application, namespace, service, uri), 1)
            ) > 0.10
          for: 5m
          labels:
            severity: critical
        - alert: High4xxErrorRatioPerURI
          expr: |
            (
              sum(rate(http_server_requests_seconds_count{status=~"4..", status!~"401|404"}[5m])) by (application, namespace, service, uri)
              /
              clamp_min(sum(rate(http_server_requests_seconds_count[5m])) by (application, namespace, service, uri), 1)
            ) > 0.25
          for: 5m
          labels:
            severity: warning
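
To make the ratio math concrete, here is a rough Python sketch of what the expressions compute. It is not part of the deployed setup: the function names and the sample request rates are made up for illustration, and it ignores the `for: 5m` hold time.

```python
import math


def error_ratio(error_rps, total_rps, min_total_rps=0.0):
    """Compute errors / clamp_min(total, min_total_rps), as in the per-URI rules."""
    denominator = max(total_rps, min_total_rps)  # same effect as clamp_min(total, floor)
    if denominator == 0:
        return math.nan  # PromQL yields NaN for 0/0, and NaN never exceeds a threshold
    return error_rps / denominator


def fires(ratio, threshold):
    """True when the ratio crosses the alert threshold (NaN compares as False)."""
    return ratio > threshold


# Hypothetical quiet URI: every request fails, but traffic is only 0.05 req/s.
quiet_unclamped = error_ratio(0.05, 0.05)      # ratio 1.0 without the floor
quiet_clamped = error_ratio(0.05, 0.05, 1.0)   # ratio 0.05 with the 1 req/s floor

# Hypothetical busy URI: 30 of 200 req/s fail.
busy_clamped = error_ratio(30.0, 200.0, 1.0)   # ratio 0.15

print(fires(quiet_unclamped, 0.10), fires(quiet_clamped, 0.10), fires(busy_clamped, 0.10))
# prints: True False True
```

The floor trades sensitivity for quiet: a URI that sees less than 1 req/s has to fail at a much higher absolute rate before the per-URI rules fire.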

Step 2 — GChat Bridge (the webhook microservice)

A tiny Python/Flask container that receives Alertmanager POSTs and forwards to Google Chat:

apiVersion: v1
kind: Secret
metadata:
  name: gchat-webhook
  namespace: monitoring
type: Opaque
stringData:
  WEBHOOK_URL: "https://chat.googleapis.com/v1/spaces/YOUR_SPACE/messages?key=YOUR_KEY&token=YOUR_TOKEN"

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager-gchat-bridge  # assumed name, matching the app label
  namespace: monitoring
  labels:
    app: alertmanager-gchat-bridge
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager-gchat-bridge
  template:
    metadata:
      labels:
        app: alertmanager-gchat-bridge
    spec:
      containers:
        - name: gchat-bridge
          image: python:3.11-slim
          command: ["sh", "-c"]
          args:
            - |
              pip install --no-cache-dir flask requests && python - <<'EOF'
              from flask import Flask, request
              import requests, os, logging, time

              logging.basicConfig(level=logging.INFO)
              app = Flask(__name__)
              WEBHOOK_URL = os.environ['WEBHOOK_URL']

              @app.route('/health', methods=['GET'])
              def health():
                  return 'OK', 200

              @app.route('/alerts', methods=['POST'])
              def alerts():
                  try:
                      # Tolerate an empty or non-JSON body instead of failing on None
                      data = request.get_json(silent=True) or {}
                      alerts = data.get('alerts', [])
                      logging.info(f"Received {len(alerts)} alerts")
                      text = ""
                      for a in alerts:
                          labels = a.get('labels', {})
                          annotations = a.get('annotations', {})
                          status = a.get('status', 'firing')
                          icon = "✅" if status == "resolved" else "🚨"
                          text += f"{icon} *{labels.get('alertname','N/A')}* ({status.upper()})\n"
                          text += f"Severity: {labels.get('severity','N/A')}\n"
                          text += f"Namespace: {labels.get('namespace','N/A')}\n"
                          text += f"Service: {labels.get('service','N/A')}\n"
                          text += f"{annotations.get('description', annotations.get('summary',''))}\n\n"
                      if not text:
                          return 'OK', 200
                      for i in range(3):
                          try:
                              resp = requests.post(WEBHOOK_URL, json={"text": text}, timeout=10)
                              # Treat non-2xx responses as failures so they are retried too
                              resp.raise_for_status()
                              logging.info(f"Sent to GChat: {resp.status_code}")
                              return 'OK', 200
                          except requests.RequestException as e:
                              # Log only the error type: the message can contain the webhook URL and token
                              logging.error(f"Attempt {i+1} failed: {type(e).__name__}")
                              time.sleep(2)
                      # Every attempt failed: answer 5xx so Alertmanager retries the notification
                      return 'ERROR', 502
                  except Exception as e:
                      logging.error(f"Error: {e}")
                      return 'ERROR', 500

              app.run(host='0.0.0.0', port=8090)
              EOF
          ports:
            - containerPort: 8090
          env:
            - name: WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: gchat-webhook
                  key: WEBHOOK_URL
          resources:
            requests:
              cpu: 10m
              memory: 32Mi
            limits:
              cpu: 100m
              memory: 128Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 8090
            initialDelaySeconds: 15
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8090
            initialDelaySeconds: 10
            periodSeconds: 10

---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-gchat-bridge
  namespace: monitoring
spec:
  type: ClusterIP
  selector:
    app: alertmanager-gchat-bridge
  ports:
    - port: 8090
      targetPort: 8090
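
If you want to check the message format without deploying anything, the bridge's formatting loop can be pulled out into a plain function. This is only a sketch: `format_alerts` and the sample payload are illustrative names and data, trimmed to the fields the bridge actually reads from an Alertmanager webhook POST.

```python
def format_alerts(payload):
    """Render an Alertmanager webhook payload as Google Chat message text."""
    blocks = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "firing")
        icon = "✅" if status == "resolved" else "🚨"
        # Prefer the description annotation, fall back to summary, then to nothing
        detail = annotations.get("description", annotations.get("summary", ""))
        blocks.append(
            f"{icon} *{labels.get('alertname', 'N/A')}* ({status.upper()})\n"
            f"Severity: {labels.get('severity', 'N/A')}\n"
            f"Namespace: {labels.get('namespace', 'N/A')}\n"
            f"Service: {labels.get('service', 'N/A')}\n"
            f"{detail}\n"
        )
    # One blank line between alerts
    return "\n".join(blocks)


# Hypothetical payload with one firing and one resolved alert
sample_payload = {
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "High5xxErrorRate",
                "severity": "critical",
                "namespace": "test",
                "service": "test-service",
            },
            "annotations": {"description": "5xx error rate is 15% on test-service"},
        },
        {
            "status": "resolved",
            "labels": {"alertname": "KubePodCrashLooping", "namespace": "test"},
            "annotations": {"summary": "Pod is crash looping"},
        },
    ]
}

message = format_alerts(sample_payload)
print(message)
```

Keeping the formatting separate from the Flask handler also makes it easy to unit test before touching the cluster.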

Step 3 — Alertmanager Config (routing rules)

Key design decisions:

Default receiver is null → unmatched alerts are silently dropped (no noise)

group_by: [alertname, namespace] → no "null" grouping issue

repeat_interval: 4h → no hourly spam during long incidents

global:
  resolve_timeout: 5m

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity=~"warning|info"
    equal: [namespace, service]
  - source_matchers:
      - severity="warning"
    target_matchers:
      - severity="info"
    equal: [namespace, service]

route:
  receiver: "null"
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - receiver: google-chat-alerts
      matchers:
        - alertname=~"High5xxErrorRate|High4xxErrorRate|High5xxErrorRatioPerURI|High4xxErrorRatioPerURI|KubePodCrashLooping|KubePodNotReady|KubeDeploymentReplicasMismatch|KubeHpaMaxedOut|KubeJobFailed|KubeContainerWaiting"

receivers:
  - name: "null"
  - name: "google-chat-alerts"
    webhook_configs:
      - url: "http://alertmanager-gchat-bridge.monitoring.svc.cluster.local:8090/alerts"
        send_resolved: true

templates:
  - '/etc/alertmanager/config/*.tmpl'
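
A quick way to reason about this routing tree is to model it in a few lines of Python. This is a simplified sketch, not how Alertmanager is implemented: `pick_receiver` and `group_key` are made-up helper names, and it only covers the single child route and the `group_by` labels from the config above. Alertmanager anchors regex matchers, so `re.fullmatch` is the closest equivalent.

```python
import re

# The alertname regex from the child route in the config above
ROUTED_ALERTNAMES = (
    "High5xxErrorRate|High4xxErrorRate|High5xxErrorRatioPerURI|High4xxErrorRatioPerURI|"
    "KubePodCrashLooping|KubePodNotReady|KubeDeploymentReplicasMismatch|KubeHpaMaxedOut|"
    "KubeJobFailed|KubeContainerWaiting"
)


def pick_receiver(labels):
    """Return the receiver an alert with these labels would be routed to."""
    # The whole alertname has to match, not just a substring
    if re.fullmatch(ROUTED_ALERTNAMES, labels.get("alertname", "")):
        return "google-chat-alerts"
    return "null"  # default receiver: the alert is dropped silently


def group_key(labels):
    """Alerts sharing this key are batched into one notification."""
    return (labels.get("alertname", ""), labels.get("namespace", ""))


print(pick_receiver({"alertname": "High5xxErrorRate", "namespace": "prod"}))
print(pick_receiver({"alertname": "Watchdog"}))
```

Two alerts with the same `alertname` and `namespace` but different `service` labels share a group key, so they arrive as one Chat message rather than two.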

Apply it:

1. Apply bridge

kubectl apply -f gchat-bridge.yaml

2. Apply PrometheusRule

kubectl apply -f prometheus-rule.yaml

3. Apply Alertmanager config

kubectl create secret generic alertmanager-prometheus-kube-prometheus-alertmanager \
  --from-file=alertmanager.yaml=./alertmanager.yaml \
  -n monitoring --dry-run=client -o yaml | kubectl replace -f -

4. Restart Alertmanager

kubectl rollout restart statefulset alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring

5. Test it (fires instantly, no need to wait 5m). This assumes Alertmanager is reachable on localhost:9093, for example through a kubectl port-forward.

curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "High5xxErrorRate",
      "severity": "critical",
      "namespace": "test",
      "service": "test-service",
      "application": "test-app"
    },
    "annotations": {
      "description": "5xx error rate is 15% on test-service in test namespace"
    },
    "startsAt": "2026-05-07T10:00:00Z"
  }]'
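
If you'd rather script the test than paste curl, a standard-library Python version could look like the sketch below. The helper names `build_test_alert` and `post_alerts` are made up for illustration, and it assumes the same localhost:9093 endpoint as the curl call. It stamps `startsAt` with the current UTC time instead of a hard-coded date.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(now=None):
    """Build the same test alert as the curl example, stamped with the given or current UTC time."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "High5xxErrorRate",
            "severity": "critical",
            "namespace": "test",
            "service": "test-service",
            "application": "test-app",
        },
        "annotations": {
            "description": "5xx error rate is 15% on test-service in test namespace",
        },
        "startsAt": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }]


def post_alerts(payload, base_url="http://localhost:9093"):
    """POST alerts to Alertmanager's v2 API and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


# With Alertmanager reachable on localhost:9093, send the alert with:
# post_alerts(build_test_alert())
```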

The result:

✅ Real-time 4xx/5xx alerts in Google Chat

✅ Pod crash alerts

✅ HPA maxed out alerts

✅ Resolved notifications (green checkmark)

✅ Zero SaaS cost, fully self-hosted

✅ Small footprint: the bridge requests just 10m CPU and 32Mi memory

The bridge is completely decoupled — changing PrometheusRule or Alertmanager config doesn't require redeploying the bridge.

What alerting setup are you running on your Kubernetes clusters? Would love to hear what works for your team 👇

#Kubernetes #DevOps #Monitoring #Prometheus #Alertmanager #GoogleChat #EKS #SRE #Platform #CloudNative #k8s
