🚨 Built a Production Kubernetes Alerting Pipeline → Google Chat in under an hour
After getting tired of missing critical alerts, I set up a complete alerting pipeline on our production EKS cluster that sends 4xx/5xx errors, latency spikes, and pod failures directly to Google Chat — zero third-party SaaS needed.
Here's the full setup 👇
The Stack:
Prometheus + kube-prometheus-stack
Custom PrometheusRule (PromQL)
Alertmanager routing
Lightweight Python/Flask webhook bridge
Google Chat webhook
Step 1 — PrometheusRule (what fires the alerts)
Fires on real signal only — 401s and 404s excluded from 4xx (they're noise):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-http-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: http-error-rate-alerts
      interval: 30s
      rules:
        - alert: High5xxErrorRate
          expr: |
            (
              sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application, namespace, service)
              /
              sum(rate(http_server_requests_seconds_count[5m])) by (application, namespace, service)
            ) > 0.10
          for: 5m
          labels:
            severity: critical
        - alert: High4xxErrorRate
          expr: |
            (
              sum(rate(http_server_requests_seconds_count{status=~"4..", status!~"401|404"}[5m])) by (application, namespace, service)
              /
              sum(rate(http_server_requests_seconds_count[5m])) by (application, namespace, service)
            ) > 0.20
          for: 5m
          labels:
            severity: warning
        - alert: High5xxErrorRatioPerURI
          expr: |
            (
              sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application, namespace, service, uri)
              /
              clamp_min(sum(rate(http_server_requests_seconds_count[5m])) by (application, namespace, service, uri), 1)
            ) > 0.10
          for: 5m
          labels:
            severity: critical
        - alert: High4xxErrorRatioPerURI
          expr: |
            (
              sum(rate(http_server_requests_seconds_count{status=~"4..", status!~"401|404"}[5m])) by (application, namespace, service, uri)
              /
              clamp_min(sum(rate(http_server_requests_seconds_count[5m])) by (application, namespace, service, uri), 1)
            ) > 0.25
          for: 5m
          labels:
            severity: warning
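To see what the per-URI rules do differently, here is a small Python sketch of the ratio arithmetic. It is an illustration only, not something Prometheus runs: the function name error_ratio and the sample request rates are made up, and it assumes clamp_min(..., 1) behaves as a floor of 1 request/second on the denominator.

```python
from typing import Optional


def error_ratio(error_rps: float, total_rps: float,
                min_total_rps: Optional[float] = None) -> float:
    """Errors divided by total traffic, optionally with a floor on the denominator.

    min_total_rps plays the role of clamp_min(total, 1) in the per-URI rules.
    """
    denominator = total_rps if min_total_rps is None else max(total_rps, min_total_rps)
    if denominator == 0:
        # PromQL would produce NaN here, which never satisfies "> threshold".
        return 0.0
    return error_rps / denominator


# A busy service: 30 failing requests/s out of 200 requests/s.
busy = error_ratio(30, 200)

# A rarely hit URI: 0.05 requests/s, all of them failing.
quiet_plain = error_ratio(0.05, 0.05)
quiet_clamped = error_ratio(0.05, 0.05, min_total_rps=1)

print("busy service above 10%:", busy > 0.10)
print("quiet URI, plain ratio above 10%:", quiet_plain > 0.10)
print("quiet URI, clamped ratio above 10%:", quiet_clamped > 0.10)
```

The takeaway: with the floor in place, a URI that gets a handful of requests cannot page anyone just because those few requests failed, which is presumably why the per-URI rules clamp the denominator while the service-level rules do not.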
Step 2 — GChat Bridge (the webhook microservice)
A tiny Python/Flask container that receives Alertmanager POSTs and forwards to Google Chat:
apiVersion: v1
kind: Secret
metadata:
  name: gchat-webhook
  namespace: monitoring
type: Opaque
stringData:
  WEBHOOK_URL: "https://chat.googleapis.com/v1/spaces/YOUR_SPACE/messages?key=YOUR_KEY&token=YOUR_TOKEN"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager-gchat-bridge
  namespace: monitoring
  labels:
    app: alertmanager-gchat-bridge
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager-gchat-bridge
  template:
    metadata:
      labels:
        app: alertmanager-gchat-bridge
    spec:
      containers:
        - name: gchat-bridge
          image: python:3.11-slim
          command: ["sh", "-c"]
          args:
            - |
              pip install --no-cache-dir flask requests && python - <<'EOF'
              from flask import Flask, request
              import requests, os, logging, time

              logging.basicConfig(level=logging.INFO)
              app = Flask(__name__)
              WEBHOOK_URL = os.environ['WEBHOOK_URL']

              @app.route('/health', methods=['GET'])
              def health():
                  return 'OK', 200

              @app.route('/alerts', methods=['POST'])
              def alerts():
                  try:
                      # Tolerate a missing or non-JSON body instead of raising
                      data = request.get_json(silent=True) or {}
                      alerts = data.get('alerts', [])
                      logging.info(f"Received {len(alerts)} alerts")
                      # Build one Google Chat message covering every alert in the batch
                      text = ""
                      for a in alerts:
                          labels = a.get('labels', {})
                          annotations = a.get('annotations', {})
                          status = a.get('status', 'firing')
                          icon = "✅" if status == "resolved" else "🚨"
                          text += f"{icon} *{labels.get('alertname','N/A')}* ({status.upper()})\n"
                          text += f"Severity: {labels.get('severity','N/A')}\n"
                          text += f"Namespace: {labels.get('namespace','N/A')}\n"
                          text += f"Service: {labels.get('service','N/A')}\n"
                          text += f"{annotations.get('description', annotations.get('summary',''))}\n\n"
                      if not text:
                          return 'OK', 200
                      # Up to 3 attempts; an HTTP error from Google Chat counts as a failure
                      for i in range(3):
                          try:
                              resp = requests.post(WEBHOOK_URL, json={"text": text}, timeout=10)
                              resp.raise_for_status()
                              logging.info(f"Sent to GChat: {resp.status_code}")
                              return 'OK', 200
                          except Exception as e:
                              logging.error(f"Attempt {i+1} failed: {e}")
                              time.sleep(2)
                      # Every attempt failed: report it so Alertmanager can retry the notification
                      return 'ERROR', 502
                  except Exception as e:
                      logging.error(f"Error: {e}")
                      return 'ERROR', 500

              app.run(host='0.0.0.0', port=8090)
              EOF
          ports:
            - containerPort: 8090
          env:
            - name: WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: gchat-webhook
                  key: WEBHOOK_URL
          resources:
            requests:
              cpu: 10m
              memory: 32Mi
            limits:
              cpu: 100m
              memory: 128Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 8090
            initialDelaySeconds: 15
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8090
            initialDelaySeconds: 10
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-gchat-bridge
  namespace: monitoring
spec:
  type: ClusterIP
  selector:
    app: alertmanager-gchat-bridge
  ports:
    - port: 8090
      targetPort: 8090
Step 3 — Alertmanager Config (routing rules)
Key design decisions:
Default receiver is null → unmatched alerts are silently dropped (no noise)
group_by: [alertname, namespace] → no "null" grouping issue
repeat_interval: 4h → no hourly spam during long incidents
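To make the grouping decision concrete, here is a rough Python sketch of how alerts collapse into notification groups when the key is (alertname, namespace). It is a simplification of what Alertmanager does, not its actual code: group_alerts and the sample label sets are invented for illustration, and a missing label is simply treated as an empty string.

```python
from collections import defaultdict

# Mirrors group_by: [alertname, namespace] from the route.
GROUP_BY = ("alertname", "namespace")


def group_alerts(alerts):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in GROUP_BY)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical firing alerts: two services in "shop", one in "billing".
firing = [
    {"alertname": "High5xxErrorRate", "namespace": "shop", "service": "cart"},
    {"alertname": "High5xxErrorRate", "namespace": "shop", "service": "checkout"},
    {"alertname": "High5xxErrorRate", "namespace": "billing", "service": "invoices"},
]

for key, members in group_alerts(firing).items():
    services = ", ".join(m["service"] for m in members)
    print(key, "->", services)
```

In this sketch the two "shop" alerts share one group, so they would arrive as a single message, while "billing" gets its own. That single message is also what repeat_interval: 4h re-sends while the group keeps firing.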
global:
  resolve_timeout: 5m
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity=~"warning|info"
    equal: [namespace, service]
  - source_matchers:
      - severity="warning"
    target_matchers:
      - severity="info"
    equal: [namespace, service]
route:
  receiver: "null"
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - receiver: google-chat-alerts
      matchers:
        - alertname=~"High5xxErrorRate|High4xxErrorRate|High5xxErrorRatioPerURI|High4xxErrorRatioPerURI|KubePodCrashLooping|KubePodNotReady|KubeDeploymentReplicasMismatch|KubeHpaMaxedOut|KubeJobFailed|KubeContainerWaiting"
receivers:
  - name: "null"
  - name: "google-chat-alerts"
    webhook_configs:
      - url: "http://alertmanager-gchat-bridge.monitoring.svc.cluster.local:8090/alerts"
        send_resolved: true
templates:
  - /etc/alertmanager/config/*.tmpl
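If the routing and inhibition rules feel abstract, this Python sketch walks through the same two decisions. Treat it as a mental model rather than Alertmanager's implementation: pick_receiver and is_inhibited are hypothetical helpers, the alert list is made up, and the sketch assumes regex matchers must match the whole label value and that a missing label counts the same as an empty one.

```python
import re

# Alert names taken from the child route's matcher.
ROUTED_ALERTNAMES = re.compile("|".join([
    "High5xxErrorRate", "High4xxErrorRate",
    "High5xxErrorRatioPerURI", "High4xxErrorRatioPerURI",
    "KubePodCrashLooping", "KubePodNotReady",
    "KubeDeploymentReplicasMismatch", "KubeHpaMaxedOut",
    "KubeJobFailed", "KubeContainerWaiting",
]))

# (source severity, target severities it silences), as in inhibit_rules.
INHIBIT_RULES = [("critical", {"warning", "info"}), ("warning", {"info"})]
EQUAL_LABELS = ("namespace", "service")


def pick_receiver(labels):
    """Return the receiver name; anything unmatched falls through to "null"."""
    name = labels.get("alertname", "")
    return "google-chat-alerts" if ROUTED_ALERTNAMES.fullmatch(name) else "null"


def is_inhibited(target, firing):
    """True if a more severe firing alert shares namespace and service."""
    for source in firing:
        same_scope = all(source.get(l, "") == target.get(l, "") for l in EQUAL_LABELS)
        for source_severity, target_severities in INHIBIT_RULES:
            if (same_scope
                    and source.get("severity") == source_severity
                    and target.get("severity") in target_severities):
                return True
    return False


critical = {"alertname": "High5xxErrorRate", "severity": "critical",
            "namespace": "shop", "service": "cart"}
warning = {"alertname": "High4xxErrorRate", "severity": "warning",
           "namespace": "shop", "service": "cart"}

print(pick_receiver(critical))             # routed to Google Chat
print(pick_receiver({"alertname": "Watchdog"}))  # dropped by the "null" receiver
print(is_inhibited(warning, [critical]))   # warning is muted while critical fires
```

The practical effect: while a critical 5xx alert is firing for a service, the warning-level 4xx alert for that same namespace and service stays quiet, so the channel shows one message per incident instead of two.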
Apply it:
1. Apply bridge
kubectl apply -f gchat-bridge.yaml
2. Apply PrometheusRule
kubectl apply -f prometheus-rule.yaml
3. Apply Alertmanager config
kubectl create secret generic alertmanager-prometheus-kube-prometheus-alertmanager \
--from-file=alertmanager.yaml=./alertmanager.yaml \
-n monitoring --dry-run=client -o yaml | kubectl replace -f -
4. Restart Alertmanager
kubectl rollout restart statefulset alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring
5. Test it (skips the 5m "for" window; the notification still waits for group_wait: 30s, and the command assumes Alertmanager is reachable on localhost:9093, e.g. through a port-forward)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "High5xxErrorRate",
      "severity": "critical",
      "namespace": "test",
      "service": "test-service",
      "application": "test-app"
    },
    "annotations": {
      "description": "5xx error rate is 15% on test-service in test namespace"
    },
    "startsAt": "2026-05-07T10:00:00Z"
  }]'
The result:
✅ Real-time 4xx/5xx alerts in Google Chat
✅ Pod crash alerts
✅ HPA maxed out alerts
✅ Resolved notifications (green checkmark)
✅ Zero SaaS cost, fully self-hosted
✅ Tiny footprint: the bridge requests 10m CPU and 32Mi memory
The bridge is completely decoupled — changing PrometheusRule or Alertmanager config doesn't require redeploying the bridge.
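Because the bridge only depends on the shape of the webhook payload, its formatting logic can be tried out in isolation, without Flask or a cluster. The sketch below pulls that logic into a standalone function. format_alerts is a name I made up for this example, and SAMPLE_PAYLOAD is a hand-written approximation of what Alertmanager would POST for the test alert, trimmed to the fields the bridge reads.

```python
def format_alerts(payload):
    """Render an Alertmanager webhook payload as Google Chat message text."""
    text = ""
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "firing")
        icon = "✅" if status == "resolved" else "🚨"
        text += f"{icon} *{labels.get('alertname', 'N/A')}* ({status.upper()})\n"
        text += f"Severity: {labels.get('severity', 'N/A')}\n"
        text += f"Namespace: {labels.get('namespace', 'N/A')}\n"
        text += f"Service: {labels.get('service', 'N/A')}\n"
        # Prefer the description annotation, fall back to summary.
        text += f"{annotations.get('description', annotations.get('summary', ''))}\n\n"
    return text


# Assumed payload shape for the test alert (only the fields used above).
SAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {
            "alertname": "High5xxErrorRate",
            "severity": "critical",
            "namespace": "test",
            "service": "test-service",
            "application": "test-app",
        },
        "annotations": {
            "description": "5xx error rate is 15% on test-service in test namespace",
        },
    }],
}

print(format_alerts(SAMPLE_PAYLOAD))
```

One thing this makes visible: the message body comes from the description or summary annotation, so rules that define neither will produce a message with an empty last line.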
What alerting setup are you running on your Kubernetes clusters? Would love to hear what works for your team 👇
#Kubernetes #DevOps #Monitoring #Prometheus #Alertmanager #GoogleChat #EKS #SRE #Platform #CloudNative #k8s