OpenShift Runbook
Root Cause Analysis — Node NotReady
Version: 1.0
Generated on: 2026-02-19 13:49:14
Audience: Operations / SRE / Platform Team
1. Purpose
This runbook provides a structured procedure for identifying the root cause of an OpenShift node entering the NotReady state.
It covers:
- Node conditions analysis
- Kubelet health validation
- Resource pressure (CPU, Memory, Disk)
- Storage and IO validation
- Runtime (CRI-O) and Kubelet inspection
- MachineConfigPool verification
- Prometheus forensic queries
2. Initial Setup
Set the node name:
NODE=<node-name>
Example:
NODE=ocp4-worker-0
3. Step 1 — Node Status & Conditions
oc get node $NODE -o wide
oc describe node $NODE | sed -n '/Conditions:/,/Addresses:/p'
oc describe node $NODE | sed -n '/Events:/,$p'
What to look for:
- Ready=False
- KubeletNotReady
- MemoryPressure=True
- DiskPressure=True
- PIDPressure=True
- NetworkUnavailable=True
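The conditions listed above can also be pulled in a machine-readable form instead of being read out of the describe output. The following is a minimal sketch, assuming a POSIX shell; the helper name flag_conditions is hypothetical and not part of oc. It reads Type=Status lines and prints only the ones that indicate a problem.

```shell
# Hypothetical helper (not part of oc): reads "Type=Status" lines on stdin
# and prints only the conditions that point to a problem.
flag_conditions() {
  while IFS='=' read -r type status; do
    case "$type=$status" in
      Ready=True) ;;                                # healthy, print nothing
      Ready=*)    echo "ABNORMAL $type=$status" ;;  # Ready is False or Unknown
      *=True)     echo "ABNORMAL $type=$status" ;;  # a pressure or network condition is set
    esac
  done
}

# Usage against a live cluster (shown as a comment, not executed here):
#   oc get node "$NODE" -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | flag_conditions
```

An empty result means every condition is in its healthy state; anything printed is a starting point for the steps that follow.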
4. Step 2 — Kubelet Health (API Proxy Check)
oc get --raw /api/v1/nodes/$NODE/proxy/healthz ; echo
oc get --raw /api/v1/nodes/$NODE/proxy/stats/summary | head
Result        Interpretation
Timeout       kubelet down / node frozen
200 OK        kubelet running
Stats error   runtime/storage issue
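The interpretation of the two probes can be expressed as a small decision function. This is a sketch only: classify_kubelet is a hypothetical helper, and it assumes that a healthy kubelet answers the healthz request with the literal body ok and that a failed or timed-out request leaves the body empty.

```shell
# Hypothetical helper: maps the two probe results to an interpretation.
#   $1 = body returned by /healthz (empty if the request failed or timed out)
#   $2 = exit status of the /stats/summary request
classify_kubelet() {
  if [ "$1" != "ok" ]; then
    echo "kubelet down or node frozen"
  elif [ "$2" -ne 0 ]; then
    echo "kubelet running, runtime/storage issue suspected"
  else
    echo "kubelet running"
  fi
}

# Usage against a live cluster (shown as comments, not executed here):
#   healthz=$(oc get --raw "/api/v1/nodes/$NODE/proxy/healthz" --request-timeout=10s 2>/dev/null)
#   oc get --raw "/api/v1/nodes/$NODE/proxy/stats/summary" --request-timeout=10s >/dev/null 2>&1
#   classify_kubelet "$healthz" "$?"
```

The 10 second request timeout is an arbitrary choice for illustration; without one, a frozen node can leave the request hanging for a long time.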
5. Step 3 — Node Debug Inspection
oc debug node/$NODE -- chroot /host bash
Inside the debug shell:
5.1 Reboot Check
uptime
who -b
journalctl -b -1 -n 200 --no-pager
5.2 Memory Check
free -m
vmstat 1 5
journalctl -k | grep -Ei "oom|out of memory"
5.3 Disk & Inode Check
df -hT
df -hi
Critical paths:
- /
- /var/lib/containers
- /var/lib/kubelet
If usage >85% → investigate immediately.
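The 85% threshold can be checked mechanically rather than by eye. The sketch below assumes the POSIX output format of df -P (usage in the fifth column, mount point in the sixth) and mount points without spaces; over_threshold is a hypothetical helper name.

```shell
# Hypothetical helper: reads `df -P` (or `df -Pi`) output on stdin and prints
# every mount whose usage exceeds the threshold (default 85, in percent).
over_threshold() {
  awk -v t="${1:-85}" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)                  # strip the trailing percent sign
    if (use + 0 > t + 0) print $6, use "%"
  }'
}

# Usage inside the debug shell (shown as comments, not executed here):
#   df -P  | over_threshold 85    # block usage
#   df -Pi | over_threshold 85    # inode usage
```

Running it on both block and inode output matters because a filesystem can run out of inodes while still showing free space.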
5.4 Storage / IO Errors
journalctl -k | grep -Ei "I/O error|blocked for more than|hung|reset|xfs|ext4"
5.5 Kubelet Logs
journalctl -u kubelet -n 200 --no-pager
Look for:
- PLEG is not healthy
- Container runtime is down
- DeadlineExceeded
- Eviction manager errors
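Counting how often each of these signatures appears gives a quicker picture than scrolling through 200 lines. A minimal sketch, assuming the patterns appear in the log text roughly as written above (matching is case-insensitive and on fixed strings); scan_kubelet_log is a hypothetical helper.

```shell
# Hypothetical helper: reads kubelet log text on stdin and prints how many
# lines contain each known failure signature.
scan_kubelet_log() {
  log=$(cat)
  for pat in "PLEG is not healthy" "container runtime is down" \
             "DeadlineExceeded" "eviction manager"; do
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
    n=$(printf '%s\n' "$log" | grep -ciF -- "$pat" || true)
    echo "$pat: $n"
  done
}

# Usage inside the debug shell (shown as a comment, not executed here):
#   journalctl -u kubelet -n 200 --no-pager | scan_kubelet_log
```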
5.6 CRI-O Logs
journalctl -u crio -n 200 --no-pager
Look for:
- storage timeouts
- overlay errors
- rpc timeout
5.7 Runtime Check
crictl ps -a
6. Step 4 — Cluster Resource Pressure
oc adm top nodes
oc adm top pods -A --sort-by=memory | head -20
oc adm top pods -A --sort-by=cpu | head -20
7. Step 5 — MachineConfigPool
oc get mcp
oc describe mcp worker
If:
- Updating=True
- Degraded=True
The node may be rebooting as part of a Machine Config Operator (MCO) update.
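The pool check can be reduced to one line per pool. This sketch assumes each input line carries the pool name followed by the status of its Updating and Degraded conditions; flag_mcp is a hypothetical helper and the messages it prints are illustrative wording only.

```shell
# Hypothetical helper: reads "name updating degraded" lines on stdin and
# reports pools that are mid-update or degraded.
flag_mcp() {
  while read -r name updating degraded; do
    if [ "$updating" = "True" ]; then
      echo "$name: Updating, nodes may be rebooting"
    fi
    if [ "$degraded" = "True" ]; then
      echo "$name: Degraded, check the machine-config operator"
    fi
  done
  return 0
}

# Usage against a live cluster (shown as a comment, not executed here):
#   oc get mcp -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Updating")].status}{" "}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}' | flag_mcp
```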
8. Prometheus Investigation (Observe → Metrics)
8.1 Node Ready State
kube_node_status_condition{condition="Ready"}
8.2 Nodes NotReady
kube_node_status_condition{condition="Ready",status="false"} == 1
8.3 Memory Usage %
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
8.4 OOM Events
increase(node_vmstat_oom_kill[1h])
8.5 Root Filesystem Usage
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
8.6 Container Runtime Disk Usage
100 - (node_filesystem_avail_bytes{mountpoint="/var/lib/containers"} / node_filesystem_size_bytes{mountpoint="/var/lib/containers"}) * 100
8.7 Disk IO Saturation
rate(node_disk_io_time_seconds_total[5m])
8.8 Kubelet Restart Detection
changes(process_start_time_seconds{job="kubelet"}[1h])
9. Common Root Causes
Symptom            Root Cause
healthz timeout    kubelet crash / node frozen
OOM events         Memory exhaustion
DiskPressure       Disk full
IO wait high       Storage backend latency
kubelet restarts   runtime crash
MCP Updating       MCO-triggered reboot
10. Escalation Guidance
Escalate to:
- Infrastructure team → IO errors, storage latency, VM reboot
- Application team → memory leaks, high CPU
- Platform team → MCO degraded, kubelet crash loop
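For teams that script their incident handling, the routing above can be kept as a small lookup. This is an illustrative sketch only: escalate_to and the finding keywords (io-error, high-cpu, and so on) are hypothetical names chosen for this example, not an existing convention.

```shell
# Hypothetical helper: maps a finding keyword to the team that owns it.
escalate_to() {
  case "$1" in
    io-error|storage-latency|vm-reboot) echo "Infrastructure team" ;;
    memory-leak|high-cpu)               echo "Application team" ;;
    mco-degraded|kubelet-crash-loop)    echo "Platform team" ;;
    *) echo "unknown finding: $1" >&2; return 1 ;;
  esac
}

# Usage (shown as a comment, not executed here):
#   escalate_to storage-latency
```

Returning a non-zero status for an unknown keyword keeps a typo from silently routing an incident nowhere.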
11. Recommended Immediate Actions
If workloads are impacted:
oc adm cordon $NODE
oc adm drain $NODE --ignore-daemonsets --delete-emptydir-data --force --grace-period=60 --timeout=10m
Then investigate offline.
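Because a forced drain evicts pods, it can help to preview the exact commands before running them. The wrapper below is a sketch under stated assumptions: run and evacuate_node are hypothetical helpers, and DRY_RUN is a variable introduced here for illustration. The oc flags are the same ones used in the cordon and drain commands above.

```shell
# Hypothetical wrapper: prints the command when DRY_RUN=1, runs it otherwise.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Cordon first, then drain; the drain is skipped if the cordon fails.
evacuate_node() {
  run oc adm cordon "$1" &&
  run oc adm drain "$1" --ignore-daemonsets --delete-emptydir-data \
    --force --grace-period=60 --timeout=10m
}

# Usage (shown as comments, not executed here):
#   DRY_RUN=1; evacuate_node "$NODE"    # preview only
#   DRY_RUN=0; evacuate_node "$NODE"    # actually cordon and drain
```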
12. Documentation
When closing the incident:
- Capture Prometheus graphs
- Attach kubelet logs
- Attach df output
- Document timeline (first NotReady occurrence)