Troubleshoot Kubernetes Events in TKE with Tencent Cloud CLS

Use Kubernetes Event logs to find abnormal nodes, Pod evictions, and cluster autoscaler decisions without waiting for business impact.

Kubernetes incidents often start as small state changes: a node reports disk pressure, a Pod is evicted, a scheduler decision fails, or a cluster autoscaler adds capacity. If those signals are only noticed after the workload is affected, the best troubleshooting window has already passed.

Tencent Kubernetes Engine events record those cluster state changes. Tencent Cloud CLS turns the event stream into searchable logs, dashboards, and operational views so teams can inspect what changed, when it changed, and which Kubernetes object was involved.

What Kubernetes Events Add To Cluster Operations

The TKE product family covers three deployment models: Tencent Kubernetes Engine, Elastic Kubernetes Service, and TKE Edge. Across those deployment models, Kubernetes Events are a lightweight but high-value signal because they describe state transitions, not just raw resource metrics.

The common fields are:

| Event field | Meaning |
| --- | --- |
| Type | Usually Normal or Warning, with custom types possible. |
| Involved Object | The object related to the event, such as Pod, Deployment, or Node. |
| Source | The component that reported the event, such as Scheduler or Kubelet. |
| Reason | A short enum-style description used internally by components. |
| Message | The detailed event message. |
| Count | How many times the event occurred. |
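
As a rough illustration of how these fields appear in practice, the sketch below pulls them out of an event represented as a Python dict. It assumes the record follows the Kubernetes core/v1 Event JSON shape (`type`, `involvedObject`, `source.component`, `reason`, `message`, `count`). The `EventSummary` class, the `summarize_event` helper, and the sample values are hypothetical and not part of TKE or CLS.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class EventSummary:
    """The fields an operator usually reads first (hypothetical helper type)."""
    type: str
    object_kind: str
    object_name: str
    component: str
    reason: str
    message: str
    count: int


def summarize_event(event: dict[str, Any]) -> EventSummary:
    """Extract the common fields from a core/v1-style Event dict."""
    involved = event.get("involvedObject", {})
    source = event.get("source", {})
    return EventSummary(
        type=event.get("type", ""),
        object_kind=involved.get("kind", ""),
        object_name=involved.get("name", ""),
        component=source.get("component", ""),
        reason=event.get("reason", ""),
        message=event.get("message", ""),
        # count may be missing or null; treat that as a single occurrence.
        count=int(event.get("count") or 1),
    )


# Illustrative record only; it is not copied from a real cluster.
sample_event = {
    "type": "Warning",
    "involvedObject": {"kind": "Node", "name": "172.16.18.13"},
    "source": {"component": "kubelet"},
    "reason": "EvictionThresholdMet",
    "message": "Attempting to reclaim ephemeral-storage",
    "count": 3,
}

summary = summarize_event(sample_event)
print(f"{summary.type} {summary.object_kind}/{summary.object_name} "
      f"{summary.reason} x{summary.count}")
```

Defaulting a missing count to 1 is a design choice for this sketch, so that later aggregation never has to handle a null.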

You can inspect similar information with kubectl describe, but centralizing it in CLS makes the data searchable across clusters and time windows. The useful mental model is: Kubernetes emits a state-change record, CLS stores it as a log event, and operators search by object, component, reason, message, count, and timestamp.

CLS provides collection, storage, search, and analytics for Kubernetes event logs. In the TKE console, operators can open Cluster Operations -> Event Search and use two entry points:

| Entry point | Best use |
| --- | --- |
| Event overview | Start from warning volume, affected resource objects, and event trends. |
| Global search | Run field-level queries and reconstruct a detailed event timeline. |

The overview is the broad health view. It helps you see whether the cluster is dominated by node warnings, Pod scheduling issues, kubelet actions, or autoscaler decisions before narrowing the search.
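
The kind of aggregation behind such an overview can be approximated offline. The following is a minimal sketch, assuming events have been exported as dicts in the Kubernetes core/v1 Event shape. The `warning_overview` function and the sample records are hypothetical, and the console's own dashboards may compute their numbers differently.

```python
from collections import Counter


def warning_overview(events):
    """Count Warning occurrences by reporting component and by object kind."""
    by_component = Counter()
    by_kind = Counter()
    for event in events:
        if event.get("type") != "Warning":
            continue  # Normal events are left out of the warning view.
        weight = event.get("count") or 1  # repeated events carry a count
        component = event.get("source", {}).get("component", "unknown")
        kind = event.get("involvedObject", {}).get("kind", "unknown")
        by_component[component] += weight
        by_kind[kind] += weight
    return by_component, by_kind


# Illustrative records only.
sample_events = [
    {"type": "Warning", "count": 3, "reason": "EvictionThresholdMet",
     "source": {"component": "kubelet"},
     "involvedObject": {"kind": "Node", "name": "172.16.18.13"}},
    {"type": "Warning", "count": 2, "reason": "FailedScheduling",
     "source": {"component": "default-scheduler"},
     "involvedObject": {"kind": "Pod", "name": "example-pod"}},
    {"type": "Normal", "count": 1, "reason": "Scheduled",
     "source": {"component": "default-scheduler"},
     "involvedObject": {"kind": "Pod", "name": "other-pod"}},
]

by_component, by_kind = warning_overview(sample_events)
for component, total in by_component.most_common():
    print(component, total)
```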

Scenario 1: Find Why A Node Became Abnormal

When one node becomes unhealthy, start from the event overview and filter by the abnormal node name. In this example, the matching record reports that the node has insufficient disk space.

The timeline then shows that on 2020-11-25, node 172.16.18.13 entered an abnormal state because disk space was insufficient. After that, kubelet began evicting Pods from the node to reclaim local disk. The operational reading is straightforward: disk pressure appears first, kubelet eviction follows, and the next check should focus on node disk usage, eviction thresholds, and workload placement.

This is the useful part of event-based troubleshooting: the event stream connects the visible node condition to the component action that followed it.

| Question | Event evidence to inspect |
| --- | --- |
| Which node changed state? | event.involvedObject.name |
| What was the immediate reason? | event.reason and event.message |
| Which component reported it? | event.source.component |
| Did it repeat? | event.count and event trend |
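
The questions above can be turned into a small timeline reconstruction. This is a sketch under stated assumptions: events are dicts in the Kubernetes core/v1 Event shape, `lastTimestamp` is an RFC 3339 UTC string, and kubelet sets `source.host` to the node name. That last assumption matters because Pod eviction events name the Pod, not the node, as the involved object. The `node_timeline` function, the Pod names, and the times of day are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # RFC 3339 UTC, e.g. 2020-11-25T10:00:00Z


def node_timeline(events, node_name):
    """Return (time, component, reason, message) rows for one node, oldest first."""
    rows = []
    for event in events:
        on_node = event.get("involvedObject", {}).get("name") == node_name
        from_node = event.get("source", {}).get("host") == node_name
        if not (on_node or from_node):
            continue
        rows.append((
            datetime.strptime(event["lastTimestamp"], TIME_FORMAT),
            event.get("source", {}).get("component", ""),
            event.get("reason", ""),
            event.get("message", ""),
        ))
    return sorted(rows)  # tuples sort by their first element, the timestamp


# Illustrative records; times and Pod names are made up for the example.
sample_events = [
    {"lastTimestamp": "2020-11-25T10:02:00Z", "reason": "Evicted",
     "message": "The node was low on resource: ephemeral-storage.",
     "source": {"component": "kubelet", "host": "172.16.18.13"},
     "involvedObject": {"kind": "Pod", "name": "example-pod"}},
    {"lastTimestamp": "2020-11-25T10:00:00Z", "reason": "NodeHasDiskPressure",
     "message": "Node 172.16.18.13 status is now: NodeHasDiskPressure",
     "source": {"component": "kubelet", "host": "172.16.18.13"},
     "involvedObject": {"kind": "Node", "name": "172.16.18.13"}},
    {"lastTimestamp": "2020-11-25T10:01:00Z", "reason": "Scheduled",
     "message": "Successfully assigned default/other-pod to another node",
     "source": {"component": "default-scheduler"},
     "involvedObject": {"kind": "Pod", "name": "other-pod"}},
]

timeline = node_timeline(sample_events, "172.16.18.13")
for when, component, reason, message in timeline:
    print(when.isoformat(), component, reason, message)
```

Sorting oldest first makes the cause-then-effect order easy to read: the disk pressure condition appears before the eviction that follows it.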

Scenario 2: Reconstruct Cluster Autoscaler Expansion

For clusters with node pool autoscaling enabled, the cluster autoscaler can add or remove nodes based on workload pressure. If nodes are added automatically, the question becomes: which Pods triggered the expansion and why did it stop?

Use global search with the autoscaler component:

event.source.component:"cluster-autoscaler"

Then display fields such as event.reason, event.message, and event.involvedObject.name, and sort by log time descending. The result should read like an event ledger: each row connects a workload object, autoscaler decision, and message explaining whether a node was added, skipped, or blocked by a limit.

The event stream shows expansion around 2020-11-25 20:35:45. Three nginx Pods triggered the scale-out:

  • nginx-5dbf784b68-tq8rd
  • nginx-5dbf784b68-fpvbx
  • nginx-5dbf784b68-v9jv5

The cluster eventually added three nodes. Scale-out then stopped because the node pool had reached its maximum node count.
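
The same reading can be reproduced from exported records. The sketch below assumes events are dicts in the Kubernetes core/v1 Event shape and that the autoscaler uses the upstream cluster-autoscaler reason names `TriggeredScaleUp` and `NotTriggerScaleUp`; the reasons emitted in a given TKE cluster may differ, so check them against real search results first. The `scale_up_triggers` function, the event messages, and the `nginx-example-pod` name are hypothetical.

```python
AUTOSCALER = "cluster-autoscaler"


def scale_up_triggers(events):
    """Return names of Pods reported as scale-up triggers, in first-seen order."""
    seen = []
    for event in events:
        if event.get("source", {}).get("component") != AUTOSCALER:
            continue
        if event.get("reason") != "TriggeredScaleUp":  # assumed reason name
            continue
        name = event.get("involvedObject", {}).get("name")
        if name and name not in seen:
            seen.append(name)
    return seen


def autoscaler_event(pod, reason, message):
    """Build an illustrative autoscaler event for the sample data."""
    return {
        "source": {"component": AUTOSCALER},
        "involvedObject": {"kind": "Pod", "name": pod},
        "reason": reason,
        "message": message,
    }


# Pod names for the triggers come from the scenario; messages are illustrative.
sample_events = [
    autoscaler_event("nginx-5dbf784b68-tq8rd", "TriggeredScaleUp",
                     "pod triggered scale-up"),
    autoscaler_event("nginx-5dbf784b68-fpvbx", "TriggeredScaleUp",
                     "pod triggered scale-up"),
    autoscaler_event("nginx-5dbf784b68-v9jv5", "TriggeredScaleUp",
                     "pod triggered scale-up"),
    autoscaler_event("nginx-example-pod", "NotTriggerScaleUp",
                     "pod didn't trigger scale-up: max node group size reached"),
]

triggers = scale_up_triggers(sample_events)
print(triggers)
```

Events that did not trigger a scale-up are worth listing separately, since their messages are where a node pool limit would typically be reported.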

Practical Runbook

  1. Open the TKE event search page.
  2. Start with the event overview for broad health and warning distribution.
  3. Filter by abnormal node, Pod, Deployment, or component name.
  4. For autoscaling investigations, query cluster-autoscaler.
  5. Add event.reason, event.message, and event.involvedObject.name to the visible fields.
  6. Sort by log time descending to reconstruct the latest state transitions.
  7. Use the event chain to decide whether the next action is node cleanup, Pod rescheduling, node pool limit adjustment, or deeper workload debugging.
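
For steps 3 and 4, a small helper can keep search strings consistent across a team. This is a minimal sketch that assumes CLS accepts `field:"value"` terms joined with `AND`, following the pattern of the autoscaler query shown earlier; verify the operator and escaping rules against the CLS search syntax documentation before relying on it. The `build_event_query` function is hypothetical.

```python
def build_event_query(filters: dict[str, str]) -> str:
    """Join exact-match field filters into one search string."""
    clauses = []
    for field, value in filters.items():
        # Escape backslashes and double quotes so the value stays one term.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"')
    return " AND ".join(clauses)


autoscaler_query = build_event_query(
    {"event.source.component": "cluster-autoscaler"}
)
node_query = build_event_query({
    "event.involvedObject.kind": "Node",
    "event.involvedObject.name": "172.16.18.13",
})

print(autoscaler_query)
print(node_query)
```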

FAQ

Are Kubernetes Events a replacement for metrics?

No. Metrics explain resource levels and trends. Events explain state changes and component decisions. They are strongest when used together.

Why send Events to CLS instead of only using kubectl describe?

CLS gives a central searchable history, dashboards, filtering, and field-level analysis across time windows. That is more practical when the problem spans multiple nodes or happened earlier.

Which event fields matter first during an incident?

Start with object name, source component, reason, message, count, and log time.