Real-Time CDN Log Analysis with Tencent Cloud CLS

CDN logs are one of the fastest ways to understand whether a traffic spike is healthy growth, cache pressure, regional imbalance, or an application problem hiding behind edge traffic.

Default CDN monitoring usually covers basic metrics such as request count and bandwidth. That is useful, but it is not enough for interactive troubleshooting. Teams that download raw CDN logs and analyze them offline often pay for it with extra infrastructure, delayed data, and slower response during incidents.

Tencent Cloud CDN can deliver access logs into Tencent Cloud Log Service (CLS). Once the logs are in CLS, teams can use search, SQL analysis, dashboards, and alerting for real-time CDN quality and performance monitoring.

Why Send CDN Logs to CLS?

Delivering CDN logs to CLS provides four capabilities:

Capability	Operational value
One-click log delivery	CDN access logs become available in CLS without building a separate offline ingestion pipeline.
Second-level analysis at large scale	Teams can run interactive queries instead of waiting for delayed batch processing.
Real-time dashboard visualization	CDN quality, cache behavior, bandwidth, and errors can be watched continuously.
One-minute alerting	Latency or error conditions can trigger notifications quickly.

This makes CLS a better fit than offline log analysis for scenarios such as real-time issue localization, fast validation, CDN alerting, and customized performance analysis.

CDN Log Fields Worth Indexing

Index the fields that answer performance, traffic, and error questions.

Field	CLS type	Meaning
`app_id`	`long`	Tencent Cloud account APPID.
`client_ip`	`text`	Client IP.
`file_size`	`long`	File size.
`hit`	`text`	Cache `HIT` or `MISS`; edge-node and parent-node hits are both marked as `HIT`.
`host`	`text`	Domain.
`http_code`	`long`	HTTP status code.
`isp`	`text`	ISP.
`method`	`text`	HTTP method.
`param`	`text`	URL parameters.
`proto`	`text`	HTTP protocol identifier.
`prov`	`text`	ISP province.
`referer`	`text`	HTTP referer.
`request_range`	`text`	Request range.
`request_time`	`long`	Response time in milliseconds, from receiving the request to finishing the response to the client.
`request_port`	`long`	Client-to-CDN-node connection port, or `-` when unavailable.
`rsp_size`	`long`	Response size.
`time`	`long`	Request time as a Unix timestamp in seconds.
`ua`	`text`	User-Agent.
`url`	`text`	Request path.
`uuid`	`text`	Unique request identifier.
`version`	`long`	CDN real-time log version.

For operational dashboards, prioritize request_time, http_code, hit, host, url, client_ip, rsp_size, prov, and isp.
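
As a rough illustration of how the prioritized fields map to types, the sketch below converts one string-valued log entry into a typed record. The field names and types follow the table above; the dict-shaped input, the `CdnLogRecord` class, the `parse_record` helper, and the sample values are all hypothetical and are not part of any CLS SDK.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical container: field names come from the table above,
# everything else here is an assumption made for illustration.
@dataclass
class CdnLogRecord:
    host: str
    url: str
    client_ip: str
    hit: str                     # "HIT" or "MISS"
    http_code: int
    request_time: int            # response time in milliseconds
    rsp_size: int                # response size as reported in the log
    prov: str
    isp: str
    request_port: Optional[int]  # None when the log carries "-"


def parse_record(raw: dict) -> CdnLogRecord:
    """Convert a raw, string-valued log entry into typed fields."""
    port = raw.get("request_port", "-")
    return CdnLogRecord(
        host=raw["host"],
        url=raw["url"],
        client_ip=raw["client_ip"],
        hit=raw["hit"],
        http_code=int(raw["http_code"]),
        request_time=int(raw["request_time"]),
        rsp_size=int(raw["rsp_size"]),
        prov=raw["prov"],
        isp=raw["isp"],
        request_port=None if port == "-" else int(port),
    )


# Made-up sample entry, not real log data.
sample = {
    "host": "static.example.com",
    "url": "/img/logo.png",
    "client_ip": "203.0.113.7",
    "hit": "HIT",
    "http_code": "200",
    "request_time": "35",
    "rsp_size": "5120",
    "prov": "Guangdong",
    "isp": "ExampleNet",
    "request_port": "-",
}
record = parse_record(sample)
print(record.http_code, record.request_time, record.request_port)
```

Keeping `request_port` optional reflects the `-` placeholder described in the table, so numeric comparisons never run against a missing value.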

Monitor High CDN Latency with Percentiles

Averages hide tail latency: a small number of slow requests can be smoothed away by the average value. Percentile-based monitoring, especially p99 latency, keeps those slow requests visible.

Use a time-series query to compare average, p50, and p99 latency over a day-level window.

* |
select
  avg(request_time) as avg_latency,
  approx_percentile(request_time, 0.5) as p50,
  approx_percentile(request_time, 0.99) as p99,
  time_series(__TIMESTAMP__, '5m', '%Y-%m-%d %H:%i:%s', '0') as time
group by time
order by time desc
limit 1440

For alerting, reduce the query to the signal the rule needs:

* |
select
  approx_percentile(request_time, 0.99) as p99

A practical alert condition is p99 latency greater than a threshold such as 100 ms, with the affected host, url, and client_ip included in the notification through multidimensional analysis.
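
To make the gap between average and p99 concrete, here is a minimal sketch on made-up latencies. It uses an exact nearest-rank percentile, which is an assumption chosen for simplicity and is not how `approx_percentile` is computed inside CLS; the `percentile` helper, the sample values, and the 100 ms threshold constant are illustrative only.

```python
import math


def percentile(values, q):
    """Exact nearest-rank percentile; q is a fraction in (0, 1]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample: 97 fast requests and 3 slow ones, in milliseconds.
latencies_ms = [20] * 97 + [900] * 3

avg = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 0.5)
p99 = percentile(latencies_ms, 0.99)

# Example threshold taken from the text above.
P99_THRESHOLD_MS = 100
should_alert = p99 > P99_THRESHOLD_MS

print(f"avg={avg} p50={p50} p99={p99} alert={should_alert}")
```

In this sample the average stays well under the threshold while p99 is far above it, which is the situation a rule based on averages would miss.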

Detect Error Spikes Minute by Minute

When page access errors increase sharply, the cause may be backend failure, overload, or a sudden change in traffic quality. One way to catch this is to monitor the difference in error count between the latest minute and the previous minute.

Latest minute error count:

* |
select *
from (
  select date_trunc('minute', __TIMESTAMP__) as time,
         count(*) as errct
  where http_code >= 400
  group by time
  order by time desc
  limit 2
)
order by time desc
limit 1

Previous minute error count:

* |
select *
from (
  select date_trunc('minute', __TIMESTAMP__) as time,
         count(*) as errct
  where http_code >= 400
  group by time
  order by time desc
  limit 2
)
order by time asc
limit 1

The alert condition is:

latest minute error count - previous minute error count > configured threshold
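
The condition above can be sketched outside CLS as follows. This is an illustration under assumptions, not the CLS alerting engine: records are plain dicts with the `time` (Unix seconds) and `http_code` fields from the field table, and the helper names and sample data are hypothetical. Like the two queries, it compares the two most recent minutes that contain errors.

```python
from collections import Counter


def error_counts_per_minute(records):
    """Count records with http_code >= 400, keyed by the start of their minute."""
    counts = Counter()
    for rec in records:
        if rec["http_code"] >= 400:
            minute_start = rec["time"] - rec["time"] % 60
            counts[minute_start] += 1
    return counts


def error_spike(counts, threshold):
    """True when the latest minute has more than `threshold` extra errors
    compared with the previous minute that contains errors."""
    minutes = sorted(counts, reverse=True)[:2]
    if len(minutes) < 2:
        return False  # not enough history to compare
    latest, previous = minutes
    return counts[latest] - counts[previous] > threshold


# Made-up data: 2 errors in one minute, 8 errors in the next.
base = 1_700_000_040  # an arbitrary timestamp aligned to a minute boundary
records = (
    [{"time": base + 5, "http_code": 500}, {"time": base + 30, "http_code": 502}]
    + [{"time": base + 10, "http_code": 200}]
    + [{"time": base + 60 + i, "http_code": 503} for i in range(8)]
)

counts = error_counts_per_minute(records)
print(error_spike(counts, threshold=5))
```

Returning `False` when only one minute has errors is a design choice for this sketch; a production rule may prefer to treat a first appearance of errors as a spike.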

Route the notification to the operations channel that handles CDN incidents. Include the domain, URL, client IP, and HTTP code distribution so the responder can decide whether the issue is global, domain-specific, or resource-specific.

Build CDN Dashboards for Performance and Quality

The dashboard layer should turn log data into a small set of recurring questions.

Dashboard panel	Suggested dimensions	Question it answers
Health score or overall status	time range, domain	Is the CDN service healthy right now?
Cache hit ratio	`hit`, `host`	Are requests served from cache as expected?
Average downstream bandwidth	`rsp_size`, time bucket, domain	Is bandwidth pressure rising during the traffic peak?
HTTP status distribution	`http_code`, time bucket	Are 4xx or 5xx responses increasing?
Top URLs by traffic or errors	`url`, `rsp_size`, `http_code`	Which resource is driving load or failure?
Client and regional distribution	`client_ip`, `prov`, `isp`	Is the issue concentrated by geography or ISP?

The visual workflow in CLS supports query results, statistic cards, line charts, bar charts, and distribution views. A useful dashboard combines one top-level health view with drill-down panels for cache, bandwidth, status codes, and affected resources.
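
Two of the panels above, cache hit ratio and HTTP status distribution, reduce to simple aggregations. The sketch below shows the arithmetic on a handful of made-up records; the dict-shaped records and both helper functions are assumptions for illustration, and in practice the aggregation would run as a CLS query behind the chart.

```python
from collections import Counter


def hit_ratio_by_host(records):
    """Fraction of requests marked HIT, per host."""
    totals, hits = Counter(), Counter()
    for rec in records:
        totals[rec["host"]] += 1
        if rec["hit"] == "HIT":
            hits[rec["host"]] += 1
    return {host: hits[host] / totals[host] for host in totals}


def status_class_distribution(records):
    """Request count per status class such as 2xx, 4xx, 5xx."""
    return Counter(f"{rec['http_code'] // 100}xx" for rec in records)


# Hypothetical sample records.
records = [
    {"host": "a.example.com", "hit": "HIT", "http_code": 200},
    {"host": "a.example.com", "hit": "HIT", "http_code": 200},
    {"host": "a.example.com", "hit": "HIT", "http_code": 404},
    {"host": "a.example.com", "hit": "MISS", "http_code": 502},
    {"host": "b.example.com", "hit": "MISS", "http_code": 200},
]

ratios = hit_ratio_by_host(records)
status = status_class_distribution(records)
print(ratios, dict(status))
```

Grouping by host keeps a low-traffic domain with a poor hit ratio from being hidden by a high-traffic domain with a good one.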

Incident Runbook

When a CDN traffic peak starts, use this order:

1. Check p99 request_time, not only average latency.
2. Compare the current minute error count with the previous minute.
3. Split errors by host, url, and http_code.
4. Check hit to understand whether cache behavior changed.
5. Review rsp_size and traffic volume to separate traffic growth from error growth.
6. Add client_ip, prov, and isp when the issue may be regional or network-specific.
7. Turn the final query into an alert if it represents a recurring operational risk.
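
The error-splitting step of the runbook can be sketched as a grouped count. This is a minimal illustration assuming dict-shaped records with the `host`, `url`, and `http_code` fields; the `top_error_groups` helper and the sample data are hypothetical.

```python
from collections import Counter


def top_error_groups(records, limit=3):
    """Most frequent (host, url, http_code) combinations among error responses."""
    groups = Counter(
        (rec["host"], rec["url"], rec["http_code"])
        for rec in records
        if rec["http_code"] >= 400
    )
    return groups.most_common(limit)


# Made-up records: one URL dominates the errors.
records = (
    [{"host": "a.example.com", "url": "/video/1.mp4", "http_code": 503}] * 5
    + [{"host": "a.example.com", "url": "/img/x.png", "http_code": 404}] * 2
    + [{"host": "b.example.com", "url": "/index.html", "http_code": 200}] * 10
)

top = top_error_groups(records)
for (host, url, code), count in top:
    print(host, url, code, count)
```

If one combination dominates the output, the issue is likely resource-specific; a flat distribution across hosts points toward a broader problem.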

FAQ

Why not rely only on CDN default metrics?

Default metrics are useful for broad visibility, but log analysis is better when the team needs custom dimensions such as URL, client IP, referer, ISP, cache hit state, or request-level latency.

Why use p99 latency for alerting?

p99 keeps tail latency visible. Averages can look normal while a meaningful subset of users experiences slow responses.

What fields should be included in alert messages?

Use the affected host, url, client_ip, http_code, and latency metric. These fields help responders judge impact and choose the next query.
