Real-Time CDN Log Analysis with Tencent Cloud CLS

Use CDN logs in CLS to monitor latency percentiles, detect error spikes, analyze cache performance, and build operational


CDN logs are one of the fastest ways to understand whether a traffic spike is healthy growth, cache pressure, regional imbalance, or an application problem hiding behind edge traffic.

Default CDN monitoring usually covers basic metrics such as request count and bandwidth. That is useful, but it is not enough for interactive troubleshooting. Teams that download raw CDN logs and analyze them offline often pay for it with extra infrastructure, delayed data, and slower response during incidents.

Tencent Cloud CDN can deliver access logs into Tencent Cloud Log Service (CLS). Once the logs are in CLS, teams can use search, SQL analysis, dashboards, and alerting for real-time CDN quality and performance monitoring.

Why Send CDN Logs to CLS?

The source workflow highlights four capabilities:

| Capability | Operational value |
| --- | --- |
| One-click log delivery | CDN access logs become available in CLS without building a separate offline ingestion pipeline. |
| Analysis within seconds at large scale | Teams can run interactive queries instead of waiting for delayed batch processing. |
| Real-time dashboard visualization | CDN quality, cache behavior, bandwidth, and errors can be watched continuously. |
| One-minute alerting | Latency or error conditions can trigger notifications quickly. |

This makes CLS a better fit for scenarios such as real-time issue localization, fast validation, CDN alerting, and customized performance analysis.

CDN Log Fields Worth Indexing

Index the fields that answer performance, traffic, and error questions.

| Field | CLS type | Meaning |
| --- | --- | --- |
| app_id | long | Tencent Cloud account APPID. |
| client_ip | text | Client IP. |
| file_size | long | File size. |
| hit | text | Cache HIT or MISS; edge-node and parent-node hits are both marked as HIT. |
| host | text | Domain. |
| http_code | long | HTTP status code. |
| isp | text | ISP. |
| method | text | HTTP method. |
| param | text | URL parameters. |
| proto | text | HTTP protocol identifier. |
| prov | text | ISP province. |
| referer | text | HTTP referer. |
| request_range | text | Request range. |
| request_time | long | Response time in milliseconds, from receiving the request to finishing the response to the client. |
| request_port | long | Client-to-CDN-node connection port, or - when unavailable. |
| rsp_size | long | Response size. |
| time | long | Request time as a Unix timestamp in seconds. |
| ua | text | User-Agent. |
| url | text | Request path. |
| uuid | text | Unique request identifier. |
| version | long | CDN real-time log version. |

For operational dashboards, prioritize request_time, http_code, hit, host, url, client_ip, rsp_size, prov, and isp.
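
When logs are pulled out of CLS for ad hoc checks, the `long` fields need to become integers before any arithmetic is done on them. The sketch below shows one way to do that in Python. It assumes each record arrives as a flat dictionary of string values; that shape and the helper names `to_long` and `coerce_record` are hypothetical and are not part of any CLS API.

```python
from typing import Any, Dict, Optional

# Fields listed with CLS type `long` in the table above; all others stay text.
LONG_FIELDS = {
    "app_id", "file_size", "http_code", "request_time",
    "request_port", "rsp_size", "time", "version",
}


def to_long(value: Any) -> Optional[int]:
    """Return an int, or None for missing or non-numeric values such as '-'."""
    if value is None:
        return None
    text = str(value).strip()
    if text in ("", "-"):
        return None
    try:
        return int(text)
    except ValueError:
        return None


def coerce_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Coerce a flat, string-valued CDN log record (assumed shape)."""
    return {
        key: to_long(value) if key in LONG_FIELDS else value
        for key, value in raw.items()
    }


# Synthetic record for illustration only.
sample = {
    "host": "example.com",
    "http_code": "200",
    "request_time": "35",
    "request_port": "-",
    "hit": "HIT",
}
record = coerce_record(sample)
```

Mapping `-` to `None` keeps an unavailable `request_port` from being mistaken for a real port number.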

Monitor High CDN Latency with Percentiles

Averages hide tail latency. The source workflow recommends percentile-based monitoring, especially p99 latency, because a small number of slow requests can be smoothed away by average values.
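
To make the point concrete, here is a minimal Python sketch with synthetic `request_time` values. The `percentile` helper is a hypothetical exact nearest-rank implementation written for this illustration; it is not how CLS computes `approx_percentile`, which returns an approximation.

```python
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], percent: int) -> float:
    """Exact nearest-rank percentile for an integer percent in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100 gives the 1-based rank.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Synthetic latencies in ms: 98 fast requests and 2 very slow ones.
latencies = [20] * 98 + [2000] * 2

avg_latency = mean(latencies)      # 59.6 ms, below a 100 ms threshold
p50 = percentile(latencies, 50)    # 20 ms
p99 = percentile(latencies, 99)    # 2000 ms, far above the threshold
```

In this sample an alert on the average with a 100 ms threshold would stay silent, while an alert on p99 would fire.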

Use a time-series query to compare average, p50, and p99 latency over a day-level window.

* |
select
  avg(request_time) as avg_latency,
  approx_percentile(request_time, 0.5) as p50,
  approx_percentile(request_time, 0.99) as p99,
  time_series(__TIMESTAMP__, '5m', '%Y-%m-%d %H:%i:%s', '0') as time
group by time
order by time desc
limit 1440

For alerting, reduce the query to the signal the rule needs:

* |
select
  approx_percentile(request_time, 0.99) as p99

A practical alert condition is p99 latency greater than a threshold such as 100 ms, with the affected host, url, and client_ip included in the notification through multidimensional analysis.

Detect Error Spikes Minute by Minute

When page access errors increase sharply, the cause may be backend failure, overload, or a sudden change in traffic quality. The source workflow monitors the difference between the latest minute and the previous minute.

Latest minute error count:

* |
select *
from (
  select *
  from (
    select *
    from (
      select date_trunc('minute', __TIMESTAMP__) as time,
             count(*) as errct
      where http_code >= 400
      group by time
      order by time desc
      limit 2
    )
  )
  order by time desc
  limit 1
)

Previous minute error count:

* |
select *
from (
  select *
  from (
    select *
    from (
      select date_trunc('minute', __TIMESTAMP__) as time,
             count(*) as errct
      where http_code >= 400
      group by time
      order by time desc
      limit 2
    )
  )
  order by time asc
  limit 1
)

The alert condition is:

latest minute error count - previous minute error count > configured threshold
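
The same comparison can be sketched outside CLS to check the alert logic against sample data. The Python below assumes records are dictionaries with an integer `time` (Unix seconds) and `http_code`; the function names are hypothetical. It differs from the two queries in one deliberate way, which is an assumption of this sketch rather than part of the source workflow: it compares the two most recent complete, adjacent minutes and counts a minute with no errors as zero.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Tuple


def minute_start(ts: int) -> int:
    """Truncate a Unix timestamp in seconds to the start of its minute."""
    return ts - ts % 60


def error_spike(
    records: Iterable[Mapping[str, Any]],
    now_ts: int,
    threshold: int,
) -> Tuple[bool, int]:
    """Return (alert fires, delta) for the last two complete minutes."""
    latest = minute_start(now_ts) - 60   # last complete minute
    previous = latest - 60               # the minute before it
    errors = Counter(
        minute_start(r["time"]) for r in records if r["http_code"] >= 400
    )
    # Counter returns 0 for minutes that had no errors at all.
    delta = errors[latest] - errors[previous]
    return delta > threshold, delta
```

Skipping the in-progress minute avoids comparing a partial count with a full one, at the cost of up to one extra minute of detection delay.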

Route the notification to the operations channel that handles CDN incidents. Include the domain, URL, client IP, and HTTP code distribution so the responder can decide whether the issue is global, domain-specific, or resource-specific.

Build CDN Dashboards for Performance and Quality

The dashboard layer should turn log data into a small set of recurring questions.

| Dashboard panel | Suggested dimensions | Question it answers |
| --- | --- | --- |
| Health score or overall status | time range, domain | Is the CDN service healthy right now? |
| Cache hit ratio | hit, host | Are requests served from cache as expected? |
| Average downstream bandwidth | rsp_size, time bucket, domain | Is bandwidth pressure rising during the traffic peak? |
| HTTP status distribution | http_code, time bucket | Are 4xx or 5xx responses increasing? |
| Top URLs by traffic or errors | url, rsp_size, http_code | Which resource is driving load or failure? |
| Client and regional distribution | client_ip, prov, isp | Is the issue concentrated by geography or ISP? |

The visual workflow in CLS supports query results, statistic cards, line charts, bar charts, and distribution views. A useful dashboard combines one top-level health view with drill-down panels for cache, bandwidth, status codes, and affected resources.
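
Two of these panels reduce to simple arithmetic, sketched below in Python for clarity. The sketch assumes that `rsp_size` is measured in bytes (the field table does not state a unit) and that `hit` holds `HIT` or `MISS` in any letter case. The function names and sample records are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def hit_ratio_by_host(records: Iterable[Mapping[str, Any]]) -> Dict[str, float]:
    """Fraction of requests per host whose `hit` field is HIT."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["host"]] += 1
        if str(r["hit"]).upper() == "HIT":
            hits[r["host"]] += 1
    return {host: hits[host] / totals[host] for host in totals}


def avg_bandwidth_mbps(
    records: Iterable[Mapping[str, Any]], window_seconds: int
) -> float:
    """Average downstream bandwidth in Mbit/s, assuming rsp_size is in bytes."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total_bytes = sum(r["rsp_size"] for r in records)
    return total_bytes * 8 / window_seconds / 1_000_000


# Synthetic one-minute sample for illustration only.
sample = [
    {"host": "a.example.com", "hit": "HIT", "rsp_size": 2_500_000},
    {"host": "a.example.com", "hit": "HIT", "rsp_size": 2_500_000},
    {"host": "a.example.com", "hit": "hit", "rsp_size": 1_500_000},
    {"host": "a.example.com", "hit": "MISS", "rsp_size": 1_000_000},
]
ratios = hit_ratio_by_host(sample)          # 3 of 4 requests hit the cache
bandwidth = avg_bandwidth_mbps(sample, 60)  # 7.5 MB in 60 s
```

Computing the ratio per host rather than globally keeps one low-hit domain from being hidden by a large, well-cached one.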

Incident Runbook

When a CDN traffic peak starts, use this order:

  1. Check p99 request_time, not only average latency.
  2. Compare the current minute error count with the previous minute.
  3. Split errors by host, url, and http_code.
  4. Check hit to understand whether cache behavior changed.
  5. Review rsp_size and traffic volume to separate traffic growth from error growth.
  6. Add client_ip, prov, and isp when the issue may be regional or network-specific.
  7. Turn the final query into an alert if it represents a recurring operational risk.
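
Steps 3 and 6 are the same grouping operation applied to different dimensions. A minimal Python sketch of that idea follows, assuming records are dictionaries keyed by the field names above; `split_errors` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping, Sequence, Tuple


def split_errors(
    records: Iterable[Mapping[str, Any]],
    dimensions: Sequence[str],
    top_n: int = 5,
) -> List[Tuple[Tuple[Any, ...], int]]:
    """Count responses with http_code >= 400, grouped by the given fields."""
    errors = (r for r in records if r["http_code"] >= 400)
    counts = Counter(tuple(r[d] for d in dimensions) for r in errors)
    return counts.most_common(top_n)


# Synthetic records for illustration only.
video_error = {"host": "a.example.com", "url": "/video.mp4", "http_code": 502,
               "prov": "Guangdong", "isp": "ISP-A"}
page_error = {"host": "a.example.com", "url": "/index.html", "http_code": 404,
              "prov": "Beijing", "isp": "ISP-B"}
ok = {"host": "a.example.com", "url": "/index.html", "http_code": 200,
      "prov": "Beijing", "isp": "ISP-B"}
sample = [video_error] * 3 + [page_error] + [ok] * 2

by_resource = split_errors(sample, ("host", "url", "http_code"))  # step 3
by_network = split_errors(sample, ("prov", "isp"))                # step 6
```

If one resource or one province and ISP pair dominates the result, the incident is likely localized rather than global.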

FAQ

Why not rely only on CDN default metrics?

Default metrics are useful for broad visibility, but log analysis is better when the team needs custom dimensions such as URL, client IP, referer, ISP, cache hit state, or request-level latency.

Why use p99 latency for alerting?

p99 preserves tail latency. Averages can look normal while a meaningful subset of users experiences slow responses.

What fields should be included in alert messages?

Use the affected host, url, client_ip, http_code, and latency metric. These fields help responders judge impact and choose the next query.