Build a Large-Scale Observability Platform on Tencent Cloud CLS: The Beike Case

Large observability migrations fail when they treat log storage, metrics, tracing, dashboards, permissions, and user habits as separate projects. Beike faced that problem at large scale: many business departments, many legacy components, daily settlement traffic bursts, more than 50 billion log records in large business scenarios, and thousands of users who already relied on an internal operations platform.

Tencent Cloud CLS helped Beike build a unified observability platform that preserved existing collection habits where possible, improved ingestion latency, structured heterogeneous logs, supported large-scale search and analysis, and delivered dashboards through a shareable operations console. The result was a migration pattern that answered a practical question many platform teams ask: how do you move a complex observability stack to the cloud without breaking the way business teams already work?

Why Beike Needed a New Observability Platform

Beike's previous setup had three pressure points:

Problem	Operational impact
Data was not connected	Logs, metrics, traces, and other observability data came from many systems and could not support unified operations.
Performance could not keep up	Daily settlement traffic could increase data volume by more than 10x, causing write latency and query timeout problems.
Data was hard to use	Log formats were inconsistent, aggregation functions were insufficient, and open-source dashboards were hard to share across teams.

The target was not just "move logs to another backend." Beike needed a high-performance, reliable, elastic observability platform that could ingest multiple data sources, process data efficiently, and provide flexible visual analysis.

Target Architecture

The unified platform brought several operational data types into CLS:

Data type	Migration approach
Business logs	Continue using Fluentd where possible and send data through protocol-compatible paths.
Security logs	Collect into CLS for analysis, aggregation, and alerting.
Tracing data	Preserve compatibility with existing SkyWalking-oriented logic while moving query and storage workflows to CLS.
Metrics	Connect Prometheus and related metric workflows into the unified observability platform.
TKE audit and cloud-product logs	Use cloud-side log collection and analysis capabilities.
Dashboards and reports	Build operational views, share them through DataSight, and support multi-terminal access.

This architecture reduced the need to rewrite every business team's collection logic at once. Instead, the migration focused on compatible data entry points, centralized processing, and reusable visualization.

Ingestion: Preserve Existing Collectors and Reduce Write Latency

The first major bottleneck was write latency during traffic peaks. Beike's settlement windows could increase traffic dramatically, and delayed log reporting made same-day checks and incident diagnosis less efficient.

The first instinct was to expand cloud resources, but further analysis found that the bottleneck was mainly in the rdkafka component used by Fluentd's Kafka output. Parameter tuning alone was no longer enough at Beike's scale.

The CLS team developed a Fluentd Output plugin, released to the community, to improve the data transmission path. After the change, the delay from data generation to searchability dropped from more than ten minutes to within one minute. The figures in the source material also show peak write throughput of around 300 GB per minute during burst scenarios.
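One way to check a change like this is to measure the gap between when a record is generated and when it becomes searchable. The sketch below is a minimal illustration under assumptions, not Beike's or the CLS team's tooling: the function names, the sample timestamps, and the 60-second target are hypothetical, and collecting the "searchable" timestamp depends on your own probes.

```python
import math
from datetime import datetime, timedelta


def ingestion_lags(samples):
    """Return the lag in seconds for each (generated_at, searchable_at) pair."""
    return [(searchable - generated).total_seconds() for generated, searchable in samples]


def percentile(values, pct):
    """Nearest-rank percentile; pct is in the range 0-100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical probe results: four records searchable in under a minute, one straggler.
base = datetime(2024, 1, 1, 12, 0, 0)
samples = [(base, base + timedelta(seconds=s)) for s in (20, 35, 41, 58, 640)]

lags = ingestion_lags(samples)
p50 = percentile(lags, 50)
p95 = percentile(lags, 95)
late = [lag for lag in lags if lag > 60]  # records that missed the assumed 60 s target
print(f"p50={p50}s p95={p95}s late={len(late)}")
```

Tracking a high percentile next to the median makes peak-time stragglers visible even when most records arrive quickly.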

Protocol-Compatible Migration

Beike already had many business departments, collection rules, and custom data flows. Replacing collectors across every team would have created migration risk. The more practical pattern was to keep existing collection logic where possible and change the upload target.

Existing system	Migration pattern
ES and Loki log workflows	Continue using Fluentd and send data through a Kafka-compatible upload path.
SkyWalking tracing workflows	Use compatible ingestion and query paths for tracing data.
Metrics workflows	Connect Prometheus-oriented metrics into the unified platform.
Custom modules	Use API, agents, SDKs, or cloud-product log collection depending on the data source.

This approach let network logs, security logs, tracing, and other operational modules converge into CLS while reducing the amount of collection-side change required from business teams.
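The "keep the collection logic, change the upload target" idea can be sketched as a configuration transform. This is only an illustration: the dictionary layout, key names, and placeholder endpoint below are hypothetical and do not reflect real Fluentd syntax or actual CLS endpoints.

```python
import copy


def retarget_output(pipeline, new_output):
    """Return a copy of the pipeline with only its output section replaced."""
    updated = copy.deepcopy(pipeline)
    updated["output"] = dict(new_output)
    return updated


# Hypothetical legacy pipeline: the source and filters belong to the business team.
legacy = {
    "source": {"type": "tail", "path": "/var/log/app/*.log"},
    "filters": [{"type": "parser", "format": "json"}],
    "output": {"type": "kafka", "brokers": "legacy-kafka.internal:9092", "topic": "app-logs"},
}

# Placeholder target; real values come from your platform's Kafka-compatible upload settings.
migrated = retarget_output(
    legacy,
    {"type": "kafka", "brokers": "<kafka-compatible-endpoint>", "topic": "<target-topic>"},
)
```

Only the `output` section differs between `legacy` and `migrated`, which is why this pattern keeps the change needed from each team small.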

Processing: Structure Heterogeneous Logs Before Storage

Large organizations rarely have one clean log format. Beike's business departments produced many different log structures, so the platform needed configurable parsing rules that could be delegated to individual teams.

CLS provided a data-processing channel before logs were written into topics. Teams could structure raw logs through a visual processing canvas. In the example workflow, business logs were first split by delimiters and then parsed with regular expressions to extract fields.

This matters for operations because structured fields make logs easier to search, aggregate, and alert on.
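The split-then-regex workflow described above can be sketched in a few lines. The log layout, field names, and regular expression here are hypothetical examples, not Beike's actual formats, and the sketch stands in for what the visual processing canvas configures rather than reproducing it.

```python
import re

# Hypothetical log layout: timestamp | level | service | free-text message
FIELD_NAMES = ("timestamp", "level", "service", "message")
MESSAGE_PATTERN = re.compile(r"order_id=(?P<order_id>\d+)\s+cost_ms=(?P<cost_ms>\d+)")


def parse_line(line, delimiter="|"):
    """Split a raw line by delimiter, then regex-parse the message field."""
    parts = line.rstrip("\n").split(delimiter, len(FIELD_NAMES) - 1)
    if len(parts) != len(FIELD_NAMES):
        return None  # unparseable lines are left for a fallback path
    record = dict(zip(FIELD_NAMES, (part.strip() for part in parts)))
    match = MESSAGE_PATTERN.search(record["message"])
    if match:
        record["order_id"] = match.group("order_id")
        record["cost_ms"] = int(match.group("cost_ms"))
    return record


sample = "2024-01-01T12:00:00Z | INFO | settlement | settle ok order_id=42 cost_ms=135"
parsed = parse_line(sample)
print(parsed)
```

Returning `None` for lines that do not match the expected layout keeps malformed records out of the structured fields without dropping them silently upstream.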

Scheduled SQL and Tiered Storage for Long-Term Analysis

High-volume log retention creates two problems:

Long storage periods increase cost.
Direct aggregation over raw logs becomes inefficient.

Beike used CLS scheduled SQL and hybrid storage to address both. Hybrid storage let recent hot data stay available for analysis while older cold data moved into a lower-frequency tier that still supported queries. Scheduled SQL turned complex raw logs into compact long-term result data.
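When planning retention, it can help to reason about which records would fall into each tier. The sketch below assumes a simple age threshold; the 7-day value and the function name are hypothetical, and the storage service performs the real tiering on its own side.

```python
from datetime import datetime, timedelta, timezone


def storage_tier(record_time, now, hot_days=7):
    """Classify a record as 'hot' or 'cold' by age; hot_days is an assumed threshold."""
    age = now - record_time
    return "hot" if age <= timedelta(days=hot_days) else "cold"


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
recent = storage_tier(datetime(2024, 1, 30, tzinfo=timezone.utc), now)
old = storage_tier(datetime(2024, 1, 1, tzinfo=timezone.utc), now)
print(recent, old)
```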

For security logs, Windows event logs from employee office environments were collected into CLS. Beike configured more than 1,000 SQL rules to aggregate results by rule name, alert level, host name, and other dimensions. The scheduled SQL logic generated summary records every minute, helping the security team monitor status, power dashboards, and trigger alerts while reducing the amount of raw data needed for long-term analysis.
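The per-minute aggregation can be illustrated with a small in-memory rollup. This is a sketch of the idea in Python rather than the scheduled SQL that Beike ran; the event field names and sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime


def minute_rollup(events):
    """Count events per (minute, rule_name, alert_level, host_name)."""
    counts = Counter()
    for event in events:
        minute = event["time"].replace(second=0, microsecond=0)
        key = (minute, event["rule_name"], event["alert_level"], event["host_name"])
        counts[key] += 1
    return [
        {"minute": m, "rule_name": r, "alert_level": a, "host_name": h, "event_count": n}
        for (m, r, a, h), n in sorted(counts.items())
    ]


# Hypothetical raw events: two share a minute and key, one falls in the next minute.
events = [
    {"time": datetime(2024, 1, 1, 9, 0, 5), "rule_name": "failed_login", "alert_level": "high", "host_name": "pc-001"},
    {"time": datetime(2024, 1, 1, 9, 0, 40), "rule_name": "failed_login", "alert_level": "high", "host_name": "pc-001"},
    {"time": datetime(2024, 1, 1, 9, 1, 2), "rule_name": "failed_login", "alert_level": "high", "host_name": "pc-001"},
]

summary = minute_rollup(events)
for row in summary:
    print(row)
```

Three raw events collapse into two summary rows, which is the property that makes long-term retention of the summaries cheaper than keeping every raw record.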

Search Performance at 50 Billion-Record Scale

As Beike's data volume grew, the previous system slowed down. In peak business scenarios, searching across more than 50 billion log records could take minutes.

After switching to CLS, a single real-time search over more than 50 billion log records took about 10 seconds on average. The source material reports a search-efficiency improvement of more than 6x compared with the original system.

For operators, that difference changes behavior. A query that takes minutes becomes something teams can run repeatedly during diagnosis, narrowing the problem faster instead of waiting between each hypothesis.

Dashboards, Sharing, and DataSight

The previous dashboard setup had two issues: the available visualization types were relatively fixed, and sharing through domestic office collaboration tools was inconvenient. Grafana-style views were useful on PC consoles but less convenient for cross-team communication and multi-terminal reporting.

After data was collected into CLS, Beike configured multiple operational dashboards:

Dashboard use case	What it helped with
Real-time network board	Track access volume, interception status, upstream and downstream traffic trends, and related network metrics.
Operations dashboard	View multi-dimensional business statistics and drill into related analysis.
Multi-terminal sharing	Share monitoring views across PC, mobile, email, and office collaboration tools.

The DataSight independent console supported embedded access, internal and external network access modes, custom account login, independent log entry points, and LDAP-based permission integration.

Access Control for Thousands of Users

Before the migration, thousands of R&D users already worked in Beike's internal operations platform. Creating Tencent Cloud accounts for every user would have been impractical, and the platform still needed to preserve data isolation and existing permission boundaries.

DataSight addressed this by allowing Beike to embed the console into the original system, keep existing user habits, and connect to LDAP for permission management. Users could continue accessing the platform through familiar entry points while seeing only the resources they were allowed to access.
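The "see only what you are allowed to access" behavior can be sketched as a group-to-resource filter. The group names, resource names, and ACL structure below are hypothetical; the sketch assumes the user's groups have already been resolved from LDAP and does not show DataSight's actual permission model.

```python
def visible_resources(user_groups, resource_acl):
    """Return the resource names whose allowed groups overlap the user's groups."""
    groups = set(user_groups)
    return sorted(name for name, allowed in resource_acl.items() if groups & set(allowed))


# Hypothetical mapping from log resources to the directory groups allowed to view them.
resource_acl = {
    "settlement-logs": ["team-settlement", "sre"],
    "security-events": ["security"],
    "network-board": ["sre", "network"],
}

sre_view = visible_resources(["sre"], resource_acl)
security_view = visible_resources(["security"], resource_acl)
print(sre_view, security_view)
```

Filtering by existing directory groups is what lets a platform avoid issuing a separate cloud account to each user while still preserving data isolation.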

Outcome Metrics

Result	Source-reported outcome
Business module onboarding	More than 1,000 business modules were connected to CLS in one person-day.
Ingestion latency	Data generation to searchability dropped from more than ten minutes to within one minute.
Overall efficiency	Business efficiency improved by 20x.
Search performance	More than 50 billion log records could be searched in about 10 seconds on average.
Search improvement	Search efficiency improved by more than 6x over the original system.
User migration	3,000+ users could switch with minimal disruption to existing habits.

Migration Checklist

Inventory logs, metrics, traces, dashboards, alerts, and permission models together.
Identify collection paths that can be preserved through protocol-compatible upload.
Test ingestion bottlenecks before assuming more cloud resources will solve latency.
Structure raw logs before storage so search, aggregation, and alerting become reliable.
Use scheduled SQL to turn high-volume raw logs into compact long-term results.
Separate hot analysis data from cold retention data when cost and query patterns differ.
Plan dashboard sharing and access control before user migration.
Preserve user habits where possible, especially for large internal operations platforms.

FAQ

How did Beike reduce log ingestion delay?

The bottleneck was mainly in the rdkafka component used by Fluentd's Kafka output. The CLS team developed a Fluentd Output plugin to improve the transmission path, reducing the delay from more than ten minutes to within one minute.

Why not replace every collector during the migration?

Replacing collectors across many business departments would have increased risk and workload. Beike used compatible upload paths so teams could often keep existing collection logic and change the transmission target instead.

How did CLS help with long-term security-log analysis?

Beike used scheduled SQL to aggregate large volumes of Windows event logs into summary records every minute. More than 1,000 SQL rules grouped results by dimensions such as rule name, alert level, and host name.

What made the dashboard layer important?

The migration was not complete until teams could view, share, and act on the data. DataSight provided embedded access, multi-terminal sharing, custom login, and LDAP integration so existing users could keep familiar workflows.

What is the main lesson for other observability migrations?

Treat migration as a platform redesign, not a storage swap. Ingestion compatibility, parsing, scheduled aggregation, search performance, dashboards, and permissions all need to be planned together.
