Why Time-Ordered Indexes Make Log Search Faster Than Plain Lucene

Log data is time-series data with text-search expectations. Every record has a timestamp, and almost every operational query asks for a time range. Plain Lucene is excellent at text search, but an engineering analysis highlights a core mismatch: timestamp range search over high-cardinality values can force the engine to touch a very large number of index terms.

The Tencent Cloud CLS (Cloud Log Service) team built a time-series search engine on top of a traditional search-engine foundation. The reported result was a major speedup for large-scale log retrieval: up to 38x for forward search, 24x for reverse search, and 7.6x for histogram search in the reported experiments, with additional offline and online tests showing larger speedups in some scenarios.

The reported optimization stages can be summarized by the query pattern they target:

Query pattern	Baseline problem	Reported improvement direction
Head query / forward retrieval	Range lookup touches too many timestamp terms before returning earlier records.	Time ordering and supporting indexes reduce unnecessary scanning; the reported acceleration reaches 38x.
Tail query / reverse retrieval	One-way iteration makes reverse access walk toward the tail too slowly.	Reverse binary-search based iteration avoids full traversal; the reported acceleration reaches 24x.
Histogram query	Bucket calculation repeatedly looks up timestamps for matched records.	Bucket boundaries replace per-record timestamp lookup; the reported acceleration reaches 7.6x.

The basic log-search model

A log event may contain a timestamp, text, and attributes:

[2021-09-28 10:10:39T1234] [ip=192.168.1.1]
XXXXXXXX

A log system normally indexes the timestamp, attributes, and tokenized text. A typical query also specifies a time range:

select *
from xxxx_index
where ip = xxxx
  and timestamp >= '2021-09-28'
  and timestamp <= '2021-09-29';

Lucene assigns each record a document ID and builds inverted indexes from values to posting lists. For exact lookup, that model is efficient. For timestamp range lookup, the timestamp field can become the problem.
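
The difference between exact lookup and range lookup can be illustrated with a toy inverted index. This is a minimal sketch, not Lucene's actual data structures: the names `build_inverted_index`, `exact_lookup`, and `range_lookup` are hypothetical, and the timestamps are small integers chosen for readability.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each (field, value) pair to a posting list of document IDs."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            index[(field, value)].append(doc_id)
    return index

def exact_lookup(index, field, value):
    """An exact match reads a single posting list."""
    return index.get((field, value), [])

def range_lookup(index, field, lo, hi):
    """A range match visits every distinct term in [lo, hi] and merges the lists."""
    terms = sorted(v for (f, v) in index if f == field and lo <= v <= hi)
    doc_ids = sorted(d for t in terms for d in index[(field, t)])
    return doc_ids, len(terms)

docs = [
    {"ip": "192.168.1.1", "timestamp": 1000},
    {"ip": "192.168.1.2", "timestamp": 1001},
    {"ip": "192.168.1.1", "timestamp": 1002},
    {"ip": "192.168.1.1", "timestamp": 1003},
]
index = build_inverted_index(docs)
ip_hits = exact_lookup(index, "ip", "192.168.1.1")
range_hits, terms_touched = range_lookup(index, "timestamp", 1001, 1003)
```

In this toy model the work for the range query grows with the number of distinct timestamp terms inside the range, which is the pressure the rest of the article is about.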

Why high-cardinality timestamp ranges are hard

High cardinality means a field has many distinct values. A millisecond timestamp has 24 * 60 * 60 * 1000 possible values in one day, or 86.4 million possible timestamp terms. At microsecond precision, the theoretical value count grows by another factor of 1000.

At production scale, the cost becomes obvious: if a 10-billion-row log index has roughly 30 GB of timestamp index data, reading that data at 100 MB/s would take about 300 seconds before the search can do useful work.
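
The figures above can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 1000 MB), which is what makes the 300-second estimate come out exactly; the variable names are illustrative only.

```python
# Distinct timestamp values that can occur in one day at each precision.
MS_PER_DAY = 24 * 60 * 60 * 1000      # millisecond precision
US_PER_DAY = MS_PER_DAY * 1000        # microsecond precision

# Time to read the timestamp index sequentially (decimal units assumed).
index_bytes = 30 * 1000**3            # 30 GB of timestamp index data
throughput_bytes_per_s = 100 * 1000**2  # 100 MB/s
scan_seconds = index_bytes / throughput_bytes_per_s
```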

Search shape	Plain inverted-index pressure
Exact timestamp lookup	Find one posting list quickly.
Time-range lookup	Potentially scan many timestamp terms in the range.
Reverse retrieval by time	May need to reach the tail of ordered data.
Histogram over time	May need timestamp lookups for many matched records.

Core idea: make timestamp order part of the storage layout

The time-series search engine changes the organization of log data by sorting events by timestamp. In the original unordered layout, a time-range query may need to process hundreds of thousands to hundreds of millions of timestamp index terms. With timestamp ordering, the records that match a time range form one contiguous run, so the query only needs to find the two range endpoints.

That turns the core range-location task from scanning a large range of timestamp terms into locating two boundaries.
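
Boundary location over an ordered column can be sketched with Python's standard `bisect` module. This assumes the log ID of a record is simply its position in the sorted timestamp column; `locate_range` is a hypothetical helper, not a CLS function.

```python
from bisect import bisect_left, bisect_right

def locate_range(sorted_timestamps, start, end):
    """Return the half-open log-ID interval [lo, hi) covering start <= ts <= end."""
    lo = bisect_left(sorted_timestamps, start)   # first record with ts >= start
    hi = bisect_right(sorted_timestamps, end)    # one past the last record with ts <= end
    return lo, hi

# Log ID == position in the timestamp-sorted column.
timestamps = [100, 105, 105, 110, 120, 130, 130, 145]
lo, hi = locate_range(timestamps, 105, 130)
matching_ids = range(lo, hi)
```

Two binary searches replace a scan over every timestamp term in the range, regardless of how many records fall between the endpoints.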

Optimization 1: reduce random disk access in binary search

A binary search over ordered column data is efficient in memory, but it can trigger scattered disk access when the ordered data lives on disk. The time-series search design adds a secondary index so endpoint lookup needs far fewer disk reads. The reported reduction is from dozens of disk accesses to 3.
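
The source does not describe the secondary index in detail, so the following is only one plausible shape for it: a sparse in-memory index holding the first timestamp of each disk block. `BlockStore` is a hypothetical stand-in for block-structured storage that counts reads and has no cache, and the read counts it produces illustrate the mechanism rather than reproduce the reported numbers.

```python
from bisect import bisect_left

class BlockStore:
    """Hypothetical sorted timestamp column stored in fixed-size disk blocks."""
    def __init__(self, values, block_size):
        self.blocks = [values[i:i + block_size]
                       for i in range(0, len(values), block_size)]
        self.block_size = block_size
        self.length = len(values)
        self.reads = 0  # simulated disk reads (no caching)

    def read_block(self, block_no):
        self.reads += 1
        return self.blocks[block_no]

def plain_lower_bound(store, target):
    """Binary search directly on the column; every probe reads a block."""
    lo, hi = 0, store.length
    while lo < hi:
        mid = (lo + hi) // 2
        block = store.read_block(mid // store.block_size)
        if block[mid % store.block_size] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

def indexed_lower_bound(store, first_keys, target):
    """Search the in-memory sparse index first, then read a single block."""
    block_no = bisect_left(first_keys, target) - 1
    if block_no < 0:
        return 0  # target is not greater than the very first timestamp
    block = store.read_block(block_no)
    return block_no * store.block_size + bisect_left(block, target)

values = list(range(0, 200_000, 2))
store = BlockStore(values, block_size=1_000)
first_keys = [block[0] for block in store.blocks]  # built once, kept in memory

plain_pos = plain_lower_bound(store, 123_457)
plain_reads = store.reads

store.reads = 0
indexed_pos = indexed_lower_bound(store, first_keys, 123_457)
indexed_reads = store.reads
```

Both functions return the same position; the difference is that the indexed version does its searching in memory and touches the disk only for the block that contains the boundary.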

Optimization 2: support reverse access on top of one-way iterators

The existing lower-level iterators only moved in one direction. In reverse search, timestamp ordering places the newest matching data near the tail, but a one-way iterator would need to traverse all data to get there.

The time-series search design uses a reverse binary-search algorithm to reach tail data quickly. The reported iteration count drops from tens of thousands or hundreds of thousands to dozens, and the complexity changes from O(n) to O(log n * log n).
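
The source does not spell out the reverse algorithm, so the sketch below is an assumption about how such a search could work: binary search over the doc-ID space, where each probe opens a fresh forward iterator and calls `advance`. If `advance` costs O(log n) and the search needs O(log n) probes, the total is O(log n * log n). `ForwardIterator`, `find_last_doc`, and `NO_MORE_DOCS` are hypothetical names modeled loosely on forward-only posting iterators.

```python
from bisect import bisect_left

NO_MORE_DOCS = 2**31 - 1  # sentinel meaning "iterator exhausted"

class ForwardIterator:
    """Hypothetical one-way iterator over a sorted posting list."""
    def __init__(self, postings):
        self._postings = postings
        self._pos = 0

    def advance(self, target):
        """Jump to the first doc ID >= target; the position never moves backwards."""
        self._pos = bisect_left(self._postings, target, self._pos)
        if self._pos == len(self._postings):
            return NO_MORE_DOCS
        return self._postings[self._pos]

def find_last_doc(new_iterator, max_doc):
    """Find the largest doc ID using only forward advance() calls.

    Returns (last_doc_or_None, number_of_probes).
    """
    lo, hi = 0, max_doc   # the last doc, if any, lies in [lo, hi)
    last, probes = None, 0
    while lo < hi:
        mid = (lo + hi) // 2
        doc = new_iterator().advance(mid)  # fresh iterator for every probe
        probes += 1
        if doc == NO_MORE_DOCS:
            hi = mid          # nothing at or after mid: look left
        else:
            last = doc        # a doc exists at or after mid: look right of it
            lo = doc + 1
    return last, probes

postings = list(range(0, 1_000_000, 7))
last_doc, probes = find_last_doc(lambda: ForwardIterator(postings), 1_000_000)
```

A plain forward walk would visit every entry in `postings` before reaching the tail; here the number of probes is bounded by the logarithm of the doc-ID space.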

Optimization 3: calculate histograms with bucket boundaries

Histogram queries are common in log operations. The slow approach is to look up the timestamp for every matched log record. The optimized approach uses the bucket boundaries to determine the log-ID range of each bucket, then assigns each matched record to a bucket by comparing its log ID with those boundaries.

For a histogram over [t0, t1], split the time range into bucket boundaries first. Each boundary lookup returns the corresponding log-ID position, so the engine can map a continuous log-ID interval to a time bucket. Internal records no longer need a separate timestamp lookup one by one; only the boundary records drive the bucket assignment.
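
A minimal sketch of the boundary-based histogram follows. It assumes that log IDs are positions in the timestamp-sorted column, that buckets are half-open intervals over [t0, t1), and that `matched_ids` is the set of log IDs that already passed the non-time filters; the function name and the sample data are illustrative only.

```python
from bisect import bisect_left, bisect_right

def histogram_by_boundaries(sorted_timestamps, matched_ids, t0, t1, width):
    """Count matched records per time bucket without per-record timestamp lookups."""
    edges = list(range(t0, t1, width))  # lower edge of each bucket
    # One boundary lookup per bucket edge gives the log-ID position of that edge.
    bounds = [bisect_left(sorted_timestamps, e) for e in edges]
    bounds.append(bisect_left(sorted_timestamps, t1))
    counts = [0] * len(edges)
    for doc_id in matched_ids:
        if bounds[0] <= doc_id < bounds[-1]:
            # Compare the log ID with the boundary positions only.
            bucket = bisect_right(bounds, doc_id) - 1
            counts[bucket] += 1
    return list(zip(edges, counts))

timestamps = [100, 105, 105, 110, 120, 130, 130, 145]  # log ID == position
matched_ids = [1, 2, 4, 6, 7]
result = histogram_by_boundaries(timestamps, matched_ids, 100, 150, 10)
```

The timestamp column is consulted once per bucket edge; matched records are placed by comparing integers against the small in-memory list of boundary positions.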

Reported performance results

The benchmark includes a read-only offline prototype test over 8 million records at 100 concurrent requests. The new algorithm improved response time by roughly 50x, from 56.9 seconds to 1.059 seconds. With response time held below 1 second, sustainable concurrency improved by more than 20x, from 4 concurrent requests to more than 90.

The reported benchmark figures, collected in one place:

Test condition	Baseline	Optimized result	Change
Read-only prototype, 8 million records, 100 concurrent requests	56.9 seconds	1.059 seconds	50x faster response.
Response time kept below 1 second	4 concurrent requests	More than 90 concurrent requests	20x higher concurrency.
Online data with concurrent writes	Core operations affected by IO long-tail jitter	Core operations complete more than 10x faster after IO optimization	More stable online query performance.

The online test also includes concurrent writes. Because distributed writes can introduce 2-3 second long-tail IO jitter, the team optimized IO before comparing online query performance. The reported online result was more than 10x faster across core operations.
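
The rounded multipliers can be recomputed from the raw figures quoted above. The sketch uses 90 as the optimized concurrency, which is a lower bound because the source says "more than 90".

```python
baseline_seconds = 56.9
optimized_seconds = 1.059
response_speedup = baseline_seconds / optimized_seconds   # reported as 50x

baseline_concurrency = 4
optimized_concurrency = 90   # lower bound: source says "more than 90"
concurrency_gain = optimized_concurrency / baseline_concurrency  # reported as 20x
```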

Why minute-level indexing is not equivalent

The comparison section highlights a key design tradeoff. A competing log service using minute-level indexing only has 24 * 60 = 1440 index items per day, which avoids the full timestamp-cardinality problem but weakens finer-grained search and sorting. CLS supports microsecond-level indexing; theoretically, one day can contain 24 * 60 * 60 * 1000 * 1000 = 86,400,000,000 timestamp values.

The reported comparison uses a 1-billion-row data scenario and notes two caveats. First, data that is already mostly ordered at minute granularity or coarser is a common scenario, and the difference is smaller there. Second, second-level sorting on the compared service may produce inaccurate results for large datasets.

The comparison can be reduced to the indexing tradeoff:

Design choice	Index item count per day	Consequence
Minute-level timestamp index	`24 * 60 = 1,440` index items	Lower cardinality, but weaker fine-grained search and sorting.
Microsecond-level timestamp index	`24 * 60 * 60 * 1000 * 1000 = 86,400,000,000` possible timestamp values	Much finer precision, but it requires a dedicated time-series index design to stay fast.

Engineering takeaways

Timestamp is not just another numeric field in a log system; it is the dominant access path.
For log search, range queries, reverse queries, and histograms deserve storage-layout support.
Ordered timestamps reduce range search to boundary discovery.
Secondary indexes can make boundary discovery disk-friendly.
Histogram optimization should avoid per-row timestamp lookup when bucket boundaries can identify ranges.
Fine-grained timestamp precision and high query speed require a design that addresses high cardinality directly.

FAQ

Why is Lucene alone not ideal for log timestamp range search?

The analysis argues that Lucene is strong at text search, but timestamp range queries over high-cardinality values can require scanning many timestamp terms, which becomes expensive at log scale.

What is the central CLS optimization?

CLS orders log data by timestamp and adds implementation work around boundary lookup, reverse traversal, and histogram buckets so common log queries avoid scanning huge timestamp ranges.
