For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Sign inTry it free
DocsGuidesSDKsIntegrationsAPI docsTutorialsFlagship blog
DocsGuidesSDKsIntegrationsAPI docsTutorialsFlagship blog
  • Flagship blog
    • 52 Blog Posts, Claude, 3 Prompts, Under an Hour
    • Shipping from Oakland: An Observability Hackathon Recap
    • Day 12 | New Year, New Observability
    • Day 11 | What engineering teams really want from Observability
    • Day 10 | Why observability and feature flags go together like milk and cookies
    • Day 9 | The Three Ghosts Haunting Your AI This Holiday Season
    • Day 8 | Observable Multi-Modal Agentic Systems
    • Day 7 | SLOs that actually drive decisions
    • Day 6 | Stop cardinality from stealing your cloud budget
    • Day 5 | Using a Popular Tidying Method to Consolidate Your Observability Stack
    • Day 4 | Tracing the impact of feature flags in your Node.js app
    • Day 3 | Zero-Config Observability with OpenTelemetry
    • Day 2 | Why AI agents need three layers of observability
    • Day 1 | Observability Under the Tree: What Changed in 2025
    • 5 takeaways from my first PyCon JP conference
    • Dungeons & Downtimes: XP gained from our adventure
    • Reverse Proxy for custom domains
    • Adventures in dogfooding: Guarded Releases
    • A quick tool for npm package scanning
    • My DEF CON 33 experience
    • Make every launch a big deal
    • Fun with JS streams
    • Moonshots XXII: Hack to the Future recap
    • A tale of three rate limiters
    • My good friend Claude
    • My approach to React app architecture in 2025
    • Data isolation with ClickHouse row policies
    • Ingest and Visualization for OpenTelemetry Metrics
    • Alert Evaluations: Incremental Merges in ClickHouse
    • Optimizing ClickHouse: The Tactics That Worked for Us
    • Migrating from OpenSearch to ClickHouse
    • Revamping Privacy Mode: A Better Way to Obfuscate Sensitive Data
    • An open-source session replay benchmark
    • LLM-based Grouping of Errors
    • Building GitHub Enhanced Stacktraces
    • Vercel Edge Runtime Support
    • Finding Interesting Sessions with Markov Chains
    • Building Logging Integrations at LaunchDarkly
    • The Network Request Details Panel
    • Using Github as a Headless CMS
    • Your Source Maps Should Be Public
    • Supporting Outside Contributions at LaunchDarkly
    • Managing our design tokens at LaunchDarkly
    • Our Commitment to OpenTelemetry
    • The 5 Best Logging Libraries for Ruby
    • InfluxDB: Visualizing Millions of Customers' Metrics using a Time Series Database
    • 8 Tips to Help You Maximize Chrome DevTools
    • The Debugging Process and Techniques for Web Applications (Part 2/2)
    • 5 Best Node.js Logging Libraries
    • What are rage clicks and how to detect them
    • 5 Best Practices for Maintaining a Clean ReactJS App
    • Is Kafka the Key? The Evolution of LaunchDarkly's Ingest
    • What Is Full Stack Monitoring and How Does It Work?
    • The beauty of contact-first API design
    • What is Frontend Monitoring and What Tools Help You Do It?
    • 5 strategies to monitor the health of your web application
    • Configuring OpenSearch for a Write-Heavy Workload
    • Maximizing Our Machines: Worker Pools At LaunchDarkly
Sign inTry it free
LogoLogo
On this page
  • Migrating from OpenSearch to ClickHouse
  • ClickHouse Advantages
  • Data Modeling
  • Keeping Data in Sync
  • Results
Flagship blog

Migrating from OpenSearch to ClickHouse

Was this page helpful?
Previous

Revamping Privacy Mode: A Better Way to Obfuscate Sensitive Data

Next
Built with

Published October 18, 2023

portrait of Zane Mayberry.

by Zane Mayberry

Migrating from OpenSearch to ClickHouse

In LaunchDarkly, users can search for sessions and errors based on their metadata - this includes user identifiers, visited URLs, custom events, and many other properties. LaunchDarkly’s search allows users to combine as many of these filters as needed to narrow down the sessions or errors they’re looking for. LaunchDarkly also supports different operator types, such as substring search, regex matching, inverting the results with the “not” operator, and numeric comparators.

LaunchDarkly’s primary data store is Postgres. When it comes to creating a session or error for the first time, or querying / updating its metadata by unique id, Postgres is performant enough. However, it is not as well suited to querying for records based on arbitrary attributes. Because LaunchDarkly’s search allows users to pick which fields and operators to use in their search, it wouldn’t be possible to design a Postgres index to generically speed up all queries, and as a result, searching in Postgres would require a full scan over all sessions / errors within a given time range. As some workspaces can ingest millions of sessions or errors per week, a worst case query that matches very few of these records could require scanning millions of rows before returning a result, which would be prohibitively slow.

We originally used OpenSearch as a solution to this. OpenSearch stores each field in an inverted index, so that a query can find all matching documents within each index and intersect the results. This works pretty well on exact keyword searches and was originally performant enough for the more complicated substring and regex searches, especially compared to full table scans in Postgres. However, as our customer base has grown, the time taken for certain searches increased dramatically. In our own workspace, substring searches such as sessions with a URL containing “app.highlight.io” were taking more than 15 seconds. Additionally, OpenSearch performance is dependent on the amount of memory available to the cluster, and supporting this was costly.

ClickHouse Advantages

ClickHouse is a column oriented database. A typical LaunchDarkly workspace may use 50 session field types, but if a query uses one or two of these to filter, ClickHouse will only load the relevant columns from storage. ClickHouse uses a sparse primary index to filter out large blocks of irrelevant records, and is optimized for quickly scanning through the remaining records. This aligns with our session and error searches, which enforce a date filter, sort the results by date, and use complicated queries like regex matching which cannot (in general) be improved by using an index.

ClickHouse also supports materialized views, which behave like write triggers in a traditional database. They can be used to aggregate data to speed up query performance. For our use case, we found these helpful for field autocomplete suggestions. In our original OpenSearch implementation, we recorded the first time a field was used, but nothing else. With ClickHouse, we could keep track of the number of times a field was used per day, then when suggesting fields, we could filter for only fields that exist in the search range and order them by decreasing frequency to improve our suggestions’ relevance.

We had already adopted ClickHouse for log storage and it seemed like the best candidate for an OpenSearch replacement. After a POC to test search performance in our own workspace, we decided to migrate all sessions and errors from OpenSearch to ClickHouse.

Data Modeling

There are some unique requirements with ClickHouse when compared to other databases. One of the big differences is that mutations are expensive. In our data model, when a session is “live”, session metadata can be updated many times as new events are recorded. To handle this in ClickHouse, we chose to use a ReplacingMergeTree and rewrite all session metadata whenever an update happens. At query time, we can use the FINAL keyword to ensure we are viewing the latest metadata.

Unlike most relational databases, in ClickHouse there’s a big difference between a table’s sparse primary index and secondary (data skipping) indexes. In order to get the best performance, queries should be able to use the primary index to filter out as many rows as possible. In LaunchDarkly’s searches, start and end dates are required input, and results are sorted based on their date, which intuitively translates to using the session / error date as a primary key (plus ID as a tiebreak). Searches for small time ranges will be very fast as results outside the time range can be filtered out quickly compared to the time it would take to scan over them.

One of OpenSearch’s biggest advantages is its ability to handle semi-structured data. In LaunchDarkly, users can arbitrarily add custom metadata to their sessions. In OpenSearch, we were able to add these as an array on each session, and querying on these was about as fast as querying based on a top-level field. This didn’t translate as well to ClickHouse. Using the Map data type, we could store metadata key-value pairs in a new column, but querying was slow for workspaces with lots of custom metadata, because loading the values for a single key requires loading the entire Map column, even though most keys aren’t needed. To get around this, we used a separate table and added the custom metadata key to the primary key. When a query uses this custom metadata, we can filter for only this key in a subquery, resulting in performance near what it would be if these keys were individual columns. One downside with this implementation is that the query requires all matching fields to be scanned before filtering out results from the sessions table. At our current scale, this has been performant enough, but if we have to revisit this in the future, ClickHouse supports alternate join algorithms like full sorting merge join which could avoid this.

Keeping Data in Sync

One of the problems to solve with any secondary data store is how to keep the information up-to-date with our Postgres source of truth. In our original implementation, we dual-wrote to OpenSearch whenever any session or error metadata was added or updated in LaunchDarkly. Because the Postgres update would happen before the OpenSearch update, in the case of a failure, there was a chance that data could be written to Postgres but not OpenSearch. To address this, we logged the ids of any failed sessions or errors so that we could retry these later.

This original approach had some issues - if a field was updated in more than one place concurrently, there was no guarantee that the latest update sent to OpenSearch matched the latest update in Postgres. To avoid race conditions, we would have needed to load the latest state of each session whenever it is updated, or maintain a sequence number in Postgres in order to reject any out-of-order OpenSearch updates.

We already use Kafka for events ingest, so we chose to piggyback on our existing infrastructure by creating a new Kafka topic to handle data syncing. When session or error metadata is updated, we write the id of that session or error to this new topic. Then we have a separate worker reading from this topic that loads a batch of these records in one bulk query from Postgres and writes one bulk insert to ClickHouse. With this architecture, there may be a delay between the Postgres and ClickHouse updates, but we’re guaranteed eventual consistency and able to retry in case of any failures.

Results

We had a lot of success with our new ClickHouse-based search backend. In our own workspace, complex queries (e.g. “visited URL contains ‘app.highlight.io’”) decreased from 15 seconds to less than 1 second. Using ClickHouse’s materialized views, we were able to improve the relevance of field suggestions by filtering out fields that haven’t been used in the time range and ordering them by frequency. Finally, we were able to decrease our monthly spend and simplify our architecture by terminating our OpenSearch cluster.