ClickHouse is an open-source, column-oriented database for online analytical processing (OLAP). Its standout trait is high performance, which comes from a combination of column-based storage and processing, data compression, and indexing.
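As a concrete illustration of those features (using a hypothetical rides table, not an actual production schema), a ClickHouse table declares a sorting key, which also serves as its sparse primary index, and can attach per-column compression codecs:

```sql
-- Hypothetical rides table (names are assumptions for illustration only).
-- ORDER BY builds ClickHouse's sparse primary index; CODEC selects a
-- per-column compression scheme on top of columnar storage.
CREATE TABLE rides
(
    ride_id     UInt64,
    region      LowCardinality(String),
    started_at  DateTime CODEC(DoubleDelta, LZ4),
    fare_usd    Decimal(10, 2)
)
ENGINE = MergeTree
ORDER BY (region, started_at);
```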
Initial Use Case
In 2020, while the data platform team was managing Druid, the marketplace team came to us with a new set of requirements:
- Newly produced data is available for querying in near real time
- Sub-second query latencies for business dashboarding
- Quick slice-and-dice of datasets (for example: how many rides in the last 2 hours in the SF region? A query sketch follows this list)
- Nested data support
- Support for both real-time and batch ingestion
- Native data deduplication at the destination
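To make the slice-and-dice and nested-data requirements concrete, here is a minimal sketch against the hypothetical rides table above (the column names are assumptions, not the team's actual schema):

```sql
-- Nested-data requirement: ClickHouse supports Nested columns natively.
ALTER TABLE rides ADD COLUMN waypoints Nested(lat Float64, lon Float64);

-- Slice-and-dice requirement: rides in the last 2 hours in the SF region.
SELECT count() AS rides_last_2h
FROM rides
WHERE region = 'SF'
  AND started_at >= now() - INTERVAL 2 HOUR;
```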
While the latest version of Druid offered some of these features, such as join support (added in v0.18), other requirements, such as deduplication at the destination, were not well satisfied. With our existing stack, we considered performing deduplication at the streaming layer instead of at the destination.
However, two main reasons prevented us from pursuing this idea:
- Deduplication needs to happen at the destination storage layer so that data arriving via the stream and via batch loads can be deduplicated against each other.
- Streaming solutions require setting a bounded mutability window per entity (e.g., 24 hours per ride), and updates arriving after that window closes cannot be applied. The business had a hard requirement to update a past transactional entity even after it had been written to storage, coupled with the need for the entity to be queryable as soon as possible (at the end of a ride, for example, if not earlier).
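For context on what destination-side deduplication looks like in ClickHouse, one common approach (a sketch under assumed names, not necessarily what was deployed) is a ReplacingMergeTree keyed on the entity ID with a version column, so a late update to a past ride simply supersedes the earlier row:

```sql
-- Sketch of destination-side deduplication with ReplacingMergeTree
-- (hypothetical schema). Rows sharing the same ORDER BY key are collapsed
-- during background merges, keeping the row with the highest `version`.
CREATE TABLE rides_deduped
(
    ride_id    UInt64,
    region     LowCardinality(String),
    started_at DateTime,
    fare_usd   Decimal(10, 2),
    version    UInt64  -- e.g., an update timestamp from the source
)
ENGINE = ReplacingMergeTree(version)
ORDER BY ride_id;

-- Merges are asynchronous, so a read that must see exactly one row per
-- ride can force collapsing at query time with FINAL:
SELECT * FROM rides_deduped FINAL WHERE ride_id = 42;
```

Because rows can be re-inserted at any time and the engine collapses duplicates itself, this pattern avoids the bounded mutability window that a streaming-layer solution would impose, at the cost of eventual (or FINAL-forced) deduplication on read.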