Druid Deprecation and ClickHouse Adoption at Lyft
ClickHouse is an open-source, column-oriented database for online analytical processing. One of its standout traits is high performance, which comes from a combination of column-oriented data storage and processing, data compression, and indexing.
Initial Use Case
In 2020, while the data platform team was managing Druid, the marketplace team considered a new set of requirements:
- Data produced is immediately available for querying in near real-time
- Latencies are sub-second for business dashboarding
- Ingested datasets can be sliced and diced quickly (for example: how many rides in the last 2 hours in the SF region?)
- Nested data support
- Support for both real-time and batch ingestion
- Native data deduplication at destination
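To make the slice-and-dice requirement concrete, here is a minimal Python sketch of the example question (rides in the last 2 hours in one region). The event records, field names (`ride_id`, `region`, `requested_at`), and the `count_recent_rides` helper are hypothetical and only illustrate the shape of the query; they are not taken from any real schema.

```python
from datetime import datetime, timedelta

# Hypothetical ride events; the field names are illustrative assumptions.
rides = [
    {"ride_id": "r1", "region": "SF", "requested_at": datetime(2020, 6, 1, 11, 30)},
    {"ride_id": "r2", "region": "SF", "requested_at": datetime(2020, 6, 1, 9, 15)},
    {"ride_id": "r3", "region": "NYC", "requested_at": datetime(2020, 6, 1, 11, 45)},
    {"ride_id": "r4", "region": "SF", "requested_at": datetime(2020, 6, 1, 10, 5)},
]


def count_recent_rides(events, region, now, window=timedelta(hours=2)):
    """Count events in `region` whose timestamp lies within `window` before `now`."""
    cutoff = now - window
    return sum(
        1
        for event in events
        if event["region"] == region and cutoff <= event["requested_at"] <= now
    )


now = datetime(2020, 6, 1, 12, 0)
sf_rides = count_recent_rides(rides, "SF", now)
print(sf_rides)
```

In an analytical database the same question would be a filtered aggregate over a time column and a dimension column; the sub-second latency requirement is about answering it over far larger volumes than this toy list.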
While the latest version of Druid would provide us with some of these features, such as nested joins (v0.18), other requirements such as deduplication at destination would not be well satisfied. Using our existing stack, we considered performing deduplication at the streaming layer instead of at the destination.
However, two main reasons prevented us from pursuing this idea:
- We wanted deduplication to happen at the destination storage layer, so that data is deduplicated across both the stream and batch loads.
- Streaming solutions require setting up a mutability window per entity (e.g., 24 hours per ride). Mutability was a hard requirement from the business side, because a past transactional entity already written to storage might still need to be updated. At the same time, the entity had to be queryable as soon as possible (at the end of a ride, for example, if not earlier).
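The idea of deduplicating at the destination, across both stream and batch loads, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the `ride_id` key, the `version` column, and the `deduplicate` helper are all hypothetical, and the last-write-wins rule is one possible policy rather than a description of how any particular database implements it.

```python
def deduplicate(stream_rows, batch_rows):
    """Keep one row per ride_id: the row with the highest version (last write wins).

    Rows from both sources are merged in one place, so a late batch correction
    can replace a row that the stream wrote earlier.
    """
    latest = {}
    for row in list(stream_rows) + list(batch_rows):
        current = latest.get(row["ride_id"])
        if current is None or row["version"] > current["version"]:
            latest[row["ride_id"]] = row
    return latest


# Hypothetical rows: the stream delivers early states, the batch load
# re-delivers one row unchanged and corrects another after the fact.
stream_rows = [
    {"ride_id": "r1", "version": 1, "status": "in_progress"},
    {"ride_id": "r1", "version": 2, "status": "completed"},
    {"ride_id": "r2", "version": 1, "status": "completed"},
]
batch_rows = [
    {"ride_id": "r1", "version": 2, "status": "completed"},
    {"ride_id": "r2", "version": 3, "status": "refunded"},
]

table = deduplicate(stream_rows, batch_rows)
for ride_id, row in sorted(table.items()):
    print(ride_id, row["version"], row["status"])
```

Because the merge happens where the data lands, each row is queryable as soon as it is written, and no per-entity window has to be held open in the streaming layer to wait for possible updates.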