ClickHouse is an open-source, column-oriented database for online analytical processing. One of ClickHouse's standout traits is its high performance, which comes from a combination of column-based data storage and processing, data compression, and indexing.


Initial Use Case


In 2020, while the data platform team was managing Druid, the marketplace team considered a new set of requirements:


  1. Data produced is immediately available for querying in near real-time
  2. Latencies are sub-second for business dashboarding
  3. Ingestion that supports quick slicing and dicing of datasets (for example: how many rides occurred in the last 2 hours in the SF region?)
  4. Nested data support
  5. Support for both real-time and batch ingestion
  6. Native data deduplication at the destination

While the latest version of Druid would provide some of these features, such as joins (v0.18), other requirements such as deduplication at the destination were not well served. Using our existing stack, we considered performing deduplication at the streaming layer instead of at the destination.
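For context on what "deduplication at the destination" can look like in ClickHouse, the sketch below uses the `ReplacingMergeTree` engine, which collapses rows sharing the same sorting key during background merges. The table and column names are hypothetical, not the team's actual schema:

```sql
-- Hypothetical rides table: rows with the same ride_id are deduplicated
-- during background merges, keeping the row with the latest updated_at.
CREATE TABLE rides
(
    ride_id    String,
    region     String,
    updated_at DateTime,
    fare       Float64
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY ride_id;

-- FINAL forces deduplication at query time, since background merges
-- are asynchronous and may not have run yet.
SELECT count()
FROM rides FINAL
WHERE region = 'SF' AND updated_at > now() - INTERVAL 2 HOUR;
```

Because merges are eventual, queries that must see fully deduplicated data pay the cost of `FINAL` (or an equivalent `GROUP BY`/`argMax` pattern); this trade-off is one reason dedup-at-destination requirements need careful engine selection.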


However, two main reasons prevented us from pursuing this idea:


  1. We would want to perform this at the destination storage layer, so that deduplication covered both streaming and batch loads.
  2. Streaming solutions require setting up a mutability window per entity (for example, 24 hours per ride). This was a hard requirement from the business end, due to possible scenarios of updating a past transactional entity already written to storage. This was coupled with the need for the entity to be queryable as soon as possible (at the end of a ride, for example, if not earlier).
