Databend官方文档:https://databend.rs/doc/
Databend 是一款开源的数据仓库产品,主要定位于OLAP场景,采用云原生架构理念(可对比snowflake),有非常好的扩展性、同时具备低成本、高性能的优势,兼容MySQL协议。
特点:计算-存储分离架构, 易扩展;parquet列式存储,高压缩;兼容MySQL协议(完善中)
优点:基于云基础设施能力,充分利用了s3对象存储高吞吐,高可靠,低成本的优点,真正实现按用量付费
以腾讯云为例,COS(兼容S3协议)成本远远低于CBS块存储,不到普通云硬盘的1/10,不到云SSD的1/30,且COS本身是多副本
导入数据非常快,count 全表实时返回
高压缩parquet列存格式(因高度重复,数据压缩比惊人,仅供参考)
MySQL [dmall]> insert into t2 select * from t2;
Query OK, 0 rows affected (51.86 sec)
Read 268435456 rows, 8.32 GB in 51.852 sec., 5.18 million rows/sec., 160.48 MB/sec.
Files num: 680
Files size: 57.8M
全表扫描,聚合分析,性能高
MySQL [dmall]> select max(b), count(b), sum(b) from t2;
+--------+-----------+-------------+
| max(b) | count(b) | sum(b) |
+--------+-----------+-------------+
| 100 | 536870912 | 53687091200 |
+--------+-----------+-------------+
1 row in set (21.89 sec)
暂不支持update、delete等DDL操作(开发者确认后续会加上)
MySQL [dmall]> update t2 set b=b+1 limit 1;
ERROR 1105 (HY000): Code: 1005, displayText = sql parser error: Expected an SQL statement, found: update.
json类型(Databend中采用VARIANT关键字,相关函数与MySQL区别较大)
MySQL [dmall]> create table tj(id int, j json);
ERROR 1105 (HY000): Code: 1007, displayText = Unsupported data_type: JSON.
MySQL [dmall]> create table tj(id int, j VARIANT);
Query OK, 0 rows affected (0.02 sec)
Read 0 rows, 0 B in 0.022 sec., 0 rows/sec., 0 B/sec.
MySQL [dmall]> insert into tj values(1, '{"key":"databend-json-tpye-name", "v":"VARIANT"}');
Query OK, 0 rows affected (0.21 sec)
Read 1 rows, 84 B in 0.207 sec., 4.82 rows/sec., 405.05 B/sec.
MySQL [dmall]> select * from tj;
+------+-------------------------------------------------+
| id | j |
+------+-------------------------------------------------+
| 1 | {"key":"databend-json-tpye-name","v":"VARIANT"} |
+------+-------------------------------------------------+
1 row in set (0.12 sec)
MySQL [dmall]> select parse_json(j) from tj;
+-------------------------------------------------+
| parse_json(j) |
+-------------------------------------------------+
| {"key":"databend-json-tpye-name","v":"VARIANT"} |
+-------------------------------------------------+
1 row in set (0.07 sec)
没有索引,没有唯一约束(与OLTP数据库不同,数仓里单表TB或更大是常态,唯一约束会严重限制数据写入性能,因此在Hive或Snowflake中,也没有唯一约束)
1)MySQL中的表结构定义:
MySQL [dmall]> show create table _idb_log\G
*************************** 1. row ***************************
Table: _idb_log
CREATE TABLE `_idb_log` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`app_module` varchar(64) DEFAULT NULL,
`user` varchar(64) NOT NULL DEFAULT 'guest',
`log_level` varchar(16) NOT NULL DEFAULT 'INFO',
`log_message` varchar(20000) DEFAULT NULL,
`created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `idx_app_module` (`app_module`),
KEY `idx_created` (`created`),
KEY `idx_log_level` (`log_level`)
) ENGINE=InnoDB AUTO_INCREMENT=15058436 DEFAULT CHARSET=utf8
2)对应 Databend 中的表结构定义:
MySQL [dmall]> show create table _idb_log\G
*************************** 1. row ***************************
Table: _idb_log
Create Table: CREATE TABLE `_idb_log` (
`id` Int32,
`app_module` String,
`user` String,
`log_level` String,
`log_message` String,
`created` DateTime32,
) ENGINE=FUSE AUTO_INCREMENT='15058436'
1 row in set (0.01 sec)
上手非常简单,最小化部署情况下,对资源要求低,仅需1台机器,启动2个进程即可
databend-meta 进程配置(也可3节点raft集群模式)
# cat etc/databend-meta.toml
log_dir = "/data/databend/_metalogs"
metric_api_address = "0.0.0.0:28100"
admin_api_address = "0.0.0.0:28101"
grpc_api_address = "0.0.0.0:9191"
[raft_config]
id = 1
raft_dir ="/data/databend/_meta1"
raft_api_port = 28103
# Start up mode: single node cluster
single = true
databend-query 进程配置(含 s3存储桶信息)
# cat etc/databend-query-node-1.toml
# Usage:
# databend-query -c databend_query_config_spec.toml
[query]
max_active_sessions = 256
wait_timeout_mills = 5000
# For flight rpc.
flight_api_address = "0.0.0.0:9091"
# Databend Query http address.
# For admin RESET API.
http_api_address = "0.0.0.0:8081"
# Databend Query metrics RESET API.
metric_api_address = "0.0.0.0:7071"
# Databend Query MySQL Handler.
mysql_handler_host = "0.0.0.0"
mysql_handler_port = 3307
# Databend Query ClickHouse Handler.
clickhouse_handler_host = "0.0.0.0"
clickhouse_handler_port = 9001
# Databend Query HTTP Handler.
http_handler_host = "0.0.0.0"
http_handler_port = 8000
tenant_id = "test_tenant"
cluster_id = "test_cluster"
table_engine_memory_enabled = true
table_engine_csv_enabled = true
table_engine_parquet_enabled = true
database_engine_github_enabled = true
table_cache_enabled = true
table_memory_cache_mb_size = 1024
table_disk_cache_root = "_cache"
table_disk_cache_mb_size = 10240
[log]
log_level = "ERROR"
log_dir = "/data/databend/_qlogs"
[meta]
# To enable embedded meta-store, set meta_address to ""
meta_embedded_dir = "/data/databend/_meta_embedded_1"
meta_address = "0.0.0.0:9191"
meta_username = "root"
meta_password = "xxx"
meta_client_timeout_in_second = 60
# Storage config.
[storage]
# disk|s3
storage_type = "s3"
# DISK storage.
[storage.disk]
data_path = "_data"
# S3 storage. If you want you s3 ,please storage type : s3
[storage.s3]
root="bucket-start-path"
bucket="cos-bucket-name"
region="ap-beijing"
endpoint_url="https://cos.ap-beijing.myqcloud.com"
access_key_id="xx"
secret_access_key="xx"
# Azure storage
[storage.azure_storage_blob]
启动进程
# cat start.sh
ulimit -n 65535
nohup ./databend-meta --config-file=./etc/databend-meta.toml 2>&1 >meta.log &
sleep 3
nohup ./databend-query --config-file=./etc/databend-query-node-1.toml 2>&1 >query.log &
echo "Please usage: mysql -h127.0.0.1 -P3307 -uroot"
跑上面5亿条数据全表扫描聚合的Query,内存占用很低,CPU跑满了(2C.8G低配CVM)
云原生数仓,理念比较先进,是最近非常热门的技术趋势;基于s3对象存储,成本极低,按量付费,可靠性还高,可充分利用云基础设施的能力,支持更高吞吐更高并发的OLAP场景;
Databend 架构清晰,起步资源要求低,部署配置上手非常容易,初步体验后,对其性能表现印象很深刻;
在行业数字化时代,会更快更多地产生数据,会要求更高效的查询分析,同时还要能有效控制成本,看好Databend的产品定位和未来发展。