点击关注“有赞coder”
获取更多技术干货哦~
背景
理论
实践
-- with cube语法 --
select
dim1, dim2, count(*)
from t1
group by dim1, dim2 with cube;
-- 常规语法 --
select dim1, dim2, count(*) from t1 group by dim1, dim2
union all
select dim1, null, count(*) from t1 group by dim1, null
union all
select null, dim2, count(*) from t1 group by null, dim2
union all
select null, null, count(*) from t1;
一个具有N维的数据模型,做完Cube操作,能产生2N种聚合方式。
withcube
不同的是,该语法对groupBy子句中维度列的顺序敏感,它只返回第一个分组条件指定的列的统计行,改变groupBy列的顺序会改变聚合结果。具体使用方式见代码: -- with rollup语法 --
select
dim1, dim2, count(*)
from t1
group by dim1, dim2 with rollup;
-- 常规语法 --
select dim1, dim2, count(*) from t1 group by dim1, dim2
union all
select dim1, null, count(*) from t1 group by dim1, null
union all
select null, null, count(*) from t1;
一个具有N维的数据模型,做完Rollup操作,能产生N+1种聚合方式。
该语法最为灵活,可以自由配置需要聚合的列,通过维护聚合列组合的配置来完成,强烈推荐使用该方法。比如只需要(dim1), (dim1, dim2) 这两种粒度的汇总,直接配置即可,不需要的聚合粒度无需配置,具体使用方式见代码:
select
dim1, dim2, count(*)
from t1
group by dim1, dim2 grouping sets ( (dim1), (dim1, dim2) )
GROUP BY a, b, c WITH CUBE
is equivalent to
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( ))
GROUP BY a, b, c, WITH ROLLUP
is equivalent to
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( ))
select
dim1, dim2, count(*), grouping__id
from t1
group by dim1, dim2 grouping sets ( (dim1, dim2), (dim1), (dim2), () );
hive1.0以后,grouping__id的生成算法与spark一致。在hive1.0以前,生成算法与spark不一致。在生产环境,任务的运行存在降级处理,例如spark引擎执行失败,会尝试降级到mapreduce引擎执行,如果grouping__id的生成算法不一致,对下游从Cube拿汇总结果的任务会产生严重的影响,故在实际生产环境,直接使用该方法的场景较少。
concat_ws
(
':'
, case when dim1 is not null then 'dim1' else null end
, case when dim2 is not null then 'dim2' else null end
) as group_id
1.在使用Cube基于各主题域明细中间层做多维运算之前,务必,务必,务必确保每个维度都做了空处理操作,否则会导致通过上面代码实现的group_id无法区分是数据NULL还是非聚合列产生的NULL,从而导致 无法准确拿到指定聚合粒度的汇总数据。 2.增减维度需要注意维护生成group_id的代码!当新增维度,不能随便位置添加,需要在尾部追加,不能影响已生成的group_id;当减维度,注意要下线使用相关汇总数据的表。
grouping sets
(
(shop_id, goods_id, placed_order_date)
, (shop_id, goods_id, log_src_channel, placed_order_date)
)
现在需要分别拿到 「店铺+商品+日粒度」和「店铺+商品+来源渠道+日粒度」的汇总数据,按照上面代码实现的group_id,现在获取汇总数据的方式见代码:
-- 店铺+商品+日粒度
from
tmp_cube
where
group_id = 'shop_id:goods_id:placed_order_date'
-- 店铺+商品+来源渠道+日粒度
from
tmp_cube
where
group_id = 'shop_id:goods_id:log_src_channel:placed_order_date'
select
, shop_id as shop_id
, goods_id as goods_id
, buyer_id as buyer_id
, log_src_channel as log_src_channel
, substr(placed_order_time, 1, 10) as placed_order_date
, substr(placed_order_time, 12, 2) as placed_order_hour
, case when datediff( '${DP_1_DAYS_AGO_Y_m_d}', substr(placed_order_time,1,10) ) <= 0
and datediff( '${DP_0_DAYS_AGO_Y_m_d}', substr(placed_order_time,1,10) ) > 0
then 1
else 0
end as is_placed_1day -- 是否当日
, case when datediff( '${DP_30_DAYS_AGO_Y_m_d}', substr(placed_order_time,1,10) ) <= 0
and datediff( '${DP_0_DAYS_AGO_Y_m_d}', substr(placed_order_time,1,10) ) > 0
then 1
else 0
end as is_placed_30day -- 是否近30天
, case when datediff( '${DP_1_WEEKS_AGO_MONDAY_Y_m_d}', substr(placed_order_time,1,10) ) <= 0
and datediff( '${DP_0_WEEKS_AGO_MONDAY_Y_m_d}', substr(placed_order_time,1,10) ) > 0
then 1
else 0
end as is_placed_weekly -- 是否当周
, case when datediff( '${DP_FIRST_DAY_PRE_MONTH_Y_m_d}', substr(placed_order_time,1,10) ) <= 0
and datediff( '${DP_FIRST_DAY_THIS_MONTH_Y_m_d}',substr(placed_order_time,1,10) ) > 0
then 1
else 0
end as is_placed_monthly -- 是否当月
, sum( sku_real_pay ) as amt
, count( distinct buyer_id ) as uv
from
tmp_table
where
to_date(placed_order_time) >= '${DP_FIRST_DAY_PRE_MONTH_Y_m_d}'
and
to_date(placed_order_time) < '${DP_0_DAYS_AGO_Y_m_d}'
group by
......
grouping sets (
(shop_id, goods_id, is_placed_1day, placed_order_hour)
, (shop_id, goods_id, is_placed_1day)
, (shop_id, goods_id, is_placed_weekly)
, (shop_id, goods_id, is_placed_monthly)
)
-- 店铺+商品+日粒度
from
tmp_cube
where
group_id = 'shop_id:goods_id:is_placed_1day'
and
is_placed_1day = 1
-- 店铺+商品+周粒度
from
tmp_cube
where
group_id = 'shop_id:goods_id:is_placed_weekly'
and
is_placed_weekly = 1
set hive.new.job.grouping.set.cardinality = 100;
代码开发:代码开发的效率
代码发布:代码发布的便捷性
可维护性:迭代逻辑的成本
单点故障:失败任务的影响面
重刷刷数:补历史的成本
机器成本:性能开销