如何通过蜂巢中的collect

编程入门 行业动态 更新时间:2024-10-19 22:33:17
本文介绍了如何通过蜂巢中的collect_set()操作使用订单的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述

在表1中,我有customer_id,item_id和item_rank(根据某些销售情况的项目等级).我想为每个customer_id收集项目列表,并根据item_rank进行排列.

In Table 1, I have customer_id, item_id and item_rank (rank of item according to some sales). I want to collect a list of items for each customer_id and arrange them according to item_rank.

Customer_id item_id rank_item 23 2 3 23 2 3 23 4 2 25 5 1 25 4 2

我期望的输出是

Customer_id item_list 23 4,2 25 5,4

我使用的代码是

SELECT customer_id, concat_ws(',',collect_list (string(item_id))) AS item_list FROM table1 GROUP BY customer_id ORDER BY item_rank

推荐答案

您可以使用子查询来获取结果集("customer_id","item_id","item_rank"),并按item_rank进行排序,然后使用 collect_set 在外部查询中.

You can use a sub-query to get a result set of (customer_id, item_id, item_rank), sorted by item_rank, and then use collect_set in the outer query.

查询

WITH table1 AS ( SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL SELECT 23 AS customer_id, 4 AS item_id, 2 AS item_rank UNION ALL SELECT 25 AS customer_id, 5 AS item_id, 1 AS item_rank UNION ALL SELECT 25 AS customer_id, 4 AS item_id, 2 AS item_rank ) SELECT subquery.customer_id, collect_set(subquery.item_id) AS item_id_set FROM ( SELECT table1.customer_id, table1.item_id, table1.item_rank FROM table1 DISTRIBUTE BY table1.customer_id SORT BY table1.customer_id, table1.item_rank ) subquery GROUP BY subquery.customer_id ;

结果

customer_id item_id_set 0 23 [4,2] 1 25 [5,4]

子查询使用 DISTRIBUTE BY 来确保特定 customer_id 的所有行都路由到相同的reducer.然后,它使用 SORT BY 在每个化简器中按 customer_id 和 item_rank 进行排序.我希望这足以满足要求,因为我没有注意到对最终结果集进行整体排序的要求.(如果需要按 customer_id 进行总排序,那么我认为查询将不得不使用 ORDER BY ,这将导致执行速度变慢.)

The sub-query uses DISTRIBUTE BY to guarantee that all rows for a particular customer_id route to the same reducer. It then uses SORT BY to sort by customer_id and item_rank within each reducer. I expect this is sufficient for the requirements, because I didn't notice a requirement for total ordering of the final result set. (If total ordering by customer_id is a requirement, then I think the query would have to use ORDER BY, which would cause slower execution.)

内部, collect_set UDAF使用Java LinkedHashSet ,这是一个保留顺序的集合,因此子查询中使用的排序顺序将保留在外部查询的集合中.这在Hive代码库中可见:

Internally, the collect_set UDAF uses a Java LinkedHashSet, which is an order-preserving collection, so the same sort order used in the sub-query will be maintained in the outer query's set. This is visible in the Hive codebase here:

github/apache/hive/blob/release-2.0.0/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java#L93

更多推荐

如何通过蜂巢中的collect

本文发布于:2023-11-22 05:45:18,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1616191.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:蜂巢   collect

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!