Hive cluster by vs order by vs sort by

Updated: 2024-10-10 19:20:31

Question

As far as I understand:

  • sort by only sorts within each reducer

  • order by orders things globally but shoves everything into one reducer

  • cluster by intelligently distributes stuff into reducers by the key hash and then does a sort by

So my question is: does cluster by guarantee a global order? distribute by puts the same keys into the same reducers, but what about adjacent keys?

The only document I can find on this is here and from the example it seems like it orders them globally. But from the definition I feel like it doesn't always do that.

Solution

A shorter answer: no, not by itself. CLUSTER BY sends all rows with the same key to the same reducer and sorts the rows within each reducer, but it does not order keys across reducers. You only get a global ordering if you are willing to merge the multiple sorted output files yourself.

The longer version:

  • ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.
  • SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
  • DISTRIBUTE BY x: ensures that all rows with the same value of x go to the same reducer, but doesn't sort the output of each reducer. Keys are assigned to reducers by hash, so adjacent keys can land on different reducers and the key ranges of the reducers can overlap. You end up with N or more unsorted files, each holding a disjoint set of keys.
  • CLUSTER BY x: the same as doing (DISTRIBUTE BY x and SORT BY x). Each of N reducers gets a disjoint set of keys, then sorts its rows by x. You end up with N or more sorted files whose key sets are disjoint but whose ranges can still overlap.

Make sense? So CLUSTER BY is not simply a more scalable version of ORDER BY: it gives you sorted files with disjoint keys, and you still have to merge them to get one globally sorted result.
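
To make the four clauses above concrete, here is a toy Python model of how rows might be routed to reducers and sorted. It is not Hive and does not call any Hive API: the function names, the round-robin routing used for SORT BY, and the `key % num_reducers` partitioner (an integer key used as its own hash) are all assumptions made for illustration. Under those assumptions, it shows what "joining the output files yourself" involves for CLUSTER BY: concatenating the files in reducer order is not enough, while a k-way merge of the sorted files is.

```python
import heapq


def partition(key, num_reducers):
    # Assumed toy partitioner: an integer key acts as its own hash.
    return key % num_reducers


def order_by(rows):
    # A single reducer receives every row and sorts it: one sorted file.
    return [sorted(rows)]


def sort_by(rows, num_reducers):
    # Rows are routed without regard to the key (round-robin here, an
    # assumption), then each reducer sorts only what it received.
    return [sorted(rows[i::num_reducers]) for i in range(num_reducers)]


def distribute_by(rows, num_reducers):
    # Equal keys always reach the same reducer; nothing is sorted.
    files = [[] for _ in range(num_reducers)]
    for key in rows:
        files[partition(key, num_reducers)].append(key)
    return files


def cluster_by(rows, num_reducers):
    # DISTRIBUTE BY followed by a per-reducer SORT BY.
    return [sorted(f) for f in distribute_by(rows, num_reducers)]


rows = [5, 2, 9, 2, 7, 4, 1, 8, 3, 6]

clustered = cluster_by(rows, 3)
print(clustered)      # [[3, 6, 9], [1, 4, 7], [2, 2, 5, 8]]

# Reading the files one after another does not give a sorted sequence.
concatenated = [key for f in clustered for key in f]
print(concatenated)   # [3, 6, 9, 1, 4, 7, 2, 2, 5, 8]

# A k-way merge of the already sorted files does.
merged = list(heapq.merge(*clustered))
print(merged)         # [1, 2, 2, 3, 4, 5, 6, 7, 8, 9]
```

In this model the key 2 appears in only one CLUSTER BY file, whereas `sort_by(rows, 3)` places it in two different files. That is the practical difference between the two: CLUSTER BY keeps equal keys together, but neither produces range-partitioned output.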

Published: 2023-11-25 10:24:38