As far as I understand:
- sort by only sorts within each reducer
- order by orders things globally but shoves everything into one reducer
- cluster by intelligently distributes stuff into reducers by the key hash and then does a sort by
So my question is: does cluster by guarantee a global order? distribute by puts the same keys into the same reducers, but what about adjacent keys?
The only document I can find on this is here and from the example it seems like it orders them globally. But from the definition I feel like it doesn't always do that.
Solution
A shorter answer: yes, CLUSTER BY guarantees global ordering, provided you're willing to join the multiple output files yourself.
The longer version:
- ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.
- SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
- DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, but doesn't sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.
- CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.
Make sense? So CLUSTER BY is basically the more scalable version of ORDER BY.
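To make the four clauses above a bit more concrete, here is a small Python sketch that imitates them on a list of integer keys. It is only a toy model, not how Hive is implemented: the function names are made up for illustration, and the `key % n` partitioner is an assumption standing in for the key hash the question mentions.

```python
import heapq


def order_by(rows):
    """ORDER BY: a single reducer sees every row and sorts them all."""
    return [sorted(rows)]


def distribute_by(rows, n, partition=lambda key, n: key % n):
    """DISTRIBUTE BY: route each row to one of n reducers, no sorting.

    The default partitioner (key % n) is an assumed stand-in for a key hash.
    """
    reducers = [[] for _ in range(n)]
    for key in rows:
        reducers[partition(key, n)].append(key)
    return reducers


def sort_by(reducers):
    """SORT BY: sort inside each reducer, whatever rows it received."""
    return [sorted(part) for part in reducers]


def cluster_by(rows, n):
    """CLUSTER BY: DISTRIBUTE BY followed by SORT BY on the same key."""
    return sort_by(distribute_by(rows, n))


def merge_outputs(reducers):
    """Join the sorted reducer outputs into one sorted sequence."""
    return list(heapq.merge(*reducers))


rows = [7, 2, 9, 4, 1, 8, 3, 6, 5, 0]
clustered = cluster_by(rows, 3)     # three sorted outputs, one per reducer
combined = merge_outputs(clustered)  # one globally sorted sequence
```

Under the hash-style assumption used here, every key lands in exactly one reducer, but neighbouring keys can land in different ones, so the outputs are joined with a merge rather than by plain concatenation. With a range-based partitioner passed to `distribute_by`, concatenating the files in reducer order would be enough.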