By default, the Hadoop MapReduce shuffle sorts the shuffle keys within each partition, but not across partitions (sorting keys across partitions requires total ordering).
I would like to know how to achieve the same thing with a Spark RDD (sort within each partition, but not across partitions).
Is there a direct way to sort within partitions but not across partitions?
Recommended answer
You can use a Dataset and the sortWithinPartitions method:
import spark.implicits._

// Two partitions: ("e", "d", "f") and ("b", "c", "a")
sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")
  .show

+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+

In general, the shuffle is an important factor when sorting partitions, because Spark can reuse the shuffle structures to sort without loading all of the data into memory at once.
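Since the question asks about the RDD API specifically, here is a minimal sketch of two ways to do it with pair RDDs. It is not part of the original answer. The object name, app name, local master setting, the HashPartitioner with 2 partitions, and the dummy value 1 attached to each key are all assumptions made for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical standalone example; names and settings are illustrative only.
object SortWithinPartitionsRdd {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sort-within-partitions")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Key-value pairs are required; the value 1 is just a placeholder.
      val pairs = sc
        .parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
        .map(k => (k, 1))

      // Option 1: repartition by key and sort by key inside each partition,
      // with the sort performed as part of the shuffle.
      val shuffled =
        pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

      // Option 2: no shuffle; sort each existing partition on its own.
      // This materializes a whole partition in memory before sorting.
      val local = pairs.mapPartitions(
        it => it.toArray.sortBy(_._1).iterator,
        preservesPartitioning = true
      )

      // Print the keys of each partition on its own line.
      shuffled.glom().collect().foreach(p => println(p.map(_._1).mkString(", ")))
      local.glom().collect().foreach(p => println(p.map(_._1).mkString(", ")))
    } finally {
      sc.stop()
    }
  }
}
```

The first option is closer to the MapReduce behavior described in the question, because the keys are redistributed by the partitioner and sorted during the shuffle. The second option keeps the existing partitioning and avoids a shuffle, but it only works when each partition fits in memory.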