How to sort within partitions (and avoid sorting across partitions) using the RDD API?

Updated: 2024-10-26 20:35:19

Problem description

By default, the Hadoop MapReduce shuffle sorts the shuffle keys within each partition, but not across partitions (it is total ordering that makes keys sorted across partitions).

I would like to know how to achieve the same thing with a Spark RDD (sort within each partition, but not across partitions).

RDD's sortByKey method performs a total ordering.

RDD's repartitionAndSortWithinPartitions sorts within each partition but not across partitions; unfortunately, it adds an extra step to repartition the data.

Is there a direct way to sort within each partition but not across partitions?

Recommended answer

You can use a Dataset and the sortWithinPartitions method:

    // Assumes a spark-shell session, where `spark` and `sc` are predefined.
    import spark.implicits._

    sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
      .toDF("text")
      .sortWithinPartitions($"text")
      .show

    +----+
    |text|
    +----+
    |   d|
    |   e|
    |   f|
    |   a|
    |   b|
    |   c|
    +----+
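To confirm that the ordering is per partition rather than global, one option is to tag each row with the id of the partition it lives in. The following is a minimal sketch that assumes a local SparkSession; the object name SortWithinPartitionsCheck and the method sortedWithPartitionId are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.spark_partition_id

object SortWithinPartitionsCheck {
  // Hypothetical helper: builds the same six-row, two-partition dataset,
  // sorts it within each partition, and adds a column holding the partition id.
  def sortedWithPartitionId(spark: SparkSession): DataFrame = {
    import spark.implicits._

    spark.sparkContext
      .parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
      .toDF("text")
      .sortWithinPartitions($"text")
      .withColumn("partition", spark_partition_id())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sort-within-partitions-check")
      .getOrCreate()

    sortedWithPartitionId(spark).show()
    spark.stop()
  }
}
```

Rows that share a partition id should appear in ascending order, while the sequence as a whole is not globally sorted.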

In general, the shuffle is an important part of sorting partitions, because the sort can reuse the shuffle's data structures and so does not need to load all the data into memory at once.
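If staying on the RDD API without any repartition step is the priority, one possible approach is to sort each partition by hand with mapPartitions. This is only a sketch under stated assumptions, not the method from the answer above: the helper name sortWithinEachPartition is hypothetical, and it materializes each partition as an in-memory array, so it gives up exactly the memory advantage that the shuffle-based sort provides and may fail on partitions that do not fit in executor memory.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.reflect.ClassTag

object SortWithinPartitionsRdd {
  // Hypothetical helper: sorts every partition independently, with no shuffle.
  // Caveat: iter.toArray loads the whole partition into memory before sorting.
  def sortWithinEachPartition[T: Ordering: ClassTag](rdd: RDD[T]): RDD[T] =
    rdd.mapPartitions(
      iter => iter.toArray.sorted.iterator,
      preservesPartitioning = true // elements stay in their original partition
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads, for demonstration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("sort-within-partitions-rdd")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

    // glom() turns each partition into an array so the boundaries are visible.
    sortWithinEachPartition(rdd)
      .glom()
      .collect()
      .foreach(partition => println(partition.mkString(", ")))

    sc.stop()
  }
}
```

For key-value RDDs the same helper could be called with an Ordering defined on the key, for example via Ordering.by; whether this beats repartitionAndSortWithinPartitions depends on partition sizes and available memory.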