My question concerns the following DAG:
I thought that Spark divides a job into different stages when shuffling is required. Consider Stage 0 and Stage 1: they contain operations that do not require shuffling. So why does Spark split them into different stages?
I thought that the actual movement of data across partitions should happen at Stage 2, because that is where we need to cogroup. But to cogroup, we need data from Stage 0 and Stage 1.
So does Spark keep the intermediate results of these stages and then apply them in Stage 2?
Recommended answer
You should think of a single "stage" as a series of transformations that can be performed on each of the RDD's partitions without having to access data in other partitions.
In other words, if I can create an operation T that takes in a single partition and produces a new (single) partition, and apply the same T to each of the RDD's partitions, then T can be executed by a single "stage".
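A minimal sketch of that idea in plain Python, not Spark's API: an RDD is modeled as a list of partitions, and a hypothetical helper `run_stage` applies the same per-partition operation `T` to each one. The names `T`, `run_stage`, and the sample data are assumptions made for illustration only.

```python
from typing import Callable, List

Partition = List[int]


def T(partition: Partition) -> Partition:
    """A chain of narrow steps (map, then filter) that reads only its own partition."""
    doubled = [x * 2 for x in partition]
    return [x for x in doubled if x % 3 != 0]


def run_stage(partitions: List[Partition],
              op: Callable[[Partition], Partition]) -> List[Partition]:
    """Apply `op` to every partition independently; no partition sees another."""
    return [op(p) for p in partitions]


rdd = [[1, 2, 3], [4, 5, 6]]  # toy "RDD" with two partitions
result = run_stage(rdd, T)
print(result)
```

Because each call to `T` touches only one partition, the calls could run on different executors with no data exchange, which is the property that lets the whole chain live in one stage.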
Now, stage 0 and stage 1 operate on two separate RDDs and perform different transformations, so they can't share the same stage. Notice that neither of these stages operates on the output of the other, so they are not "candidates" for being merged into a single stage.
Note that this doesn't mean they can't run in parallel: Spark can schedule both stages to run at the same time. In this case, stage 2 (which performs the cogroup) would wait for both stage 0 and stage 1 to complete, produce new partitions, shuffle them to the right executors, and then operate on these new partitions.
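The stage layout described above can be sketched with a toy lineage model in plain Python. This is not Spark's scheduler, only an illustration under stated assumptions: the `Node`, `Stage`, and `build_stages` names are hypothetical, and the RDD names (`textFile_A`, `map_A`, and so on) are invented stand-ins for whatever operations the DAG in the question actually contains. Narrow edges keep RDDs in the same stage, while a shuffle edge starts a new parent stage.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One RDD in a toy lineage graph (hypothetical model, not a Spark class)."""
    name: str
    narrow_parents: List["Node"] = field(default_factory=list)
    shuffle_parents: List["Node"] = field(default_factory=list)


@dataclass
class Stage:
    rdds: List[str]     # RDDs pipelined together inside this stage
    parents: List[str]  # last RDD of each stage this one must wait for


def build_stages(final: Node,
                 stages: Optional[Dict[str, Stage]] = None) -> Dict[str, Stage]:
    """Cut the lineage into stages, keyed by the last RDD of each stage."""
    stages = {} if stages is None else stages
    if final.name in stages:
        return stages
    rdds: List[str] = []
    parents: List[str] = []
    stack = [final]
    while stack:
        node = stack.pop()
        if node.name in rdds:
            continue
        rdds.append(node.name)
        stack.extend(node.narrow_parents)   # narrow edge: stays in this stage
        for parent in node.shuffle_parents:  # shuffle edge: stage boundary
            if parent.name not in parents:
                parents.append(parent.name)
            build_stages(parent, stages)
    stages[final.name] = Stage(rdds, parents)
    return stages


# Two independent inputs, each with a narrow transformation, then a cogroup.
a_map = Node("map_A", narrow_parents=[Node("textFile_A")])
b_map = Node("map_B", narrow_parents=[Node("textFile_B")])
cogroup = Node("cogroup", shuffle_parents=[a_map, b_map])
out = Node("mapValues", narrow_parents=[cogroup])

stages = build_stages(out)
for last_rdd, stage in stages.items():
    print(last_rdd, stage.rdds, "waits for", stage.parents)
```

In this model the two input chains end up in separate stages with no parents, so nothing prevents them from running concurrently, and the stage containing `cogroup` lists both of them as parents it has to wait for.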