Understanding the Spark DAG

Updated: 2024-10-09 16:25:52
Problem description

I have the following DAG:

I thought that Spark divides a job into different stages when shuffling is required. Consider Stage 0 and Stage 1. They contain operations that do not require shuffling. So why does Spark split them into different stages?

I thought that the actual movement of data across partitions should happen at Stage 2, because that is where we need to cogroup. But to cogroup we need the data from Stage 0 and Stage 1.

So does Spark keep the intermediate results of these stages and then apply them in Stage 2?

Recommended answer

You should think of a single "stage" as a series of transformations that can be performed on each of the RDD's partitions without having to access data in other partitions.

In other words, if I can create an operation T that takes in a single partition and produces a new (single) partition, and apply the same T to each of the RDD's partitions, then T can be executed by a single "stage".
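The idea of an operation T can be sketched in plain Python. This is only a model of the concept, not the Spark API: the names `compose` and `run_stage` are hypothetical, partitions are represented as ordinary lists, and the map-like and filter-like steps are made up for illustration.

```python
def compose(*steps):
    """Chain several per-partition steps into one operation T."""
    def T(partition):
        data = partition
        for step in steps:
            data = step(data)
        return list(data)
    return T


def run_stage(partitions, T):
    # Each output partition is computed from exactly one input partition,
    # so no partition ever needs data from another one.
    return [T(p) for p in partitions]


# A hypothetical "RDD" with three partitions.
rdd = [[1, 2, 3], [4, 5], [6]]

# Two chained steps (a map-like step, then a filter-like step) fused into one T.
T = compose(
    lambda part: (x * 10 for x in part),
    lambda part: (x for x in part if x > 10),
)

result = run_stage(rdd, T)
print(result)  # [[20, 30], [40, 50], [60]]
```

Because every step works on one partition at a time, the whole chain can be applied partition by partition, which is what allows it to be grouped into a single stage in this model.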

Now, stage 0 and stage 1 operate on two separate RDDs and perform different transformations, so they can't share the same stage. Notice that neither of these stages operates on the output of the other, so they are not "candidates" for creating a single stage.

Note that this doesn't mean they can't run in parallel: Spark can schedule both stages to run at the same time. In this case, stage 2 (which performs the cogroup) would wait for both stage 0 and stage 1 to complete and produce new partitions, which are shuffled to the right executors, and would then operate on these new partitions.
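The dependency between the three stages can also be modelled in plain Python. This is a rough sketch under stated assumptions, not Spark's implementation: `shuffle_write`, `cogroup_partition`, and `NUM_REDUCERS` are hypothetical names, the sample data and transformations are invented, integer keys are routed with a simple modulo, and a thread pool stands in for the scheduler. The bucketed output of each upstream stage plays the role of the intermediate result that is kept until the cogroup stage reads it.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_REDUCERS = 2  # hypothetical number of partitions on the cogroup side


def shuffle_write(partitions, transform):
    """Model of an upstream stage: transform each record, then bucket it
    by the partition that will need it in the cogroup stage."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for partition in partitions:
        for key, value in map(transform, partition):
            buckets[key % NUM_REDUCERS].append((key, value))
    return buckets


def cogroup_partition(left_bucket, right_bucket):
    """Model of one cogroup task: collect the values for each key from both sides."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left_bucket:
        grouped[key][0].append(value)
    for key, value in right_bucket:
        grouped[key][1].append(value)
    return dict(grouped)


# Two independent inputs, each with its own (different) transformation.
left = [[(1, "a"), (2, "b")], [(3, "c")]]
right = [[(1, 10)], [(2, 20), (2, 21)]]

with ThreadPoolExecutor() as pool:
    # "Stage 0" and "stage 1" do not depend on each other, so both can be submitted at once.
    stage0 = pool.submit(shuffle_write, left, lambda kv: (kv[0], kv[1].upper()))
    stage1 = pool.submit(shuffle_write, right, lambda kv: (kv[0], kv[1] + 1))
    # "Stage 2" cannot start until both upstream outputs exist.
    out0, out1 = stage0.result(), stage1.result()

stage2 = [cogroup_partition(out0[i], out1[i]) for i in range(NUM_REDUCERS)]
print(stage2)
```

In this model the cogroup for a given key only sees complete data because both upstream outputs are fully written before any cogroup task runs, which is why the third stage has to wait for the other two.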

Published: 2023-11-23 22:15:17
Article link: https://www.elefans.com/category/jswz/34/1623009.html