Does it make sense to run more partitions than the number of cores?

Updated: 2024-10-15 20:23:18

Question

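Given that the number of Spark tasks running concurrently cannot be higher than the number of cores, does it ever make sense to run more partitions than there are cores? If so, could you elaborate?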

Answer

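  • As you mentioned, you need at least one task per core to make use of all of the cluster's resources.
  • Depending on the type of processing each stage or task requires, you may have processing or data skew. Smaller, more numerous partitions can alleviate this to some extent and give you better cluster utilization (e.g. while one executor runs a long task that takes 5 minutes, the other executors can run 10 shorter tasks of 30 seconds each).
  • There may be other scenarios where you want to increase the number of partitions (e.g. if you hit size or memory limitations).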

Check out this nice article about parallelism tuning: blog.cloudera/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
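
The points above about task count and size limits can be turned into a rough sizing rule. The sketch below is a minimal illustration in plain Python, not part of the original answer: `suggest_partitions`, its parameters, and the 128 MiB target partition size are assumptions made for this example. The default of 3 tasks per core follows the Spark tuning guide's general recommendation of 2-3 tasks per CPU core.

```python
def suggest_partitions(total_cores, tasks_per_core=3,
                       data_size_bytes=None,
                       target_partition_bytes=128 * 1024 * 1024):
    """Return a rough partition count (hypothetical helper, not a Spark API).

    Takes the larger of two lower bounds:
    - enough tasks to keep every core busy, with a few tasks per core
      so that short tasks can fill in around long ones;
    - enough partitions to keep each one under a target size
      (128 MiB here is an assumed target, tune it for your workload).
    """
    by_cores = total_cores * tasks_per_core
    if data_size_bytes is None:
        return by_cores
    # Integer ceiling division avoids floating-point rounding.
    by_size = -(-data_size_bytes // target_partition_bytes)
    return max(by_cores, by_size)


# 8 cores, no size information: 8 * 3 tasks.
print(suggest_partitions(8))
# 8 cores, 10 GiB of input: the size bound dominates.
print(suggest_partitions(8, data_size_bytes=10 * 1024**3))
```

In a Spark job, a count like this could be passed to `repartition(n)` on an RDD or DataFrame. Treat it as a starting point and check the task durations in the Spark UI before settling on a value.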

Update: how this helps with processing/data skew and gives you better cluster utilization and faster job execution. (In the Spark UI you can see the skew between tasks by comparing the Median and Max task duration of a stage.)

Let's say you have a cluster that can run 2 tasks in parallel.

  • Processing the data as 1 task takes 60 minutes while the other core sits idle. The job takes 60m.
  • If you split it in 2, skew may give you Task-1: 45m, Task-2: 15m. The job takes 45m (for 30m you had 1 idle core).
  • If you split it in 4, you may get Task-1: 30m, Task-2: 10m, Task-3: 10m, Task-4: 10m. The job takes 30m (the first core runs 1 task for 30m while the other core runs the 3 smaller tasks of 10m each), and so on.
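
The arithmetic in the example above can be checked with a small simulation. This is a simplified model, assuming each task is handed to whichever core becomes free first and ignoring scheduling overhead and data locality; `job_duration` is a hypothetical helper written for this illustration, not a Spark API.

```python
import heapq


def job_duration(task_minutes, cores=2):
    """Simulate greedy scheduling and return the total job time in minutes.

    Tasks are taken in list order and each one is assigned to the core
    that becomes free earliest (a simplified model of a task scheduler).
    """
    free_at = [0] * cores          # time at which each core is next free
    heapq.heapify(free_at)
    for minutes in task_minutes:
        start = heapq.heappop(free_at)            # earliest free core
        heapq.heappush(free_at, start + minutes)  # busy until task ends
    return max(free_at)


# The three splits from the example, on a cluster with 2 task slots.
for tasks in ([60], [45, 15], [30, 10, 10, 10]):
    print(tasks, "->", job_duration(tasks), "minutes")
# [60] -> 60 minutes
# [45, 15] -> 45 minutes
# [30, 10, 10, 10] -> 30 minutes
```

The total amount of work is 60 minutes in every case. Only the split changes, and finer partitions let the second core absorb more of the work while the first is busy with the largest task.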

Published: 2023-11-24 13:27:46
Article link: https://www.elefans.com/category/jswz/34/1625386.html