Does it ever make sense to run more partitions than the number of cores, given that the number of Spark tasks running in parallel cannot be higher than the number of cores? If so, could you elaborate?
Solution

- As you mentioned, you need at least 1 task per core to make use of all of the cluster's resources.
- Depending on the type of processing required at each stage/task you may have processing/data skew, which can be somewhat alleviated by making partitions smaller / creating more partitions, giving you better utilization of the cluster (e.g. while one executor runs a longer task that takes 5 minutes, the other executors are able to run 10 shorter tasks of 30 seconds each).
- There might be other scenarios where you want to increase the number of partitions (e.g. if you hit size / memory limitations).
Check out this nice article about parallelism tuning: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
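The two bullets above combine into a simple rule of thumb. As a plain-Python sketch (this is an illustrative heuristic, not a Spark API; the function name, the 128 MiB target, and the tasks-per-core factor are my assumptions), you can take the larger of a size-driven and a core-driven partition count:

```python
def suggest_partitions(input_bytes, total_cores,
                       target_partition_bytes=128 * 1024 * 1024,
                       tasks_per_core=3):
    """Illustrative heuristic (not a Spark API): take the larger of
    (a) enough partitions to keep each one under the target size, and
    (b) a small multiple of the core count, so shorter tasks can fill
    the gaps left by longer, skewed ones."""
    by_size = -(-input_bytes // target_partition_bytes)  # ceiling division
    by_cores = total_cores * tasks_per_core
    return max(by_size, by_cores)

# Example: 10 GiB of input on a 16-core cluster.
# Size-driven: 10 GiB / 128 MiB = 80; core-driven: 16 * 3 = 48.
print(suggest_partitions(10 * 1024**3, 16))  # -> 80
```

The `tasks_per_core` multiplier (2–4x is a commonly cited range) exists precisely so a core that finishes early can pick up another waiting task instead of idling.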
Update: here is how more partitions can help with processing/data skew and give you better cluster utilization and faster job execution. (In the Spark UI you can see the skew between tasks by comparing the Median vs. Max task duration.)
Let's say you have a cluster that can run 2 tasks in parallel.
- Processing the data takes 60 minutes with 1 task (1 idle core): the job takes 60m.
- If you split it in 2 you may find, because of the skew: Task 1: 45m, Task 2: 15m. The job takes 45m (for 30m you had 1 idle core).
- If you split it in 4 you may get: Task 1: 30m, Task 2: 10m, Task 3: 10m, Task 4: 10m. The job takes 30m (one core runs 1 task for 30m while the other runs the 3 smaller tasks of 10m each). Etc.
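The worked example above can be checked with a tiny scheduling simulation. This is a sketch, not how Spark's scheduler is implemented: it greedily assigns each task (longest first) to the least-loaded core, which is enough to reproduce the 60m / 45m / 30m job durations:

```python
import heapq

def makespan(task_minutes, cores=2):
    """Greedy longest-task-first scheduling: each task goes to the
    currently least-loaded core; returns total wall-clock minutes."""
    loads = [0] * cores          # running total per core
    heapq.heapify(loads)
    for t in sorted(task_minutes, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + t)
    return max(loads)

# The three splits from the example, on a 2-core cluster:
print(makespan([60]))              # 1 partition  -> 60
print(makespan([45, 15]))          # 2 partitions -> 45
print(makespan([30, 10, 10, 10]))  # 4 partitions -> 30
```

Note the pattern: finer partitions don't reduce the total work, they just let the short tasks pack around the long one, so the slowest core finishes sooner.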