Why doesn't Spark (on Google Dataproc) use all vcores?

Problem description

I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all of the vcores available in the cluster, as you can see below.

Based on some other questions like this and this, I had set up the cluster to use the DominantResourceCalculator, so that both vcpus and memory are considered for resource allocation:

gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
    --zone=europe-west1-c \
    --master-boot-disk-size=500GB \
    --worker-boot-disk-size=500GB \
    --master-machine-type=n1-standard-16 \
    --num-workers=10 \
    --worker-machine-type=n1-standard-16 \
    --initialization-actions gs://custom_init_gcp.sh \
    --metadata MINICONDA_VARIANT=2 \
    --properties=^--^yarn:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator

But when I submit my job with custom Spark flags, YARN does not appear to respect these custom parameters and defaults to using memory as the yardstick for resource calculation:

gcloud dataproc jobs submit pyspark --cluster cluster_name \
    --properties spark.sql.broadcastTimeout=900\
,spark.network.timeout=800\
,yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator\
,spark.dynamicAllocation.enabled=true\
,spark.executor.instances=10\
,spark.executor.cores=14\
,spark.executor.memory=15g\
,spark.driver.memory=50g \
    src/my_python_file.py
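To see why the choice of resource calculator matters, here is a rough sketch of the scheduling arithmetic for the 15 GB / 14-core executors requested above. The 52 GB of YARN-allocatable memory per n1-standard-16 worker is an assumed illustrative figure, not a verified Dataproc default:

```python
def containers_by_memory(node_mem_gb, container_mem_gb):
    # DefaultResourceCalculator: only memory is counted when placing
    # containers; each container is booked as 1 vcore in the YARN UI
    # regardless of spark.executor.cores.
    return node_mem_gb // container_mem_gb

def containers_by_dominant(node_mem_gb, node_vcores,
                           container_mem_gb, container_vcores):
    # DominantResourceCalculator: the scarcer of the two dimensions
    # (memory or vcores) limits how many containers fit on the node.
    return min(node_mem_gb // container_mem_gb,
               node_vcores // container_vcores)

mem_only = containers_by_memory(52, 15)            # 3 containers by memory alone
dominant = containers_by_dominant(52, 16, 15, 14)  # 1 container once vcores count
print(mem_only, dominant)
```

Under the memory-only calculator, YARN happily packs executors by memory while reporting a single vcore per container, which is consistent with the under-used vcore count observed here.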

Can somebody help me figure out what's going on here?

Answer

What I did wrong was to add the configuration yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator under the yarn: prefix at cluster creation, instead of under the capacity-scheduler: prefix, so that it lands in capacity-scheduler.xml, where it rightly belongs.
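For reference, the capacity-scheduler: prefix in Dataproc's --properties flag writes into capacity-scheduler.xml; setting the same property by hand in that file would look roughly like this:

```xml
<!-- capacity-scheduler.xml (sketch of the property the
     capacity-scheduler: prefix sets at cluster creation) -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```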

Secondly, I changed yarn:yarn.scheduler.minimum-allocation-vcores, which was initially set to 1, to 4.
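The effect of raising yarn.scheduler.minimum-allocation-vcores is that YARN rounds small vcore requests up to that floor. A minimal sketch of the rounding rule (real YARN normalization also rounds to a multiple of the increment allocation, which is omitted here):

```python
def normalize_vcores(requested, minimum=4):
    # Sketch of YARN request normalization: a container's vcore request
    # is rounded up to at least yarn.scheduler.minimum-allocation-vcores.
    return max(requested, minimum)

print(normalize_vcores(1))   # a 1-vcore request becomes a 4-vcore container
print(normalize_vcores(14))  # 14 is already above the floor and is unchanged
```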

I'm not sure whether one of these changes or both of them led to the solution (I will update soon). My new cluster-creation command looks like this:

gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
    --zone=europe-west1-c \
    --master-boot-disk-size=500GB \
    --worker-boot-disk-size=500GB \
    --master-machine-type=n1-standard-16 \
    --num-workers=10 \
    --worker-machine-type=n1-standard-16 \
    --initialization-actions gs://custom_init_gcp.sh \
    --metadata MINICONDA_VARIANT=2 \
    --properties=^--^yarn:yarn.scheduler.minimum-allocation-vcores=4--capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator


Published: 2023-11-25 03:06:43 · https://www.elefans.com/category/jswz/34/1628067.html