Connecting IPython notebook to a Spark master running on a different machine


I don't know if this has already been answered on SO, but I couldn't find a solution to my problem.

I have an IPython notebook running in a Docker container on Google Container Engine; the container is based on the jupyter/all-spark-notebook image.

I also have a Spark cluster created with Google Cloud Dataproc.

The Spark master and the notebook are running in different VMs, but in the same region and zone.

My problem is that I'm trying to connect to the Spark master from the IPython notebook, but without success. I use this snippet of code in my Python notebook:

import pyspark

conf = pyspark.SparkConf()
conf.setMaster("spark://<spark-master-ip or spark-master-hostname>:7077")

I just started working with Spark, so I'm sure I'm missing something (authentication, security, ...).

What I have found so far only covers connecting a local browser over an SSH tunnel.

Has anybody already done this kind of setup?

Thank you in advance

Accepted answer

Dataproc runs Spark on YARN, so you need to set the master to 'yarn-client'. You also need to point Spark at your YARN ResourceManager, which requires an under-documented SparkConf -> Hadoop Configuration conversion. You also have to tell Spark about HDFS on the cluster so it can stage resources for YARN. You could use Google Cloud Storage instead of HDFS if you baked the Google Cloud Storage Connector for Hadoop into your image.

Try:

import pyspark

conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('My Jupyter Notebook')

# 'spark.hadoop.foo.bar' sets key 'foo.bar' in the Hadoop Configuration.
conf.set('spark.hadoop.yarn.resourcemanager.address', '<spark-master-hostname>')
conf.set('spark.hadoop.fs.default.name', 'hdfs://<spark-master-hostname>/')

sc = pyspark.SparkContext(conf=conf)
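If you go the Cloud Storage route mentioned above, a rough sketch (not from the original answer; the project ID and bucket are placeholders, and the Google Cloud Storage Connector for Hadoop must already be baked into the notebook image) would be to replace the fs.default.name line above before creating the SparkContext:

# Hedged sketch: requires the GCS connector jar on the notebook's classpath.
conf.set('spark.hadoop.fs.gs.impl',
         'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
conf.set('spark.hadoop.fs.gs.project.id', '<your-project-id>')
conf.set('spark.hadoop.fs.default.name', 'gs://<your-bucket>/')

Once the SparkContext is created, a quick sanity check (again just a sketch) is to run a small distributed job and confirm it completes against the cluster:

# Smoke test: should print 1000 if the context is really talking to YARN.
rdd = sc.parallelize(range(1000), numSlices=10)
print(rdd.count())
sc.stop()  # release the YARN application when you are done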

For a more permanent config, you could bake these settings into a local 'core-site.xml' file as described here, place that file in a local directory, and set HADOOP_CONF_DIR to that directory in your environment.
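If you would rather do this from the notebook itself, a minimal sketch (assuming such a core-site.xml already exists; the path below is only an example) is to set HADOOP_CONF_DIR from Python before the SparkContext is created, since the JVM that PySpark launches inherits the notebook process's environment:

import os
import pyspark

# Example path only: a directory containing your core-site.xml with
# fs.default.name and yarn.resourcemanager.address set.
os.environ['HADOOP_CONF_DIR'] = '/home/jovyan/hadoop-conf'

conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('My Jupyter Notebook')
sc = pyspark.SparkContext(conf=conf)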

It's also worth noting that while being in the same Zone is important for performance, it is being in the same Network, with TCP allowed between internal IP addresses on that network, that lets your VMs communicate. If you are using the default network, the default-allow-internal firewall rule should be sufficient.

Hope that helps.
