Accessing S3 from PySpark using IAM roles

Updated: 2024-10-27 20:39:03

Problem description

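I'm wondering whether PySpark supports S3 access using IAM roles. Specifically, I have a business constraint that requires me to assume an AWS role in order to access a given bucket. This is fine when using boto (it's part of the API), but I can't find a definitive answer as to whether PySpark supports this out of the box.

Ideally, I'd like to be able to assume a role when running locally in standalone mode and point my SparkContext at that S3 path. I've seen that non-IAM calls usually look like this: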

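from pyspark import SparkConf, SparkContext

spark_conf = SparkConf().setMaster('local[*]').setAppName('MyApp')
sc = SparkContext(conf=spark_conf)
# Non-IAM access: the key ID and secret key are embedded in the URL
rdd = sc.textFile('s3://<MY-ID>:<MY-KEY>@some-bucket/some-key')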

Does something like this exist for providing IAM info?

rdd = sc.textFile('s3://<MY-ID>:<MY-KEY>:<MY-SESSION>@some-bucket/some-key')

or

rdd = sc.textFile('s3://<ROLE-ARN>:<ROLE-SESSION-NAME>@some-bucket/some-key')

If not, what are the best practices for working with IAM credentials? Is it even possible?

I'm using Python 1.7 and PySpark 1.6.0.

Thanks!

Solution

IAM roles for S3 access are supported only by s3a, because it uses the AWS SDK.
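For the role-assumption case in the question, one possible approach is to obtain temporary credentials from STS (for example with boto) and hand them to s3a as Hadoop properties. The following is only a minimal sketch under stated assumptions: the helper name `s3a_conf_from_sts` is hypothetical, and it assumes a Hadoop build (2.8 or later) whose s3a connector ships `TemporaryAWSCredentialsProvider` and reads `fs.s3a.session.token`. Older Hadoop builds may not accept a session token at all.

```python
def s3a_conf_from_sts(credentials):
    """Map an STS-style credentials dict to s3a Hadoop properties.

    `credentials` is assumed to have the shape of the "Credentials" entry
    returned by an STS AssumeRole call (AccessKeyId, SecretAccessKey,
    SessionToken).
    """
    return {
        # Assumption: this provider class is available (Hadoop 2.8+).
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "fs.s3a.access.key": credentials["AccessKeyId"],
        "fs.s3a.secret.key": credentials["SecretAccessKey"],
        "fs.s3a.session.token": credentials["SessionToken"],
    }


# Placeholder values only; real ones would come from an AssumeRole call.
example = s3a_conf_from_sts({
    "AccessKeyId": "<MY-ID>",
    "SecretAccessKey": "<MY-KEY>",
    "SessionToken": "<MY-SESSION>",
})
for name in sorted(example):
    print(name)
```

Each name/value pair could then be applied with something like `sc._jsc.hadoopConfiguration().set(name, value)` before reading an `s3a://some-bucket/some-key` path. Note that `_jsc` is a private PySpark attribute, so treat that call as a convenience rather than a stable API.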

You need to put the hadoop-aws JAR and the aws-java-sdk JAR (along with the third-party JARs in its package) on your CLASSPATH.

hadoop-aws link.

aws-java-sdk link.
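As an alternative to copying the JARs by hand, they can usually be pulled from Maven with spark-submit's `--packages` option. The sketch below builds a `PYSPARK_SUBMIT_ARGS` value for a plain Python session. The helper name `packages_submit_args` is hypothetical and the version numbers are assumptions: hadoop-aws should match the Hadoop version your Spark build uses, and the aws-java-sdk version should be the one that hadoop-aws release was built against.

```python
import os


def packages_submit_args(hadoop_aws_version, aws_sdk_version):
    """Build a PYSPARK_SUBMIT_ARGS value that requests both JARs via --packages."""
    coords = [
        "org.apache.hadoop:hadoop-aws:" + hadoop_aws_version,
        "com.amazonaws:aws-java-sdk:" + aws_sdk_version,
    ]
    # The trailing "pyspark-shell" token is expected when this variable is set.
    return "--packages " + ",".join(coords) + " pyspark-shell"


# Example versions are assumptions; adjust them to your Hadoop build.
os.environ["PYSPARK_SUBMIT_ARGS"] = packages_submit_args("2.7.3", "1.7.4")
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

The variable has to be set before the SparkContext is created, because the JVM is launched with these arguments at that point.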

Then set the following in core-site.xml:

<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
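When running locally without a core-site.xml, the same two properties can probably be supplied through the Spark configuration instead, since Spark copies any setting prefixed with `spark.hadoop.` into the Hadoop configuration. This is a sketch of that idea rather than part of the original answer; the helper name `as_spark_conf_pairs` is hypothetical.

```python
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"

# The two properties from the core-site.xml snippet above.
HADOOP_PROPS = {
    "fs.s3.impl": S3A_IMPL,
    "fs.s3a.impl": S3A_IMPL,
}


def as_spark_conf_pairs(hadoop_props):
    """Prefix Hadoop property names with "spark.hadoop." for use in a SparkConf."""
    return [("spark.hadoop." + name, value)
            for name, value in sorted(hadoop_props.items())]


pairs = as_spark_conf_pairs(HADOOP_PROPS)
for key, value in pairs:
    print(key, "=", value)
```

The resulting list can be passed to `SparkConf().setAll(pairs)` before the SparkContext is created. Mapping `fs.s3.impl` to the s3a implementation is what lets existing `s3://` paths go through s3a.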
