I'm trying to create a workflow in which an AWS Glue ETL job pulls JSON data from an external REST API instead of S3 or another AWS-internal source. Is that even possible? Has anyone done it? Please help!
Recommended answer
Yes, I extract data from REST APIs such as Twitter, FullStory, and Elasticsearch. I usually use Python Shell jobs for the extraction because they start faster (the cold start is relatively small). When the extraction finishes, it triggers a Spark-type job that reads only the JSON items I need. I use the Python requests library for the API calls.
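As a rough illustration of the extraction step, here is a minimal sketch of pulling JSON from a paginated REST endpoint with requests. The response shape (an "items" list plus a "next" URL), the function name fetch_all_items, and the http_get parameter are assumptions made for this example and are not part of the original answer; real APIs such as Twitter paginate differently.

```python
def fetch_all_items(url, params=None, http_get=None, max_pages=100):
    """Collect JSON items from a paginated REST endpoint.

    Assumption (hypothetical API shape): each response body looks like
    {"items": [...], "next": "<url or null>"}. Adjust for the real API.
    """
    if http_get is None:
        # Imported lazily so the function can also be exercised with a stub.
        import requests
        http_get = requests.get

    items = []
    for _ in range(max_pages):  # hard cap so a bad "next" link cannot loop forever
        response = http_get(url, params=params, timeout=30)
        response.raise_for_status()  # fail the job on 4xx/5xx responses
        payload = response.json()
        items.extend(payload.get("items", []))
        url = payload.get("next")
        params = None  # assume the "next" URL already carries its query string
        if not url:
            break
    return items
```

Passing the HTTP function in as a parameter is only a convenience for trying the logic without network access; in a Python Shell job you would normally call fetch_all_items with just the URL and query parameters.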
To save the data to S3, you can do something like this:
import boto3
import json

# Initialize the S3 resource
s3 = boto3.resource('s3')

tweets = []
# Code that extracts tweets from the API goes here

# Serialize the tweets and upload them as a single JSON object
tweets_json = json.dumps(tweets)
obj = s3.Object("my-tweets", "tweets.json")
obj.put(Body=tweets_json)
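The answer says the Spark job is triggered once the extraction finishes but does not show how. One possible approach, sketched here under assumptions, is to call Glue's start_job_run from the end of the Python Shell script; a Glue trigger or workflow would be another option. The job name, the "--input_path" argument, and the helper name start_spark_job are hypothetical and chosen only for illustration.

```python
def start_spark_job(job_name, s3_path, glue_client=None):
    """Start a Glue job run and return its run ID.

    job_name and the "--input_path" argument are placeholders; the Spark
    job would need to read that argument itself.
    """
    if glue_client is None:
        # Imported lazily so the function can also be exercised with a stub.
        import boto3
        glue_client = boto3.client("glue")

    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments={"--input_path": s3_path},  # passed to the job as a job argument
    )
    return response["JobRunId"]
```

For example, start_spark_job("tweets-spark-job", "s3://my-tweets/tweets.json") would start a job with that (hypothetical) name, provided the job's IAM role is allowed to call glue:StartJobRun.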