Need some guidance as I am new to Power BI and Redshift.
My raw JSON data is stored in an Amazon S3 bucket as .gz files (each .gz file contains multiple rows of JSON data). I want to connect Power BI to the Amazon S3 bucket. Based on my research so far, I found three approaches:
Question: Is it possible to unzip the .gz files (inside the S3 bucket or inside Power BI), extract the JSON data from S3, and connect it to Power BI?
Question 1: Does Redshift allow loading gzipped JSON data from an S3 bucket? If yes, is it directly possible or do I have to write any code for it?
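(For reference on this point: Redshift's COPY command can load gzipped JSON from S3 directly via its GZIP and JSON options, so no custom code is needed beyond the SQL itself. A sketch with placeholder table, bucket, and IAM role names:)

```sql
-- Table, bucket, and role names below are placeholders.
-- JSON 'auto' maps JSON keys to matching column names; GZIP decompresses the files.
COPY my_table
FROM 's3://your_bucket/the-folder/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
JSON 'auto'
GZIP;
```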
Question 2: I have an S3 account; do I have to separately purchase Redshift account/space? What is the cost?
U-SQL recognizes GZip-compressed files with the .gz file extension and automatically decompresses them as part of the extraction process. Is this approach valid if my gzipped files contain rows of JSON data?
Please let me know if there is any other method; your suggestions on this post are also welcome.
Thanks in advance.
Recommended answer

About your first question: I recently faced a similar issue (extracting a CSV) and would like to record my solution.
Power BI still doesn't have a direct connector for downloading from S3 buckets, but you can do it with a Python script: Get Data --> Python Script.
P.S.: make sure the boto3 and pandas libraries are installed in the same folder (or subfolders) as the Python home directory you set in Power BI's options, or in the Anaconda library folder (c:\users\USERNAME\anaconda3\lib\site-packages).
[Screenshot: Power BI window for the Python Script option]
import boto3
import pandas as pd

bucket_name = 'your_bucket'
folder_name = 'the folder inside your bucket/'
file_name = r'file_name.csv'  # or .json in your case
key = folder_name + file_name

s3 = boto3.resource(
    service_name='s3',
    region_name='your_bucket_region',  # ex: 'us-east-2'
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)

obj = s3.Bucket(bucket_name).Object(key).get()
df = pd.read_csv(obj['Body'])  # or pd.read_json(obj['Body']) in your case

The dataframe will be imported as a new query (named "df" in this example).
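Since the question's files are gzipped JSON lines rather than CSV, the read step changes slightly. A minimal sketch of the decompression and parsing logic, using only the standard library (the sample bytes below stand in for `obj['Body'].read()` from the boto3 call above):

```python
import gzip
import json

def parse_gzipped_json_lines(raw_bytes):
    """Decompress .gz bytes and parse one JSON object per line."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Simulated payload standing in for obj['Body'].read() on the S3 object:
sample = gzip.compress(b'{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
rows = parse_gzipped_json_lines(sample)
print(rows)
```

In the script above you would call `parse_gzipped_json_lines(obj['Body'].read())` and then `pd.DataFrame(rows)` so Power BI picks up the result as a table.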
Apparently the pandas library can also read a zipped file (.gz, for example). See the following topic: How can I read a tar.gz file using pandas read_csv with the gzip compression option?
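To illustrate that claim, a small self-contained sketch: write a gzipped JSON-lines file locally (standing in for the S3 download) and let pandas decompress and parse it in one call. `lines=True` handles one JSON object per line; `compression="gzip"` handles the unzipping:

```python
import gzip
import os
import tempfile

import pandas as pd

# Create a small gzipped JSON-lines file standing in for the downloaded .gz:
path = os.path.join(tempfile.mkdtemp(), "rows.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write('{"id": 1, "value": 10}\n{"id": 2, "value": 20}\n')

# pandas decompresses and parses one JSON object per line in a single call:
df = pd.read_json(path, lines=True, compression="gzip")
print(df.shape)  # (2, 2)
```

pandas can also infer the compression from the .gz extension, so `compression` may be omitted when the file name is conventional.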