Need some guidance as I am new to Power BI and Redshift.
My raw JSON data is stored in an Amazon S3 bucket in the form of .gz files (each .gz file has multiple rows of JSON data). I want to connect Power BI to the Amazon S3 bucket. Based on my research so far, I found three approaches:
Question: Is it possible to unzip the .gz files (inside the S3 bucket or inside Power BI), extract the JSON data from S3, and connect it to Power BI?
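Power BI itself has no built-in .gz extraction for S3 objects, but once the object bytes are downloaded (for example via boto3), decompressing and parsing is straightforward with Python's standard library. A minimal sketch, assuming each line of the decompressed file is one JSON record (the sample payload below is illustrative):

```python
import gzip
import json

def parse_gz_json_lines(gz_bytes):
    """Decompress gzipped bytes and parse one JSON object per line."""
    text = gzip.decompress(gz_bytes).decode('utf-8')
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Simulate a .gz object body holding two JSON rows
raw = b'{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'
records = parse_gz_json_lines(gzip.compress(raw))
print(records[0]['name'])  # -> a
```

With a real bucket, `gz_bytes` would be the result of reading the S3 object's body.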
Question 1: Does Redshift allow loading gzipped JSON data from an S3 bucket? If yes, is it possible directly, or do I have to write code for it?
Question 2: I have an S3 account; do I have to purchase a Redshift account/space separately? What is the cost?
U-SQL recognizes GZip-compressed files with the file extension .gz and automatically decompresses them as part of the extraction process. Is this process valid if my gzipped files contain JSON data rows?
Please let me know if there is any other method; your valuable suggestions on this post are also welcome.

Thanks in advance.
Answer

About your first question: I recently faced a similar issue (though extracting a CSV) and would like to share my solution.
Power BI still doesn't have a direct connector to download from S3 buckets, but you can do it using a Python script: Get Data -> Python Script.
P.S.: make sure the boto3 and pandas libraries are installed in the same folder (or subfolders) as the Python home directory you set in Power BI options, or in the Anaconda library folder (c:\users\USERNAME\anaconda3\lib\site-packages).
Power BI window for the Python Script option:

import boto3
import pandas as pd

bucket_name = 'your_bucket'
folder_name = 'the folder inside your bucket/'
file_name = r'file_name.csv'  # or .json in your case
key = folder_name + file_name

s3 = boto3.resource(
    service_name='s3',
    region_name='your_bucket_region',  # ex: 'us-east-2'
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)

obj = s3.Bucket(bucket_name).Object(key).get()
df = pd.read_csv(obj['Body'])  # or pd.read_json(obj['Body']) in your case
The dataframe will be imported as a new query (named "df" in this example).
Apparently the pandas library can also read compressed files (.gz, for example). See the following topic: How can I read tar.gz file using pandas read_csv with gzip compression option?
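Concretely, the script above should need only small changes for gzipped JSON: read the S3 object's body into memory and hand it to pandas with compression='gzip' and lines=True (assuming the files are newline-delimited JSON). A sketch using an in-memory buffer in place of the real S3 body:

```python
import gzip
import io
import pandas as pd

def gz_json_to_df(gz_bytes):
    """Read gzipped, newline-delimited JSON bytes into a DataFrame.

    With boto3 you would pass obj['Body'].read() here instead of the
    simulated payload below.
    """
    return pd.read_json(io.BytesIO(gz_bytes), compression='gzip', lines=True)

# Simulated gzipped JSON-lines payload
payload = gzip.compress(b'{"id": 1}\n{"id": 2}\n')
df = gz_json_to_df(payload)
print(len(df))  # -> 2
```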