Upload a Pandas DataFrame to a GCP bucket for Dataproc

Problem description


I have been working on a Spark cluster using the Dataproc Google Cloud service for machine learning modelling. I have successfully loaded the data from the Google Storage bucket. However, I am not sure how to write a Pandas DataFrame and a Spark DataFrame to the Cloud Storage bucket as CSV.

When I use the command below, it gives me an error:

df.to_csv("gs://mybucket/")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
    formatter.save()
  File "/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 156, in save
    compression=self.compression)
  File "/opt/conda/lib/python3.6/site-packages/pandas/io/common.py", line 400, in _get_handle
    f = open(path_or_buf, mode, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: 'gs://dataproc-78f5e64b-a26d-4fe4-bcf9-e1b894db9d8f-au-southeast1/trademe_xmas.csv'
FileNotFoundError: [Errno 2] No such file or directory: 'gs://mybucket/'

However, the following command works, but I am not sure where it saves the file:

df.to_csv("data.csv")
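A relative path like this is resolved against the working directory of the Python process on the driver node, so the file lands on that machine's local disk rather than in Cloud Storage. A minimal way to confirm where it went (a sketch, run in the same session; the bucket name in the comment is a placeholder):

import os

# The CSV was written to the driver node's local filesystem,
# relative to the process's current working directory.
print(os.path.join(os.getcwd(), 'data.csv'))
print(os.path.exists('data.csv'))
# From there it could be copied up manually, e.g. with
#   gsutil cp data.csv gs://mybucket/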

I also followed the article below, "Write a Pandas DataFrame to Google Cloud Storage or BigQuery", and it gives the following error:

import google.datalab.storage as storage

ModuleNotFoundError: No module named 'google.datalab'

I am relatively new to Google Cloud Dataproc and Spark, and I was hoping someone could help me understand how I can save my output Pandas DataFrame to a gcloud bucket.

Thanks in advance!!

######## For Igor, as requested

from pyspark.ml.classification import RandomForestClassifier as RF

# Train a random forest and score the held-out test set
rf = RF(labelCol='label', featuresCol='features', numTrees=200)
fit = rf.fit(trainingData)
transformed = fit.transform(testData)

from pyspark.mllib.evaluation import BinaryClassificationMetrics as metric
results = transformed.select(['probability', 'label'])

# Decile creation for the output: split the probability vector into
# per-class columns, rank rows by P(label=1), and bucket into 10 groups
test = results.toPandas()
test['X0'] = test.probability.str[0]
test['X1'] = test.probability.str[1]
test = test.drop(columns=['probability'])
test = test.sort_values(by='X1', ascending=False)
test['rownum'] = test.reset_index().index
x = round(test['rownum'].count() / 10)
test['rank'] = (test.rownum - 1) // x + 1

Solution

The easiest approach should be to convert the Pandas DataFrame to a Spark DataFrame and write that to GCS. As the traceback above shows, Pandas' to_csv hands the path straight to the built-in open(), which does not understand gs:// URLs, whereas Spark on Dataproc comes with the GCS connector already configured.

Here are instructions on how to do this: stackoverflow.com/a/45495969/3227693
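A minimal sketch of that approach, reusing the test DataFrame from the question; it assumes the spark session provided by the Dataproc PySpark shell, and gs://mybucket is a placeholder for an existing bucket:

# Convert the Pandas DataFrame back to a Spark DataFrame; Dataproc's
# GCS connector lets Spark read and write gs:// paths directly.
spark_df = spark.createDataFrame(test)

# Spark writes a directory of part files,
# e.g. gs://mybucket/output/part-00000-*.csv
spark_df.write.csv('gs://mybucket/output', header=True, mode='overwrite')

# If a single CSV file is needed, coalesce to one partition first
# (fine for small results; all rows are funnelled through one worker):
spark_df.coalesce(1).write.csv('gs://mybucket/output_single', header=True, mode='overwrite')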
