According to the Dataproc docs, it has "native and automatic integrations with BigQuery".
I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc cluster that I've created (using a PySpark job). Then write the results of this analysis back to BigQuery. You may be asking "why not just do the analysis in BigQuery directly!?" - the reason is because we are creating complex statistical models, and SQL is too high level for developing them. We need something like Python or R, ergo Dataproc.
Are there any Dataproc + BigQuery examples available? I can't find any.
Solution

To begin, as noted in this question, the BigQuery connector is preinstalled on Cloud Dataproc clusters.
Here is an example of how to read data from BigQuery into Spark. In this example, we will read data from BigQuery to perform a word count. You read data from BigQuery in Spark using SparkContext.newAPIHadoopRDD; the Spark documentation has more information about using SparkContext.newAPIHadoopRDD.
```scala
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

val projectId = "<your-project-id>"
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema =
  "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

val conf = sc.hadoopConfiguration

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
  fullyQualifiedOutputTableId, outputTableSchema)

val fieldName = "word"

val tableData = sc.newAPIHadoopRDD(conf,
  classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
```

You will need to customize this example with your settings, including your Cloud Platform project ID in <your-project-id> and your output table ID in <your-fully-qualified-table-id>.
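The Scala example above stops after taking the first 10 records; the actual word-count aggregation is left to you. Since the question asks about Python, here is a plain-Python sketch (no Spark required) of the per-record aggregation the job would perform. The connector delivers each table row as a JSON object, so the snippet below hand-codes a few rows shaped like the publicdata:samples.shakespeare schema (fields word and word_count); the sample values are made up for illustration, and the field names are an assumption if your own table's schema differs.

```python
import json
from collections import defaultdict

# Rows as the BigQuery connector would deliver them: one JSON object per
# table row. These example rows mimic the publicdata:samples.shakespeare
# schema (word, word_count, corpus); the values here are illustrative only.
records = [
    '{"word": "the", "word_count": 25568, "corpus": "hamlet"}',
    '{"word": "the", "word_count": 24300, "corpus": "macbeth"}',
    '{"word": "brave", "word_count": 3, "corpus": "macbeth"}',
]

# Equivalent of a Spark map + reduceByKey: sum word_count per word.
counts = defaultdict(int)
for rec in records:
    row = json.loads(rec)
    counts[row["word"]] += row["word_count"]

print(sorted(counts.items()))
# [('brave', 3), ('the', 49868)]
```

In a PySpark job the same logic would be expressed as a map to (word, word_count) pairs followed by reduceByKey over the RDD returned by newAPIHadoopRDD.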
Finally, if you end up using the BigQuery connector with MapReduce, this page has examples for how to write MapReduce jobs with the BigQuery connector.