在Apache Beam中联接行

编程入门 行业动态 更新时间:2024-10-09 23:20:50
本文介绍了在Apache Beam中联接行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述

我无法理解Apache Beam中的联接(例如 www.waitingforcode/apache-beam/joins-apache-beam/read )可以连接整行.

I'm having trouble understanding if the joins in Apache Beam (e.g. www.waitingforcode/apache-beam/joins-apache-beam/read) can join entire rows.

例如:

我有2个CSV格式的数据集,其中第一行是列标题.

I have 2 datasets, in CSV format, where the first rows are column headers.

第一个:

a,b,c,d 1,2,3,4 5,6,7,8 1,2,5,4

第二个:

c,d,e,f 3,4,9,10

我想在c和d列上保留连接,以便最终得到:

I want to left join on columns c and d so that I end up with:

a,b,c,d,e,f 1,2,3,4,9,10 5,6,7,8,, 1,2,5,4,,

但是,关于Apache Beam的所有文档似乎都说,加入时PCollection对象的类型必须为KV<K, V>,因此我已将PCollection对象分解为KV<String, String>对象的集合(其中键是列标头,并且该值是行值).但是在那种情况下(您只有一个带有值的键),我看不到如何保持行格式. KV(c,7)如何知道" KV(a,5)来自同一行?是加入根本就意味着这种事情吗?

However all the documentation on Apache Beam seems to say the PCollection objects need to be of type KV<K, V> when joining, so I have broken down my PCollection objects to a collection of KV<String, String> objects (where the key is the column header, and the value is row value). But in that case (where you just have a key with a value) I don't see how the row format can be maintained. How would KV(c,7) "know" that KV(a,5) is from the same row? Is Join meant for this sort of thing at all?

到目前为止,我的代码:

My code so far:

PCollection<KV<String, String>> flightOutput = ...; PCollection<KV<String, String>> arrivalWeatherDataForJoin = ...; PCollection<KV<String, KV<String, String>>> output = Join.leftOuterJoin(flightOutput, arrivalWeatherDataForJoin, "");

推荐答案

是的,Join是实用程序类,可帮助您进行类似的联接.它是CoGropByKey的包装,请参见相应部分文档.它的实现是非常短.它的测试可能也有帮助的示例.

Yes, Join is the utility class to help with joins like yours. It is a wrapper around CoGropByKey, see the corresponding section in the docs. The implementation of it is pretty short. Its tests might also have helpful examples.

您所遇到的问题很可能是由于您选择键的方式引起的.

Problem in your case is likely caused by how you're choosing the keys.

Join库中的KeyT int KV<KeyT,V1>表示您用来匹配记录的键,它包含所有联接字段.因此,在您的情况下,您可能需要分配类似以下的键(伪代码):

The KeyT int KV<KeyT,V1> in the Join library represents the key which you are using to match the records, it contains all the join fields. So in your case you will probably need to assign keys something like this (pseudocode):

pCollection1: Key Value (3,4) (1,2,3,4) (7,8) (5,6,7,8) (5,4) (1,2,5,4) pCollection2: Key Value (3,4) (3,4,9,10)

连接的结果看起来像这样(伪代码):

And what will come of the join will look something like this (pseudocode):

joinResultPCollection: Key Value (3,4) (1,2,3,4),(3,4,9,10) (7,8) (5,6,7,8),nullValue (5,4) (1,2,5,4),nullValue

因此,您可能需要在连接后添加另一个转换,以将左侧和右侧实际合并到合并的行中.

So you will probably need to add another transform after join to actually merge the left and right side into a combined row.

因为有CSV,所以可能可以将"3,4"之类的实际字符串用作键(和值).或者,您可以使用Lists<>或自定义行类型.

Because you have a CSV, you probably could use actual strings like "3,4" as keys (and values). Or you could use Lists<> or your custom row types.

例如,这正是 Beam SQL Join实现完成.

更多推荐

在Apache Beam中联接行

本文发布于:2023-10-28 10:47:16,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1536414.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:Apache   Beam

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!