我无法理解Apache Beam中的联接(例如 www.waitingforcode/apache-beam/joins-apache-beam/read )可以连接整行.
I'm having trouble understanding if the joins in Apache Beam (e.g. www.waitingforcode/apache-beam/joins-apache-beam/read) can join entire rows.
例如:
我有2个CSV格式的数据集,其中第一行是列标题.
I have 2 datasets, in CSV format, where the first rows are column headers.
第一个:
a,b,c,d 1,2,3,4 5,6,7,8 1,2,5,4第二个:
c,d,e,f 3,4,9,10我想在c和d列上保留连接,以便最终得到:
I want to left join on columns c and d so that I end up with:
a,b,c,d,e,f 1,2,3,4,9,10 5,6,7,8,, 1,2,5,4,,但是,关于Apache Beam的所有文档似乎都说,加入时PCollection对象的类型必须为KV<K, V>,因此我已将PCollection对象分解为KV<String, String>对象的集合(其中键是列标头,并且该值是行值).但是在那种情况下(您只有一个带有值的键),我看不到如何保持行格式. KV(c,7)如何知道" KV(a,5)来自同一行?是加入根本就意味着这种事情吗?
However all the documentation on Apache Beam seems to say the PCollection objects need to be of type KV<K, V> when joining, so I have broken down my PCollection objects to a collection of KV<String, String> objects (where the key is the column header, and the value is row value). But in that case (where you just have a key with a value) I don't see how the row format can be maintained. How would KV(c,7) "know" that KV(a,5) is from the same row? Is Join meant for this sort of thing at all?
到目前为止,我的代码:
My code so far:
PCollection<KV<String, String>> flightOutput = ...; PCollection<KV<String, String>> arrivalWeatherDataForJoin = ...; PCollection<KV<String, KV<String, String>>> output = Join.leftOuterJoin(flightOutput, arrivalWeatherDataForJoin, "");推荐答案
是的,Join是实用程序类,可帮助您进行类似的联接.它是CoGropByKey的包装,请参见相应部分文档.它的实现是非常短.它的测试可能也有帮助的示例.
Yes, Join is the utility class to help with joins like yours. It is a wrapper around CoGropByKey, see the corresponding section in the docs. The implementation of it is pretty short. Its tests might also have helpful examples.
您所遇到的问题很可能是由于您选择键的方式引起的.
Problem in your case is likely caused by how you're choosing the keys.
Join库中的KeyT int KV<KeyT,V1>表示您用来匹配记录的键,它包含所有联接字段.因此,在您的情况下,您可能需要分配类似以下的键(伪代码):
The KeyT int KV<KeyT,V1> in the Join library represents the key which you are using to match the records, it contains all the join fields. So in your case you will probably need to assign keys something like this (pseudocode):
pCollection1: Key Value (3,4) (1,2,3,4) (7,8) (5,6,7,8) (5,4) (1,2,5,4) pCollection2: Key Value (3,4) (3,4,9,10)连接的结果看起来像这样(伪代码):
And what will come of the join will look something like this (pseudocode):
joinResultPCollection: Key Value (3,4) (1,2,3,4),(3,4,9,10) (7,8) (5,6,7,8),nullValue (5,4) (1,2,5,4),nullValue因此,您可能需要在连接后添加另一个转换,以将左侧和右侧实际合并到合并的行中.
So you will probably need to add another transform after join to actually merge the left and right side into a combined row.
因为有CSV,所以可能可以将"3,4"之类的实际字符串用作键(和值).或者,您可以使用Lists<>或自定义行类型.
Because you have a CSV, you probably could use actual strings like "3,4" as keys (and values). Or you could use Lists<> or your custom row types.
例如,这正是 Beam SQL Join实现完成.
更多推荐
在Apache Beam中联接行
发布评论