Apache Pig GROUP BY is not giving the expected output


I have data in CSV format with the following columns:

"first_name","last_name","company_name","address","city","county","postal","phone1","phone2","email","web"

The sample file, User.csv, contains the following data:

"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk"
"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk"
"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk"

When I load this file using PigStorage:

user = LOAD '/home/abhijit/Downloads/User.csv' USING PigStorage(',');
DUMP user;

the output is:

("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")
("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")
("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")

I want to group by city, so I wrote:

grp = GROUP user BY $4;
DUMP grp;

I get this output:

( Binney St",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")})
("8 Moor Place",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")})

The company_name and address fields cause the problem because their values can contain ',', for example "14, Taylor St" in address or "Elliott, John W Esq" in company_name.

As a result, my $4 picks up part of the address ("Taylor St") and not the city ("St. Stephens Ward").

Because of the extra delimiter inside the address or company_name values, the data is not split into fields correctly, and GROUP BY does not give the correct result.

How can I get the following GROUP BY output?

("Abbey Ward",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
("East Southbourne and Tuckton W",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")})

grp = group a by $5 ;

Grouping by $5 instead is not a solution for me. I have already considered it.

Best answer

The problem is that PigStorage does not take quoting or escaping into account: it splits on every comma, so each value that contains a comma is broken into extra columns and every later field shifts position.

Using CSVExcelStorage will solve this, because that loader handles quoted fields and therefore produces the right number and order of columns.
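A minimal sketch of what that could look like, assuming the Piggybank jar is available to your Pig installation. The jar path below is a placeholder, not something from the question; the input path is the one the question already uses. CSVExcelStorage is part of Piggybank, so it has to be registered before use.

```pig
-- Assumption: adjust this placeholder path to wherever piggybank.jar lives on your system.
REGISTER '/path/to/piggybank.jar';

-- Optional alias so the fully qualified class name is written only once.
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();

-- Quoted values such as "14, Taylor St" are kept as a single field,
-- so the city should stay at position $4 for every row.
user = LOAD '/home/abhijit/Downloads/User.csv' USING CSVExcelStorage();

grp = GROUP user BY $4;
DUMP grp;
```

With the fields parsed this way, grouping by $4 should key on the city values ("Abbey Ward", "St. Stephens Ward", "East Southbourne and Tuckton W"). Note that this loader typically strips the enclosing double quotes, so the dumped values may appear without them.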
