Using regular expressions in Hadoop

Updated: 2024-10-10 04:19:28
Problem description


I have a CSV file of user tweets with the fields (tweetid, tweets, userid).

396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,""@BleacherReport: Halloween has given us this amazing Derrick Rose photo (via @amandakaschube, @ScottStrazzante) t.co/tM0wEugZR1" yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03

Now I need to write a Pig query that returns all the tweets that include the word 'favorite', ordered by tweet id.

For this I have the following code:

A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,":-](.*)[",:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;

How does the regular expression work here to split each line into the desired fields?
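One way to see how the pattern splits a line is to run it outside Pig. The sketch below uses Python's `re` module as a stand-in for the Java regex engine that Pig uses; for these simple patterns the two should behave the same, but that equivalence, and the idea that REGEX_EXTRACT_ALL needs the whole line to match, are assumptions here. `ANCHORED_PATTERN` and `split_line` are hypothetical names introduced for illustration, not part of the question.

```python
import re

# Sample row copied from the question's data.
LINE = '396124436845178880,"When\'s 12.4k gonna roll around",Matty_T_03'

# Pattern from the question. Every (.*) is greedy, so the first group grabs
# as much as it can while still leaving two delimiter characters for the rest.
QUESTION_PATTERN = re.compile(r'(.*)[,":-](.*)[",:-](.*)')

# Hypothetical tighter pattern (an assumption, not from the question):
# digits, then a double-quoted tweet, then a user id without commas.
ANCHORED_PATTERN = re.compile(r'(\d+),"(.*)",([^,]*)')


def split_line(pattern, line):
    """Return the captured groups for a whole-line match, or None."""
    match = pattern.fullmatch(line)
    return match.groups() if match else None


if __name__ == "__main__":
    print(split_line(QUESTION_PATTERN, LINE))
    print(split_line(ANCHORED_PATTERN, LINE))
```

With the question's pattern, the first group ends up holding the id together with the tweet text, the second group is empty (it sits between the closing quote and the comma that follows it), and only the third group holds the user id. The anchored variant yields the three fields separately, provided the id is all digits, the tweet is wrapped in double quotes, and the user id contains no comma.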

I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL, as I find it much simpler, but I couldn't get the code working except for extracting just the tweets:

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,":-](.*)[",:-]',1)) AS (msg:chararray);

The above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet_id, I do not get the desired output:

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,":-]',1)) AS (tweetid:long);

(396124554353197056,"Just saw @samantha0wen and @DakotaFears at the drake concert #waddup")
(396124554172432384,"@Yutika_Diwadkar I'm just so bright
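The output above is consistent with a greedy `(.*)` running up to the last delimiter on the line, not the first. A minimal sketch of that effect follows, again in Python as a stand-in for Pig's Java regex engine, and assuming REGEX_EXTRACT searches for the first match anywhere in the line. The `extract` helper and the `LEADING_DIGITS` pattern are hypothetical and only illustrate one possible tightening.

```python
import re

# Sample row copied from the question's data.
LINE = '396124436845178880,"When\'s 12.4k gonna roll around",Matty_T_03'

# Pattern the question used for the tweet id: the greedy (.*) runs up to
# the LAST character from the class [,":-] that appears on the line.
GREEDY_ID = r'(.*)[,":-]'

# Hypothetical alternative (an assumption): anchor at the start of the line
# and capture only the leading digits before the first comma.
LEADING_DIGITS = r'^(\d+),'


def extract(pattern, line, index=1):
    """Find-style extraction: first match anywhere, return one group."""
    match = re.search(pattern, line)
    return match.group(index) if match else None


if __name__ == "__main__":
    print(extract(GREEDY_ID, LINE))       # id plus the quoted tweet
    print(extract(LEADING_DIGITS, LINE))  # id only
```

Because the user id contains none of the delimiter characters, the last delimiter is the comma in front of it, so the greedy group swallows both the id and the quoted tweet. Anchoring on the leading digits leaves a value that can be cast to a long.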

Published: 2023-10-17 09:29:00