Using regular expressions in Hadoop

Updated: 2024-10-10 04:19:28

Problem description

I have a CSV file of tweets with the columns (tweetid, tweets, userid):

396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,""@BleacherReport: Halloween has given us this amazing Derrick Rose photo (via @amandakaschube, @ScottStrazzante) t.co/tM0wEugZR1" yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03

Now I need to write a Pig query that returns all the tweets that include the word 'favorite', ordered by tweet id.

For this I have the following code:

A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,":-](.*)[",:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;

How does the regular expression split each line into the desired fields here?
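One way to see what the pattern does is to run it outside Pig. The sketch below uses Python's `re` module as a stand-in for the Java regular expressions that Pig relies on; the two engines treat these particular constructs the same way. It assumes that REGEX_EXTRACT_ALL needs the pattern to match the whole line, which `fullmatch` imitates. The sample line and the second pattern are hypothetical and are not taken from the question.

```python
import re

# Hypothetical, shortened line in the same shape as the sample data.
line = '1,"I love my favorite team, always",user_a'

# The pattern from the question. Every (.*) is greedy, so the first group
# grabs as much of the line as it can while the rest still matches.
question_pattern = re.compile(r'(.*)[,":-](.*)[",:-](.*)')

# fullmatch imitates the assumed whole-line matching of REGEX_EXTRACT_ALL.
greedy_groups = question_pattern.fullmatch(line).groups()
print(greedy_groups)

# Illustrative alternative (an assumption, not the asker's code):
# spell out each field so the delimiters cannot be absorbed by a greedy group.
explicit_pattern = re.compile(r'(\d+),"(.*)",([^,]*)')
explicit_groups = explicit_pattern.fullmatch(line).groups()
print(explicit_groups)
```

On this line the question's pattern puts the id and the message together in the first group and leaves the second group empty, because the first `(.*)` only gives back enough characters for the two delimiter classes to match. The explicit pattern yields the three fields separately. In a Pig script the backslash in `\d` would likely need extra escaping inside the string literal.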

I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL because I find it much simpler, but I couldn't get the code working except for extracting just the tweets:

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,":-](.*)[",:-]',1)) AS (msg:chararray);

The above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet_id, I do not get the desired output:

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,":-]',1)) AS (tweetid:long);

(396124554353197056,"Just saw @samantha0wen and @DakotaFears at the drake concert #waddup")
(396124554172432384,"@Yutika_Diwadkar I'm just so bright
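The output above, where the id and the tweet come back together, is consistent with a greedy `(.*)` in an unanchored search. The sketch below reproduces that behaviour with Python's `re.search`, used here as an assumed stand-in for how REGEX_EXTRACT applied the pattern. The sample line and the `first_field` pattern are hypothetical.

```python
import re

# Hypothetical, shortened line in the same shape as the sample data.
line = '1,"I love my favorite team, always",user_a'

# Tweet-id pattern from the question: the greedy (.*) runs up to the LAST
# character in the line that belongs to the class [,":-].
greedy_id = re.search(r'(.*)[,":-]', line).group(1)
print(greedy_id)

# Message pattern from the question: starts after the FIRST delimiter and
# runs up to the LAST one, so the surrounding quotes stay in the result.
greedy_msg = re.search(r'[,":-](.*)[",:-]', line).group(1)
print(greedy_msg)

# Illustrative alternative (an assumption): anchor at the start of the line
# and stop at the first comma so that only the id is captured.
first_field = re.search(r'^([^,]*),', line).group(1)
print(first_field)
```

Here `greedy_id` contains both the id and the quoted message, which is why a cast to long would not give a usable tweet id, while `first_field` contains only the id.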

Published: 2023-10-17 09:29:00

Article link: https://www.elefans.com/category/jswz/34/1500472.html
