Counting word frequency in a text variable with Hive

Updated: 2024-10-28 07:22:48

Problem description

I have a variable in which every row is a sentence. Example:

- Row1: "Hey, how are you?"
- Row2: "Hey, Who is there?"

I want the output to be a count grouped by word.

Example:

Hey 2
How 1
are 1
...

I am using the split function, but I am a bit stuck. Any thoughts on this?

Thanks!

Recommended answer

This is possible in Hive. Split on non-alphabetic characters, use lateral view + explode to get one row per word, then count the words:

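with your_data as (
  -- two sample rows, one sentence each
  select stack(2,
               'Hey, how are you?',
               'Hey, Who is there?'
              ) as initial_string
)
select w.word, count(*) cnt
from (
  -- lowercase, then split on runs of non-alphabetic characters
  select split(lower(initial_string), '[^a-zA-Z]+') words
  from your_data
) s
lateral view explode(words) w as word  -- one row per array element
where w.word != ''                     -- drop empty tokens
group by w.word;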

Result:

word   cnt
are    1
hey    2
how    1
is     1
there  1
who    1
you    1
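To see what each step of the query above contributes, the same pipeline can be mimicked outside Hive. The following is a minimal Python sketch, not Hive code: `word_counts` is a hypothetical helper, `re.split` stands in for Hive's `split`, the generator expression stands in for `lateral view explode` plus the `where` filter, and `Counter` stands in for `group by` with `count(*)`. It also illustrates where empty tokens can come from: a row that ends in punctuation leaves an empty string after the last delimiter, which is presumably why the query filters on `w.word!=''`.

```python
import re
from collections import Counter

def word_counts(rows):
    """Hypothetical helper mirroring the split / explode / group-by steps."""
    counts = Counter()
    for row in rows:
        # Analogue of split(lower(initial_string), '[^a-zA-Z]+')
        words = re.split(r'[^a-zA-Z]+', row.lower())
        # Analogue of lateral view explode(words) ... where word != ''
        counts.update(w for w in words if w != '')
    return dict(counts)

rows = ['Hey, how are you?', 'Hey, Who is there?']

# The trailing '?' produces an empty last token.
print(re.split(r'[^a-zA-Z]+', rows[0].lower()))

# Word/count pairs, sorted by word for display.
print(sorted(word_counts(rows).items()))
```

Sorting the pairs by word gives the same seven rows as the Hive result listed above.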

Another method uses the sentences function, which returns an array of tokenized sentences (an array of arrays of words):

with your_data as (
  -- two sample rows, one sentence each
  select stack(2,
               'Hey, how are you?',
               'Hey, Who is there?'
              ) as initial_string
)
select w.word, count(*) cnt
from (
  -- array of sentences, each an array of words
  select sentences(lower(initial_string)) sentences
  from your_data
) d
lateral view explode(sentences) s as sentence   -- one row per sentence
lateral view explode(s.sentence) w as word      -- one row per word
group by w.word;

Result:

word   cnt
are    1
hey    2
how    1
is     1
there  1
who    1
you    1

The sentences(string str, string lang, string locale) function tokenizes a string of natural-language text into words and sentences. Each sentence is broken at the appropriate sentence boundary and returned as an array of words. The lang and locale arguments are optional. For example, sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ).
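The nested return value explains why the second query needs two lateral view explode steps: the first yields one row per sentence, the second one row per word. A minimal Python sketch of that flattening follows. It does not call Hive; it hard-codes the return value quoted in the example above, and the variable names are illustrative only.

```python
from collections import Counter

# Return value quoted above for sentences('Hello there! How are you?'),
# written as a Python list of lists.
nested = [["Hello", "there"], ["How", "are", "you"]]

# Outer loop ~ first explode (one row per sentence);
# inner loop ~ second explode (one row per word).
words = [word for sentence in nested for word in sentence]

# Analogue of group by word with count(*).
counts = Counter(words)

print(words)
print(dict(counts))
```

With a single explode the rows would still hold whole word arrays, so grouping by word would not be possible at that stage.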

Published: 2023-11-12 02:52:20
Article link: https://www.elefans.com/category/jswz/34/1580296.html