Maximum occurrence of any set of words in text in R


Given a set of lines, I have to find the words that occur most often. A match need not be a single word; it can also be a sequence of words.

Say I have a text like this:

string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"

I want the output to be:

john beck - 3
chemical engineer - 2

Is there a function or package that does this?

Best answer

Try this:

string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"

library(tau)
library(tm)

tokens <- MC_tokenizer(string)
tokens <- tokens[tokens != ""]
string_ <- paste(stemCompletion(stemDocument(tokens), tokens), collapse = " ")

## if you want only bi-grams:
tab <- sort(textcnt(string_, method = "string", n = 2), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
#                   Freq
# john beck            3
# chemical engineer    2

## if you want uni-, bi- and tri-grams:
nmin <- 1; nmax <- 3
tab <- sort(do.call(c, lapply(nmin:nmax, function(x) textcnt(string_, method = "string", n = x))), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
#                   Freq
# beck                 3
# john                 3
# john beck            3
# chemical             2
# engineer             2
# is                   2
# chemical engineer    2
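
If installing tau and tm is not an option, the same counting idea can be sketched in base R. This is only a rough illustration, not the method used above: count_ngrams is a hypothetical helper name, the "stemming" is an assumed crude rule that strips a possessive 's and a final "s" from words of four or more letters, and n-grams are not allowed to cross sentence boundaries (the tau version above does not make that distinction).

```r
# Hypothetical helper, not part of tau or tm: count word n-grams in base R.
count_ngrams <- function(text, n = 2) {
  # Split into sentences so that n-grams never span a sentence boundary.
  sentences <- unlist(strsplit(tolower(text), "[.!?]+"))
  grams <- lapply(sentences, function(s) {
    # Drop possessive 's, then keep runs of letters as words.
    s <- gsub("'s\\b", "", s, perl = TRUE)
    words <- regmatches(s, gregexpr("[a-z]+", s))[[1]]
    # Crude plural stripping (assumption): remove a final "s" from words
    # of four or more letters, e.g. "engineers" -> "engineer".
    words <- sub("(?<=[a-z]{3})s$", "", words, perl = TRUE)
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  # Tabulate the n-grams and return a plain named vector, most frequent first.
  counts <- table(as.character(unlist(grams)))
  freq <- setNames(as.vector(counts), names(counts))
  sort(freq, decreasing = TRUE)
}

string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
freq <- count_ngrams(string, n = 2)
freq[freq > 1]
# john beck: 3, chemical engineer: 2
```

The plural rule is deliberately naive and will mangle words such as "class" or "this", so for real text a proper stemmer like the stemDocument call above is the safer choice.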

Published: 2023-08-03 14:31:00
Link: https://www.elefans.com/category/jswz/34/1390846.html