Maximum occurrence of any set of words in text in R

Given a set of lines, I have to find the most frequently occurring words (not necessarily single words; sets of words count too).

Say I have a text like this:

string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"

I want the output to be:

john beck - 3
chemical engineer - 2

Is there a function or package that does this?

Best answer

Try this:

string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"

library(tau)  # textcnt()
library(tm)   # MC_tokenizer(), stemDocument(), stemCompletion(); stemming needs SnowballC installed

# Tokenize and drop empty tokens
tokens <- MC_tokenizer(string)
tokens <- tokens[tokens != ""]

# Stem each token, then complete the stems back to words that occur in the text
string_ <- paste(stemCompletion(stemDocument(tokens), tokens), collapse = " ")

## if you want only bi-grams:
tab <- sort(textcnt(string_, method = "string", n = 2), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
#                   Freq
# john beck            3
# chemical engineer    2

## if you want uni-, bi- and tri-grams:
nmin <- 1; nmax <- 3
tab <- sort(do.call(c, lapply(nmin:nmax, function(x) textcnt(string_, method = "string", n = x))), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
#                   Freq
# beck                 3
# john                 3
# john beck            3
# chemical             2
# engineer             2
# is                   2
# chemical engineer    2
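
If installing tau and tm is not an option, the counting step itself can be sketched in base R. This is only a rough illustration, not what textcnt does internally: count_ngrams is a hypothetical helper name, it splits on anything that is not a letter, and it does no stemming, so "chemical engineer" and "chemical engineers" are counted as different bi-grams.

```r
# Hypothetical helper (not part of tau or tm): count word n-grams in base R.
# Assumption: words are runs of ASCII letters; no stemming is applied.
count_ngrams <- function(text, n = 2) {
  # Lower-case, then split on every run of non-letter characters
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[words != ""]
  # Not enough words to form a single n-gram
  if (length(words) < n) return(integer(0))
  # Build one n-gram per starting position with a sliding window
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  # Tabulate and return a named integer vector, most frequent first
  counts <- table(grams)
  sort(setNames(as.integer(counts), names(counts)), decreasing = TRUE)
}

string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
bigrams <- count_ngrams(string, n = 2)
bigrams[bigrams > 1]
```

On the example string, "john beck" is the only bi-gram this sketch finds more than once. Merging "engineer" and "engineers" still requires a stemming step such as the stemDocument call in the answer above.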
