Maximum occurrence of any set of words in text in R
Given a set of lines, I have to find the words that occur most often (not necessarily single words; they can also be sets of words).

Say I have a text like:

    string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"

I want the output to be:

    john beck - 3
    chemical engineer - 2

Is there any function or package that does this?
Best answer
Try this:

    string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"

    library(tau)  # textcnt()
    library(tm)   # MC_tokenizer(), stemDocument(), stemCompletion()
    # Note: stemDocument() needs the SnowballC package to be installed.

    # Split into word tokens and drop the empty ones
    tokens <- MC_tokenizer(string)
    tokens <- tokens[tokens != ""]

    # Stem each token, then complete the stems back to words from the text,
    # so that "engineers" is counted together with "engineer"
    string_ <- paste(stemCompletion(stemDocument(tokens), tokens), collapse = " ")

    ## if you want only bi-grams:
    tab <- sort(textcnt(string_, method = "string", n = 2), decreasing = TRUE)
    data.frame(Freq = tab[tab > 1])
    #                   Freq
    # john beck            3
    # chemical engineer    2

    ## if you want uni-, bi- and tri-grams:
    nmin <- 1; nmax <- 3
    tab <- sort(
      do.call(c, lapply(nmin:nmax, function(x) {
        textcnt(string_, method = "string", n = x)
      })),
      decreasing = TRUE
    )
    data.frame(Freq = tab[tab > 1])
    #                   Freq
    # beck                 3
    # john                 3
    # john beck            3
    # chemical             2
    # engineer             2
    # is                   2
    # chemical engineer    2
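If you would rather not depend on tau and tm, the same counting idea can be sketched in base R. This is only an illustration and not how textcnt() works internally: the function name count_ngrams and its crude normalisation (dropping a possessive 's and a trailing plural s) are assumptions made for this example, standing in for real stemming. Like the answer above, it counts word pairs across sentence boundaries.

```r
# Count n-grams (runs of n consecutive words) in one string, base R only.
# count_ngrams is a hypothetical helper written for this example.
count_ngrams <- function(text, n = 2) {
  # Lowercase, then split on anything that is not a letter or an apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[words != ""]

  # Crude normalisation (an assumption, not real stemming):
  # drop a possessive 's, then a plural s on words longer than 3 letters
  words <- sub("'s$", "", words)
  words <- ifelse(nchar(words) > 3, sub("s$", "", words), words)

  if (length(words) < n) return(integer(0))

  # Build every window of n consecutive words
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )

  # Tabulate and return a named integer vector, most frequent first
  tab <- table(grams)
  freq <- as.vector(tab)
  names(freq) <- names(tab)
  sort(freq, decreasing = TRUE)
}

text_in <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"

bigrams <- count_ngrams(text_in, n = 2)
repeated <- bigrams[bigrams > 1]
print(repeated)
# john beck: 3, chemical engineer: 2
```

The suffix stripping here is far too blunt for general text (it would turn "class" into "clas", for example), so for anything beyond a toy input the stemDocument() and stemCompletion() route above is the safer choice.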