Is there a way to use R to extract words from a string that has no spaces or other delimiters? I have a list of URLs and I am trying to figure out what words are contained in them.

input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")

Desired output:

babybag - baby bag
badshelter - bad shelter
themoderncornerstore - the modern corner store
hamptonfamilyguidebook - hampton family guide book

Accepted answer
Here is a naive approach that might give you some inspiration. I used the hunspell package, but you could test the substrings against any dictionary.

I start from the right, try every substring, and keep the longest one I can find in the dictionary, then move my starting position. It is quite slow, so I hope you don't have 4 million of these. "hampton" is not in this dictionary, so the last example does not come out right:
split_words <- function(x){
  candidate <- x
  words <- NULL
  j <- nchar(x)
  while(j != 0){
    word <- NULL
    for (i in j:1){
      candidate <- substr(x, i, j)
      if(!length(hunspell::hunspell_find(candidate)[[1]])) word <- candidate
    }
    if(is.null(word)) return("")
    words <- c(word, words)
    j <- j - nchar(word)
  }
  words
}

input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")
lapply(input, split_words)
# [[1]]
# [1] "baby" "bag"
#
# [[2]]
# [1] "bad"     "shelter"
#
# [[3]]
# [1] "the"    "modern" "corner" "store"
#
# [[4]]
# [1] "h"         "amp"       "ton"       "family"    "guidebook"

Here's a quick fix, adding words manually to the dictionary:
split_words <- function(x, additional = c("hampton", "otherwordstoadd")){
  candidate <- x
  words <- NULL
  j <- nchar(x)
  while(j != 0){
    word <- NULL
    for (i in j:1){
      candidate <- substr(x, i, j)
      if(!length(hunspell::hunspell_find(candidate, ignore = additional)[[1]])) word <- candidate
    }
    if(is.null(word)) return("")
    words <- c(word, words)
    j <- j - nchar(word)
  }
  words
}

input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")
lapply(input, split_words)
# [[1]]
# [1] "baby" "bag"
#
# [[2]]
# [1] "bad"     "shelter"
#
# [[3]]
# [1] "the"    "modern" "corner" "store"
#
# [[4]]
# [1] "hampton"   "family"    "guidebook"

You can only cross your fingers that there are no ambiguous splits, though. Note that "guidebook" comes out as a single word in my output, so we already have an edge case in your four examples.
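The greedy idea above is language-agnostic: scan from the right, take the longest dictionary word ending at the current position, then repeat on the remaining prefix. Here is a minimal sketch of the same algorithm in Python, using a small hard-coded word set as a stand-in for the hunspell dictionary (the word set is an assumption for illustration, not a real dictionary):

```python
# Toy stand-in for a real spelling dictionary (assumption for illustration).
DICTIONARY = {"baby", "bag", "bad", "shelter", "the", "modern", "corner",
              "store", "hampton", "family", "guide", "book", "guidebook"}

def split_words(x, dictionary=DICTIONARY):
    """Greedily split x into dictionary words, matching longest-first
    from the right, mirroring the R function above."""
    words = []
    j = len(x)
    while j > 0:
        word = None
        # Try every start position i; since i decreases, each valid hit
        # overwrites the previous one, leaving the longest match ending at j.
        for i in range(j - 1, -1, -1):
            candidate = x[i:j]
            if candidate in dictionary:
                word = candidate
        if word is None:
            return []  # no valid split found
        words.insert(0, word)
        j -= len(word)
    return words

print(split_words("babybag"))                 # ['baby', 'bag']
print(split_words("hamptonfamilyguidebook"))  # ['hampton', 'family', 'guidebook']
```

Note the same ambiguity the answer warns about: because the match is greedy and longest-first, "guidebook" wins over "guide" + "book" whenever both are in the dictionary.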