Extract words from strings without spaces or delimiters using R

Updated: 2024-10-28 11:28:00
Is there a way to use R to extract words from strings that have no spaces or other delimiters? I have a list of URLs and I am trying to figure out which words they contain.

input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")

Expected result:

babybag - baby bag
badshelter - bad shelter
themoderncornerstore - the modern corner store
hamptonfamilyguidebook - hampton family guide book

Best answer

Here is a naive approach that might give you some inspiration. I used the hunspell library, but you could test substrings against any dictionary.

I start from the right end of the string, try every substring that ends there, and keep the longest one found in the dictionary. Then I move the end position to just before that word and repeat. It is quite slow, so I hope you don't have 4 million of those. "hampton" is not in this dictionary, so the last input does not give the right result:

split_words <- function(x){
  candidate <- x
  words <- NULL
  j <- nchar(x)                      # current end position, moving right to left
  while(j != 0){
    word <- NULL
    # try every substring ending at j, shortest first;
    # the last match kept is therefore the longest one
    for (i in j:1){
      candidate <- substr(x, i, j)
      # hunspell_find returns the misspelled words, so an empty result means "known word"
      if(!length(hunspell::hunspell_find(candidate)[[1]])) word <- candidate
    }
    if(is.null(word)) return("")     # nothing in the dictionary ends at j
    words <- c(word, words)
    j <- j - nchar(word)
  }
  words
}

input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")
lapply(input, split_words)
# [[1]]
# [1] "baby" "bag"
#
# [[2]]
# [1] "bad" "shelter"
#
# [[3]]
# [1] "the" "modern" "corner" "store"
#
# [[4]]
# [1] "h" "amp" "ton" "family" "guidebook"
#

Here's a quick fix that adds words to the dictionary manually, through the ignore argument:

split_words <- function(x, additional = c("hampton","otherwordstoadd")){
  candidate <- x
  words <- NULL
  j <- nchar(x)                      # current end position, moving right to left
  while(j != 0){
    word <- NULL
    # try every substring ending at j, shortest first;
    # the last match kept is therefore the longest one
    for (i in j:1){
      candidate <- substr(x, i, j)
      # words listed in `additional` are passed to hunspell as extra accepted words
      if(!length(hunspell::hunspell_find(candidate, ignore = additional)[[1]])) word <- candidate
    }
    if(is.null(word)) return("")     # nothing in the dictionary ends at j
    words <- c(word, words)
    j <- j - nchar(word)
  }
  words
}

input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")
lapply(input, split_words)
# [[1]]
# [1] "baby" "bag"
#
# [[2]]
# [1] "bad" "shelter"
#
# [[3]]
# [1] "the" "modern" "corner" "store"
#
# [[4]]
# [1] "hampton" "family" "guidebook"
#
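
Since the substrings can be tested against any dictionary, here is a minimal sketch of the same right-to-left, longest-match search using a plain character vector instead of hunspell. The names split_words_dict and toy_dict are made up for illustration, and the tiny word list is an assumption; a real word list would take its place.

```r
# Hypothetical helper (not part of hunspell): same greedy search as split_words,
# but a "known word" is simply a member of the character vector `dict`.
split_words_dict <- function(x, dict) {
  words <- character(0)
  j <- nchar(x)
  while (j > 0) {
    # every substring of x that ends at position j, shortest first
    candidates <- substring(x, j:1, j)
    hits <- candidates[candidates %in% dict]
    if (length(hits) == 0) return("")   # no dictionary word ends at j
    word <- hits[length(hits)]          # the last hit is the longest one
    words <- c(word, words)
    j <- j - nchar(word)
  }
  words
}

# Toy dictionary, assumed for this example only
toy_dict <- c("baby", "bag", "bad", "shelter", "the", "he",
              "modern", "corner", "store")

split_words_dict("babybag", toy_dict)
split_words_dict("themoderncornerstore", toy_dict)
```

Like the original function, this returns "" as soon as no dictionary word ends at the current position, so a string can fail even when a different, non-greedy split would have worked.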

You just have to cross your fingers that none of the inputs is ambiguous, though. Note that "guidebook" comes out as a single word in my output, whereas you expected "guide book", so there is already an edge case among your four examples.
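
To see how much ambiguity a given string carries, one option is to list every possible split instead of keeping only the greedy one. The following is a rough sketch under assumptions: all_splits and toy_dict are hypothetical names, and the dictionary is again a plain character vector rather than hunspell.

```r
# Hypothetical helper: return every way to cut `x` into words found in `dict`.
all_splits <- function(x, dict) {
  # the empty string has exactly one split: no words at all
  if (nchar(x) == 0) return(list(character(0)))
  out <- list()
  for (k in seq_len(nchar(x))) {
    first <- substr(x, 1, k)
    if (first %in% dict) {
      remainder <- substr(x, k + 1, nchar(x))
      # combine this first word with every split of what is left
      for (rest in all_splits(remainder, dict)) {
        out[[length(out) + 1]] <- c(first, rest)
      }
    }
  }
  out
}

# Toy dictionary, assumed for this example only
toy_dict <- c("guide", "book", "guidebook")

all_splits("guidebook", toy_dict)
```

With this toy dictionary the call returns two candidates, "guide" "book" and "guidebook", which is exactly the edge case mentioned above. The number of splits can grow very quickly with long strings and large dictionaries, so this is better suited to inspecting suspicious cases than to processing a whole URL list.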

Published: 2023-08-04 02:44:00
Link: https://www.elefans.com/category/jswz/34/1406755.html