Split a string and return the unique values [closed]

I have a character vector of strings like this:

D<-c("0,0,0,0,0,0,0", "0,0,0,0,0,0,0,", "0,20,0,0,0,30,0", "0,60,61,70,0,0,","0,1,1,0,0,0,0,")

I'd like to end up with a condensed version of this, with only the unique values for each string.

D2<-c("0","0","0,20,30","0,60,61,70","0,1")

I've tried looping over the strings with a combination of strsplit and unique, but I end up with a bunch of NAs.

Best answer

This question has already attracted three answers but is about to be closed. The best solution, in my opinion, was provided by thelatemail in a comment and would otherwise be lost:

sapply(strsplit(D, ","), function(x) paste(unique(x), collapse = ","))
#[1] "0"          "0"          "0,20,30"    "0,60,61,70" "0,1"
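The one-liner does three things per string: split on commas, drop duplicates, and rejoin. As a minimal sketch that spells those steps out, the version below wraps them in a helper. The name `unique_fields` and the use of `vapply` and `fixed = TRUE` are my own assumptions for illustration, not part of thelatemail's comment.

```r
D <- c("0,0,0,0,0,0,0", "0,0,0,0,0,0,0,", "0,20,0,0,0,30,0",
       "0,60,61,70,0,0,", "0,1,1,0,0,0,0,")

# Hypothetical helper (not from the original answer).
unique_fields <- function(x, sep = ",") {
  # Step 1: split each string; the result is a list with one
  # character vector per input string. fixed = TRUE treats sep literally.
  parts <- strsplit(x, sep, fixed = TRUE)
  # Steps 2 and 3: keep the first occurrence of each value, then rejoin.
  # vapply() guarantees exactly one string per element.
  vapply(parts,
         function(p) paste(unique(p), collapse = sep),
         character(1))
}

D2 <- unique_fields(D)
print(D2)
```

Unlike `sapply`, `vapply` fails loudly if any element does not collapse to a single string, which can make it easier to spot the kind of unexpected values mentioned in the question.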

Data

As given by the OP:

D <- c("0,0,0,0,0,0,0", "0,0,0,0,0,0,0,", "0,20,0,0,0,30,0", "0,60,61,70,0,0,", "0,1,1,0,0,0,0,")

Benchmark

A small benchmark

library(stringr)  # for str_split()
microbenchmark::microbenchmark(
  thelatemail = sapply(strsplit(D, ","), function(x) paste(unique(x), collapse = ",")),
  epi99 = D %>% sapply(str_split, ",") %>% sapply(unique) %>% sapply(paste, collapse = ","),
  trungnt37 = {
    out <- c()
    for (i in 1:length(D)) {
      k <- strsplit(x = D[i], split = ",")
      m <- paste(unique(unlist(k)), collapse = ",")
      out <- c(out, m)
    }
    out
  }
)

shows that thelatemail's answer is the fastest:

#Unit: microseconds
#        expr     min       lq      mean   median      uq     max neval
# thelatemail  57.770  61.9240  72.63590  67.9655  75.705 151.789   100
#       epi99 318.679 338.5020 383.76284 362.6670 410.054 781.972   100
#   trungnt37  74.384  81.3695  96.77465  87.7885 102.702 240.897   100

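Note that epi99's stringr approach doesn't return the expected result: str_split keeps the empty string that follows a trailing comma in the input (base strsplit drops it), so the output has trailing commas, e.g. "0," instead of "0".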
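The trailing-comma effect can be reproduced in base R alone. The sketch below writes out by hand the fields that a splitter keeping trailing empty strings would return (so no extra package is needed), then shows one possible fix: dropping empty fields with `nzchar` before deduplicating. The variable names are hypothetical, and the filtering step is my own suggestion rather than something from the original answers.

```r
# Base strsplit() does not produce a trailing empty field
# when the input ends with the separator.
base_parts <- strsplit("0,1,1,0,0,0,0,", ",")[[1]]
length(base_parts)  # 7 fields, no trailing ""

# Fields as returned by a splitter that keeps the trailing empty string
# (written out by hand for illustration).
parts_with_empty <- c("0", "1", "1", "0", "0", "0", "0", "")

# unique() treats "" as a value of its own, so the rejoined string
# ends with a comma.
with_empty <- paste(unique(parts_with_empty), collapse = ",")

# Possible fix: drop empty fields before deduplicating.
cleaned <- parts_with_empty[nzchar(parts_with_empty)]
result <- paste(unique(cleaned), collapse = ",")

print(with_empty)
print(result)
```

Filtering with `nzchar` also removes empty fields in the middle of a string (as in "0,,1"), which may or may not be wanted depending on the data.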