I have a data frame with two columns containing character strings. In one column (named probes) I have duplicated cases (that is, several cases with the same character string). For each case in probes I want to find all the cases containing the same string, and then merge the values of all the corresponding cases in the second column (named genes) into a single case. For example, if I have this structure:

          probes genes
    1 cg00050873 TSPY4
    2 cg00061679  DAZ1
    3 cg00061679  DAZ4
    4 cg00061679  DAZ4

I want to change it into this structure:

          probes          genes
    1 cg00050873          TSPY4
    2 cg00061679 DAZ1 DAZ4 DAZ4

Obviously there is no problem doing this for a single probe using which, followed by paste with collapse:

    ind <- which(olap$probes == "cg00061679")
    genename <- olap[ind, 2]
    genecomb <- paste(genename[1:length(genename)], collapse = " ")

But I'm not sure how to extract the indices of the duplicates in the probes column across the whole data frame. Any ideas?
Thanks in advance.
Recommended answer

You can use tapply in base R:

    data.frame(probes = unique(olap$probes),
               genes = tapply(olap$genes, olap$probes, paste, collapse = " "))

or using plyr:

    library(plyr)
    ddply(olap, "probes", summarize, genes = paste(genes, collapse = " "))

Update
It's probably safer in the first version to do this:

    tmp <- tapply(olap$genes, olap$probes, paste, collapse = " ")
    data.frame(probes = names(tmp), genes = tmp)

This is just in case unique gives the probes in a different order from tapply. Personally, I would always use ddply.
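
To make the ordering concern concrete, here is a minimal base R sketch. It reuses the sample rows from the question, but the row order is shuffled so that the probes are not already sorted. That shuffling is an assumption added purely for illustration, as are the names by_unique and result. tapply groups by the sorted factor levels of probes, while unique keeps the order of first appearance, so the two can disagree.

```r
# Sample rows from the question, deliberately reordered (an assumption for
# illustration) so that the probes column is not in sorted order.
olap <- data.frame(
  probes = c("cg00061679", "cg00050873", "cg00061679", "cg00061679"),
  genes  = c("DAZ1", "TSPY4", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# tapply groups by the factor levels of probes, which are sorted.
tmp <- tapply(olap$genes, olap$probes, paste, collapse = " ")

# unique() keeps the order of first appearance instead.
by_unique <- unique(olap$probes)

# Compare the two orderings; they differ for this input.
print(names(tmp))
print(by_unique)

# Safer: take the probe labels from the tapply result itself, so each
# label is guaranteed to sit next to its own collapsed gene string.
result <- data.frame(
  probes = names(tmp),
  genes  = as.vector(tmp),
  stringsAsFactors = FALSE
)
print(result)
```

With input like this, pairing unique(olap$probes) with the tapply output would attach the gene strings to the wrong probes, which is why reading the labels from names(tmp) is the more defensive choice.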