I have a data frame with two columns containing character strings. In one column (named probes) I have duplicated cases (that is, several cases with the same character string). For each case in probes I want to find all the cases containing the same string, and then merge the values of all the corresponding cases in the second column (named genes) into a single case.
For example, if I have this structure:
probes genes
1 cg00050873 TSPY4
2 cg00061679 DAZ1
3 cg00061679 DAZ4
4 cg00061679 DAZ4
I want to change it to this structure:
probes genes
1 cg00050873 TSPY4
2 cg00061679 DAZ1 DAZ4 DAZ4
Obviously there is no problem doing this for a single probe, using which and then paste with collapse:
ind<-which(olap$probes=="cg00061679")
genename<-(olap[ind,2])
genecomb<-paste(genename[1:length(genename)], collapse=" ")
But I'm not sure how to extract the indices of the duplicates in the probes column across the whole data frame. Any ideas?
Thanks in advance
Solution
You can use tapply in base R:
data.frame(probes=unique(olap$probes),
genes=tapply(olap$genes, olap$probes, paste, collapse=" "))
or use plyr:
library(plyr)
ddply(olap, "probes", summarize, genes = paste(genes, collapse=" "))
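If plyr is not available, a similar result can probably be had with aggregate from base R. This is a sketch of an alternative, not part of the original answer; it rebuilds olap from the example in the question and assumes both columns are plain character vectors.

```r
# Example data from the question (assumed to be character columns)
olap <- data.frame(
  probes = c("cg00050873", "cg00061679", "cg00061679", "cg00061679"),
  genes  = c("TSPY4", "DAZ1", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# Group genes by probes and paste each group into one string;
# the extra argument collapse = " " is passed through to paste
collapsed <- aggregate(genes ~ probes, data = olap, FUN = paste, collapse = " ")
print(collapsed)
```

Like ddply, aggregate keeps each probe next to its own collapsed genes, so the result does not depend on two separately computed vectors lining up.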
UPDATE
It's probably safer in the first version to do this:
tmp <- tapply(olap$genes, olap$probes, paste, collapse=" ")
data.frame(probes=names(tmp), genes=tmp)
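This guards against unique giving the probes in a different order to tapply: unique keeps the order of first appearance, whereas tapply orders its result by the sorted levels of the grouping variable. Personally I would always use ddply.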
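To see why the order can differ, here is a small sketch using the question's data with the rows rearranged (the rearrangement is an assumption made for illustration) so that the probes do not first appear in sorted order.

```r
# Same values as the question's example, but with rows reordered for illustration
olap <- data.frame(
  probes = c("cg00061679", "cg00050873", "cg00061679", "cg00061679"),
  genes  = c("DAZ1", "TSPY4", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# unique() keeps the order of first appearance
first_seen <- unique(olap$probes)

# tapply() groups by a factor, so its result follows the sorted levels
tmp <- tapply(olap$genes, olap$probes, paste, collapse = " ")

# The two orders disagree here, so pairing unique() with tapply() would mislabel rows
print(first_seen)
print(names(tmp))

# Taking the probe names from the tapply result keeps the pairs aligned
safe <- data.frame(probes = names(tmp), genes = as.vector(tmp),
                   row.names = NULL, stringsAsFactors = FALSE)
print(safe)
```

With this input, combining unique(olap$probes) with the tapply result would attach TSPY4 to cg00061679, which is the mismatch that the names(tmp) version avoids.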