我想将单个基因组区间连接到共同区域。
我的意见:
dfin <- "chr start end sample type 1 10 20 NE1 loss 1 5 15 NE2 gain 1 25 30 NE1 gain 2 40 50 NE1 loss 2 40 60 NE2 loss 3 20 30 NE1 gain" dfin <- read.table(text=dfin, header=T)我的预期产量:
dfout <- "chr start end samples type 1 5 20 NE1-NE2 both 1 25 30 NE1 gain 2 40 60 NE1-NE2 loss 3 20 30 NE1 gain" dfout <- read.table(text=dfout, header=T)dfin的间隔永远不会在同一动物中重叠,只是在动物之间(分别是sample和samples )。 列type在dfin有两个因素( loss和gain ),并且预计在dfin中有三个因素( loss , gain和both ,当dfout的连接区域基于loss和gain )。
有什么想法来处理吗?
*更新了@David Arenburg
I would like to concatenate individual genomic intervals into common regions.
My input:
dfin <- "chr start end sample type 1 10 20 NE1 loss 1 5 15 NE2 gain 1 25 30 NE1 gain 2 40 50 NE1 loss 2 40 60 NE2 loss 3 20 30 NE1 gain" dfin <- read.table(text=dfin, header=T)My expected output:
dfout <- "chr start end samples type 1 5 20 NE1-NE2 both 1 25 30 NE1 gain 2 40 60 NE1-NE2 loss 3 20 30 NE1 gain" dfout <- read.table(text=dfout, header=T)The intervals in dfin will never overlap in the same animal, just between animals (columns sample and samples, respectively). The column type have two factors (lossand gain) in dfin and is expected to have three factors in dfout (loss, gain and both, which occur when the concatenated region in dfout was based on both loss and gain).
Any idea to deal with that?
*Updated for @David Arenburg
最满意答案
(扩展评论)你可以使用“IRanges”包:
library(IRanges) #build an appropriate object dat = RangedData(space = dfin$chr, IRanges(dfin$start, dfin$end), sample = dfin$sample, type = dfin$type) dat #concatenate overlaps with an extra step of saving the concatenation mappings ans = RangedData(reduce(ranges(dat), with.revmap = TRUE)) ans无法弄清楚如何避免reduce “RangedData”对象的丢失列,但保存映射我们可以做类似的事情(可能有一个更合适的 - 根据“IRanges” - 提取映射的方式,但我无法做到找到它):
tmp = elementMetadata(ranges(ans)@unlistData)$revmap@partitioning maps = rep(seq_along(start(tmp)), width(tmp)) maps #[1] 1 1 2 3 3 4有了间隔连接的映射,我们可以聚合“sample”和“type”来获得最终形式。 例如:
tapply(dfin$sample, maps, function(X) paste(unique(X), collapse = "-")) # 1 2 3 4 #"NE1-NE2" "NE1" "NE1-NE2" "NE1"(expanding on the comment) You could use the "IRanges" package:
library(IRanges) #build an appropriate object dat = RangedData(space = dfin$chr, IRanges(dfin$start, dfin$end), sample = dfin$sample, type = dfin$type) dat #concatenate overlaps with an extra step of saving the concatenation mappings ans = RangedData(reduce(ranges(dat), with.revmap = TRUE)) ansCould not figure out how to avoid reduce losing columns of "RangedData" object, but having the mappings saved we could do something like (there might be a more appropriate -according to "IRanges"- way to extract mappings, but I was not able to find it):
tmp = elementMetadata(ranges(ans)@unlistData)$revmap@partitioning maps = rep(seq_along(start(tmp)), width(tmp)) maps #[1] 1 1 2 3 3 4Having the mappings of interval concatenation, we could aggregate "sample" and "type" to get the final form. E.g.:
tapply(dfin$sample, maps, function(X) paste(unique(X), collapse = "-")) # 1 2 3 4 #"NE1-NE2" "NE1" "NE1-NE2" "NE1"更多推荐
发布评论