将个体基因组区间连接到种群区域(Concatenate individual genomic intervals into populational regions)

编程入门 行业动态 更新时间:2024-10-18 01:31:06
将个体基因组区间连接到种群区域(Concatenate individual genomic intervals into populational regions)

我想将单个基因组区间连接到共同区域。

我的意见:

dfin <- "chr start end sample type 1 10 20 NE1 loss 1 5 15 NE2 gain 1 25 30 NE1 gain 2 40 50 NE1 loss 2 40 60 NE2 loss 3 20 30 NE1 gain" dfin <- read.table(text=dfin, header=T)

我的预期产量:

dfout <- "chr start end samples type 1 5 20 NE1-NE2 both 1 25 30 NE1 gain 2 40 60 NE1-NE2 loss 3 20 30 NE1 gain" dfout <- read.table(text=dfout, header=T)

dfin的间隔永远不会在同一动物中重叠,只是在动物之间(分别是sample和samples )。 列type在dfin有两个因素( loss和gain ),并且预计在dfin中有三个因素( loss , gain和both ,当dfout的连接区域基于loss和gain )。

有什么想法来处理吗?

*更新了@David Arenburg

I would like to concatenate individual genomic intervals into common regions.

My input:

dfin <- "chr start end sample type 1 10 20 NE1 loss 1 5 15 NE2 gain 1 25 30 NE1 gain 2 40 50 NE1 loss 2 40 60 NE2 loss 3 20 30 NE1 gain" dfin <- read.table(text=dfin, header=T)

My expected output:

dfout <- "chr start end samples type 1 5 20 NE1-NE2 both 1 25 30 NE1 gain 2 40 60 NE1-NE2 loss 3 20 30 NE1 gain" dfout <- read.table(text=dfout, header=T)

The intervals in dfin will never overlap in the same animal, just between animals (columns sample and samples, respectively). The column type have two factors (lossand gain) in dfin and is expected to have three factors in dfout (loss, gain and both, which occur when the concatenated region in dfout was based on both loss and gain).

Any idea to deal with that?

*Updated for @David Arenburg

最满意答案

(扩展评论)你可以使用“IRanges”包:

library(IRanges) #build an appropriate object dat = RangedData(space = dfin$chr, IRanges(dfin$start, dfin$end), sample = dfin$sample, type = dfin$type) dat #concatenate overlaps with an extra step of saving the concatenation mappings ans = RangedData(reduce(ranges(dat), with.revmap = TRUE)) ans

无法弄清楚如何避免reduce “RangedData”对象的丢失列,但保存映射我们可以做类似的事情(可能有一个更合适的 - 根据“IRanges” - 提取映射的方式,但我无法做到找到它):

tmp = elementMetadata(ranges(ans)@unlistData)$revmap@partitioning maps = rep(seq_along(start(tmp)), width(tmp)) maps #[1] 1 1 2 3 3 4

有了间隔连接的映射,我们可以聚合“sample”和“type”来获得最终形式。 例如:

tapply(dfin$sample, maps, function(X) paste(unique(X), collapse = "-")) # 1 2 3 4 #"NE1-NE2" "NE1" "NE1-NE2" "NE1"

(expanding on the comment) You could use the "IRanges" package:

library(IRanges) #build an appropriate object dat = RangedData(space = dfin$chr, IRanges(dfin$start, dfin$end), sample = dfin$sample, type = dfin$type) dat #concatenate overlaps with an extra step of saving the concatenation mappings ans = RangedData(reduce(ranges(dat), with.revmap = TRUE)) ans

Could not figure out how to avoid reduce losing columns of "RangedData" object, but having the mappings saved we could do something like (there might be a more appropriate -according to "IRanges"- way to extract mappings, but I was not able to find it):

tmp = elementMetadata(ranges(ans)@unlistData)$revmap@partitioning maps = rep(seq_along(start(tmp)), width(tmp)) maps #[1] 1 1 2 3 3 4

Having the mappings of interval concatenation, we could aggregate "sample" and "type" to get the final form. E.g.:

tapply(dfin$sample, maps, function(X) paste(unique(X), collapse = "-")) # 1 2 3 4 #"NE1-NE2" "NE1" "NE1-NE2" "NE1"

更多推荐

本文发布于:2023-04-28 06:17:00,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1330343.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:基因组   种群   区间   连接到   个体

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!