从基因组创建树状图(Create a Dendogram from Genome)

编程入门 行业动态 更新时间:2024-10-27 12:34:52
基因组创建树状图(Create a Dendogram from Genome)

我想玩基因组数据:

Species_A = ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag Species_B = ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag Species_C = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag Species_D = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg Species_E = ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag

我想根据上面的基因组序列,根据这些生物彼此之间的相关程度来创建一个树形图。 我先做的是计算每个物种的a,c,t和g的数量然后我创建了一个数组,然后绘制了一个树状图:

gen_size1 = len(Species_A) a1 = float(Species_A.count('a'))/float(gen_size1) c1 = float(Species_A.count('c'))/float(gen_size1) g1 = float(Species_A.count('g'))/float(gen_size1) t1 = float(Species_A.count('t'))/float(gen_size1) . . . gen_size5 = len(Species_E) a5 = float(Species_E.count('a'))/float(gen_size5) c5 = float(Species_E.count('c'))/float(gen_size5) g5 = float(Species_E.count('g'))/float(gen_size5) t5 = float(Species_E.count('t'))/float(gen_size5) my_genes = np.array([[a1,c1,g1,t1],[a2,c2,g2,t2],[a3,c3,g3,t3],[a4,c4,g4,t4],[a5,c5,g5,t5]]) plt.subplot(1,2,1) plt.title("Mononucleotide") linkage_matrix = linkage(my_genes, "single") print linkage_matrix dendrogram(linkage_matrix,truncate_mode='lastp', color_threshold=1, labels=[Species_A, Species_B, Species_C, Species_D, Species_E], show_leaf_counts=True) plt.show()

物种A和B是同一生物的变体,我期望两者都应该与根的共同进化枝不同,物种C和D同样如此,它们应该从根的另一个共同进化枝发散,然后物种E偏离主根,因为它与物种A到D无关。不幸的是,树状图结果与物种A和E混合,从一个普通的进化枝发散,然后是另一个进化枝中的物种C,D和B(相当混乱)。

我已经阅读了基因组序列的层次聚类,但我发现它只适用于二维系统,不幸的是我有4个维度,即a,c,t和g。 还有其他策略吗? 谢谢您的帮助!

I wanted to play around with genomic data:

Species_A = ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag Species_B = ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag Species_C = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag Species_D = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg Species_E = ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag

I wanted to create a dendrogram based on how close these organisms are related to each other given the genome sequence above. What I did first was to count the number of a's, c's, t's and g's of each species then I created an array, then plotted a dendrogram:

gen_size1 = len(Species_A) a1 = float(Species_A.count('a'))/float(gen_size1) c1 = float(Species_A.count('c'))/float(gen_size1) g1 = float(Species_A.count('g'))/float(gen_size1) t1 = float(Species_A.count('t'))/float(gen_size1) . . . gen_size5 = len(Species_E) a5 = float(Species_E.count('a'))/float(gen_size5) c5 = float(Species_E.count('c'))/float(gen_size5) g5 = float(Species_E.count('g'))/float(gen_size5) t5 = float(Species_E.count('t'))/float(gen_size5) my_genes = np.array([[a1,c1,g1,t1],[a2,c2,g2,t2],[a3,c3,g3,t3],[a4,c4,g4,t4],[a5,c5,g5,t5]]) plt.subplot(1,2,1) plt.title("Mononucleotide") linkage_matrix = linkage(my_genes, "single") print linkage_matrix dendrogram(linkage_matrix,truncate_mode='lastp', color_threshold=1, labels=[Species_A, Species_B, Species_C, Species_D, Species_E], show_leaf_counts=True) plt.show()

Species A and B are variants of the same organism and I am expecting that both should diverge from a common clade form the root, same goes with Species C and D which should diverge from another common clade from the root then with Species E diverging from the main root because it is not related to Species A to D. Unfortunately the dendrogram result was mixed up with Species A and E diverging from a common clade, then Species C, D and B in another clade (pretty messed up).

I have read about hierarchical clustering for genome sequence but I have observed that it only accommodates 2 dimensional system, unfortunately I have 4 dimensions which are a,c,t and g. Any other strategy for this? thanks for the help!

最满意答案

这是生物信息学中相当普遍的问题,因此您应该使用BioPython这样的生物信息库来构建此功能。

首先,使用序列创建一个多FASTA文件:

import os from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_dna sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag', 'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg', 'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag'] my_records = [SeqRecord(Seq(sequence, generic_dna), id='Species_{}'.format(letter), description='Species_{}'.format(letter)) for sequence, letter in zip(sequences, 'ABCDE')] root_dir = r"C:\Users\BioGeek\Documents\temp" filename = 'my_sequences' fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename)) SeqIO.write(my_records, fasta_path, "fasta")

这将创建如下所示的文件C:\Users\BioGeek\Documents\temp\my_sequences.fasta :

>Species_A ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag >Species_B ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag >Species_C ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag >Species_D ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg >Species_E ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag

接下来,使用命令行工具ClustalW执行多序列对齐:

from Bio.Align.Applications import ClustalwCommandline clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe" assert os.path.isfile(clustalw_exe), "Clustal W executable missing" clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path) stdout, stderr = clustalw_cline() print stdout

这打印:

CLUSTAL 2.1 Multiple Sequence Alignments Sequence format is Pearson Sequence 1: Species_A 50 bp Sequence 2: Species_B 50 bp Sequence 3: Species_C 50 bp Sequence 4: Species_D 50 bp Sequence 5: Species_E 50 bp Start of Pairwise alignments Aligning... Sequences (1:2) Aligned. Score: 90 Sequences (1:3) Aligned. Score: 94 Sequences (1:4) Aligned. Score: 88 Sequences (1:5) Aligned. Score: 84 Sequences (2:3) Aligned. Score: 90 Sequences (2:4) Aligned. Score: 84 Sequences (2:5) Aligned. Score: 78 Sequences (3:4) Aligned. Score: 94 Sequences (3:5) Aligned. Score: 82 Sequences (4:5) Aligned. Score: 82 Guide tree file created: [C:\Users\BioGeek\Documents\temp\my_sequences.dnd] There are 4 groups Start of Multiple Alignment Aligning... Group 1: Sequences: 2 Score:912 Group 2: Sequences: 2 Score:921 Group 3: Sequences: 4 Score:865 Group 4: Sequences: 5 Score:855 Alignment Score 2975 CLUSTAL-Alignment file created [C:\Users\BioGeek\Documents\temp\my_sequences.aln]

ClustalW创建的my_sequences.dnd文件是标准的Newick树文件 , Bio.Phylo可以解析这些:

from Bio import Phylo newick_path = os.path.join(root_dir, '{}.dnd'.format(filename)) tree = Phylo.read(newick_path, "newick") Phylo.draw_ascii(tree)

哪个印刷品:

____________ Species_A ____| | |_____________________________________ Species_B | _| ____ Species_C |_________| | |_________________________ Species_D | |__________________________________________________________________ Species_E

或者,如果安装了matplotlib或pylab ,则可以使用draw函数创建图形:

tree.rooted = True Phylo.draw(tree, branch_labels=lambda c: c.branch_length)

产生:

树状图

这个树状图清楚地说明了你观察到的东西:物种A和B是同一生物的变种,并且都与根部的共同进化枝发散。 与物种C和D相同,两者都与根部的另一个共同分支不同。 最后,物种E偏离主根,因为它与物种A到D无关。

This is a fairly common problem in bioinformatics, so you should use a bioinformatics library like BioPython that has this functionality builtin.

First you create a multi FASTA file with your sequences:

import os from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_dna sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag', 'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg', 'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag'] my_records = [SeqRecord(Seq(sequence, generic_dna), id='Species_{}'.format(letter), description='Species_{}'.format(letter)) for sequence, letter in zip(sequences, 'ABCDE')] root_dir = r"C:\Users\BioGeek\Documents\temp" filename = 'my_sequences' fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename)) SeqIO.write(my_records, fasta_path, "fasta")

This creates the file C:\Users\BioGeek\Documents\temp\my_sequences.fasta that looks like this:

>Species_A ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag >Species_B ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag >Species_C ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag >Species_D ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg >Species_E ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag

Next, use the command line tool ClustalW to do a multiple sequence alignment:

from Bio.Align.Applications import ClustalwCommandline clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe" assert os.path.isfile(clustalw_exe), "Clustal W executable missing" clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path) stdout, stderr = clustalw_cline() print stdout

This prints:

CLUSTAL 2.1 Multiple Sequence Alignments Sequence format is Pearson Sequence 1: Species_A 50 bp Sequence 2: Species_B 50 bp Sequence 3: Species_C 50 bp Sequence 4: Species_D 50 bp Sequence 5: Species_E 50 bp Start of Pairwise alignments Aligning... Sequences (1:2) Aligned. Score: 90 Sequences (1:3) Aligned. Score: 94 Sequences (1:4) Aligned. Score: 88 Sequences (1:5) Aligned. Score: 84 Sequences (2:3) Aligned. Score: 90 Sequences (2:4) Aligned. Score: 84 Sequences (2:5) Aligned. Score: 78 Sequences (3:4) Aligned. Score: 94 Sequences (3:5) Aligned. Score: 82 Sequences (4:5) Aligned. Score: 82 Guide tree file created: [C:\Users\BioGeek\Documents\temp\my_sequences.dnd] There are 4 groups Start of Multiple Alignment Aligning... Group 1: Sequences: 2 Score:912 Group 2: Sequences: 2 Score:921 Group 3: Sequences: 4 Score:865 Group 4: Sequences: 5 Score:855 Alignment Score 2975 CLUSTAL-Alignment file created [C:\Users\BioGeek\Documents\temp\my_sequences.aln]

The my_sequences.dnd file ClustalW creates, is a standard Newick tree file and Bio.Phylo can parse these:

from Bio import Phylo newick_path = os.path.join(root_dir, '{}.dnd'.format(filename)) tree = Phylo.read(newick_path, "newick") Phylo.draw_ascii(tree)

Which prints:

____________ Species_A ____| | |_____________________________________ Species_B | _| ____ Species_C |_________| | |_________________________ Species_D | |__________________________________________________________________ Species_E

Or, if you have matplotlib or pylab installed, you can create a graphic using the draw function:

tree.rooted = True Phylo.draw(tree, branch_labels=lambda c: c.branch_length)

which produces:

dendrogram

This dendrogram clearly illustrates what you observed: that species A and B are variants of the same organism and both diverge from a common clade from the root. Same goes with Species C and D, both diverge from another common clade from the root. Finally, Species E diverges from the main root because it is not related to Species A to D.

更多推荐

本文发布于:2023-07-24 00:08:00,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1239047.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:基因组   树状   Create   Dendogram   Genome

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!