从基因组创建树状图(Create a Dendogram from Genome)

我想玩基因组数据：

Species_A = ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag Species_B = ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag Species_C = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag Species_D = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg Species_E = ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag

我想根据上面的基因组序列，根据这些生物彼此之间的相关程度来创建一个树形图。我先做的是计算每个物种的a，c，t和g的数量然后我创建了一个数组，然后绘制了一个树状图：

gen_size1 = len(Species_A) a1 = float(Species_A.count('a'))/float(gen_size1) c1 = float(Species_A.count('c'))/float(gen_size1) g1 = float(Species_A.count('g'))/float(gen_size1) t1 = float(Species_A.count('t'))/float(gen_size1) . . . gen_size5 = len(Species_E) a5 = float(Species_E.count('a'))/float(gen_size5) c5 = float(Species_E.count('c'))/float(gen_size5) g5 = float(Species_E.count('g'))/float(gen_size5) t5 = float(Species_E.count('t'))/float(gen_size5) my_genes = np.array([[a1,c1,g1,t1],[a2,c2,g2,t2],[a3,c3,g3,t3],[a4,c4,g4,t4],[a5,c5,g5,t5]]) plt.subplot(1,2,1) plt.title("Mononucleotide") linkage_matrix = linkage(my_genes, "single") print linkage_matrix dendrogram(linkage_matrix,truncate_mode='lastp', color_threshold=1, labels=[Species_A, Species_B, Species_C, Species_D, Species_E], show_leaf_counts=True) plt.show()

物种A和B是同一生物的变体，我期望两者都应该与根的共同进化枝不同，物种C和D同样如此，它们应该从根的另一个共同进化枝发散，然后物种E偏离主根，因为它与物种A到D无关。不幸的是，树状图结果与物种A和E混合，从一个普通的进化枝发散，然后是另一个进化枝中的物种C，D和B（相当混乱）。

我已经阅读了基因组序列的层次聚类，但我发现它只适用于二维系统，不幸的是我有4个维度，即a，c，t和g。还有其他策略吗？谢谢您的帮助！

I wanted to play around with genomic data:

Species_A = ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag Species_B = ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag Species_C = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag Species_D = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg Species_E = ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag

I wanted to create a dendrogram based on how close these organisms are related to each other given the genome sequence above. What I did first was to count the number of a's, c's, t's and g's of each species then I created an array, then plotted a dendrogram:

gen_size1 = len(Species_A) a1 = float(Species_A.count('a'))/float(gen_size1) c1 = float(Species_A.count('c'))/float(gen_size1) g1 = float(Species_A.count('g'))/float(gen_size1) t1 = float(Species_A.count('t'))/float(gen_size1) . . . gen_size5 = len(Species_E) a5 = float(Species_E.count('a'))/float(gen_size5) c5 = float(Species_E.count('c'))/float(gen_size5) g5 = float(Species_E.count('g'))/float(gen_size5) t5 = float(Species_E.count('t'))/float(gen_size5) my_genes = np.array([[a1,c1,g1,t1],[a2,c2,g2,t2],[a3,c3,g3,t3],[a4,c4,g4,t4],[a5,c5,g5,t5]]) plt.subplot(1,2,1) plt.title("Mononucleotide") linkage_matrix = linkage(my_genes, "single") print linkage_matrix dendrogram(linkage_matrix,truncate_mode='lastp', color_threshold=1, labels=[Species_A, Species_B, Species_C, Species_D, Species_E], show_leaf_counts=True) plt.show()

Species A and B are variants of the same organism and I am expecting that both should diverge from a common clade form the root, same goes with Species C and D which should diverge from another common clade from the root then with Species E diverging from the main root because it is not related to Species A to D. Unfortunately the dendrogram result was mixed up with Species A and E diverging from a common clade, then Species C, D and B in another clade (pretty messed up).

I have read about hierarchical clustering for genome sequence but I have observed that it only accommodates 2 dimensional system, unfortunately I have 4 dimensions which are a,c,t and g. Any other strategy for this? thanks for the help!

最满意答案

这是生物信息学中相当普遍的问题，因此您应该使用BioPython这样的生物信息库来构建此功能。

首先，使用序列创建一个多FASTA文件：

import os from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_dna sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag', 'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg', 'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag'] my_records = [SeqRecord(Seq(sequence, generic_dna), id='Species_{}'.format(letter), description='Species_{}'.format(letter)) for sequence, letter in zip(sequences, 'ABCDE')] root_dir = r"C:\Users\BioGeek\Documents\temp" filename = 'my_sequences' fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename)) SeqIO.write(my_records, fasta_path, "fasta")

这将创建如下所示的文件C:\Users\BioGeek\Documents\temp\my_sequences.fasta ：

>Species_A ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag >Species_B ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag >Species_C ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag >Species_D ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg >Species_E ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag

接下来，使用命令行工具ClustalW执行多序列对齐：

from Bio.Align.Applications import ClustalwCommandline clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe" assert os.path.isfile(clustalw_exe), "Clustal W executable missing" clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path) stdout, stderr = clustalw_cline() print stdout

这打印：

CLUSTAL 2.1 Multiple Sequence Alignments Sequence format is Pearson Sequence 1: Species_A 50 bp Sequence 2: Species_B 50 bp Sequence 3: Species_C 50 bp Sequence 4: Species_D 50 bp Sequence 5: Species_E 50 bp Start of Pairwise alignments Aligning... Sequences (1:2) Aligned. Score: 90 Sequences (1:3) Aligned. Score: 94 Sequences (1:4) Aligned. Score: 88 Sequences (1:5) Aligned. Score: 84 Sequences (2:3) Aligned. Score: 90 Sequences (2:4) Aligned. Score: 84 Sequences (2:5) Aligned. Score: 78 Sequences (3:4) Aligned. Score: 94 Sequences (3:5) Aligned. Score: 82 Sequences (4:5) Aligned. Score: 82 Guide tree file created: [C:\Users\BioGeek\Documents\temp\my_sequences.dnd] There are 4 groups Start of Multiple Alignment Aligning... Group 1: Sequences: 2 Score:912 Group 2: Sequences: 2 Score:921 Group 3: Sequences: 4 Score:865 Group 4: Sequences: 5 Score:855 Alignment Score 2975 CLUSTAL-Alignment file created [C:\Users\BioGeek\Documents\temp\my_sequences.aln]

ClustalW创建的my_sequences.dnd文件是标准的Newick树文件， Bio.Phylo可以解析这些：

from Bio import Phylo newick_path = os.path.join(root_dir, '{}.dnd'.format(filename)) tree = Phylo.read(newick_path, "newick") Phylo.draw_ascii(tree)

哪个印刷品：

或者，如果安装了matplotlib或pylab ，则可以使用draw函数创建图形：

tree.rooted = True Phylo.draw(tree, branch_labels=lambda c: c.branch_length)

产生：

树状图

这个树状图清楚地说明了你观察到的东西：物种A和B是同一生物的变种，并且都与根部的共同进化枝发散。与物种C和D相同，两者都与根部的另一个共同分支不同。最后，物种E偏离主根，因为它与物种A到D无关。

This is a fairly common problem in bioinformatics, so you should use a bioinformatics library like BioPython that has this functionality builtin.

First you create a multi FASTA file with your sequences:

import os from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_dna sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag', 'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg', 'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag'] my_records = [SeqRecord(Seq(sequence, generic_dna), id='Species_{}'.format(letter), description='Species_{}'.format(letter)) for sequence, letter in zip(sequences, 'ABCDE')] root_dir = r"C:\Users\BioGeek\Documents\temp" filename = 'my_sequences' fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename)) SeqIO.write(my_records, fasta_path, "fasta")

This creates the file C:\Users\BioGeek\Documents\temp\my_sequences.fasta that looks like this:

>Species_A ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag >Species_B ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag >Species_C ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag >Species_D ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg >Species_E ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag

Next, use the command line tool ClustalW to do a multiple sequence alignment:

from Bio.Align.Applications import ClustalwCommandline clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe" assert os.path.isfile(clustalw_exe), "Clustal W executable missing" clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path) stdout, stderr = clustalw_cline() print stdout

This prints:

CLUSTAL 2.1 Multiple Sequence Alignments Sequence format is Pearson Sequence 1: Species_A 50 bp Sequence 2: Species_B 50 bp Sequence 3: Species_C 50 bp Sequence 4: Species_D 50 bp Sequence 5: Species_E 50 bp Start of Pairwise alignments Aligning... Sequences (1:2) Aligned. Score: 90 Sequences (1:3) Aligned. Score: 94 Sequences (1:4) Aligned. Score: 88 Sequences (1:5) Aligned. Score: 84 Sequences (2:3) Aligned. Score: 90 Sequences (2:4) Aligned. Score: 84 Sequences (2:5) Aligned. Score: 78 Sequences (3:4) Aligned. Score: 94 Sequences (3:5) Aligned. Score: 82 Sequences (4:5) Aligned. Score: 82 Guide tree file created: [C:\Users\BioGeek\Documents\temp\my_sequences.dnd] There are 4 groups Start of Multiple Alignment Aligning... Group 1: Sequences: 2 Score:912 Group 2: Sequences: 2 Score:921 Group 3: Sequences: 4 Score:865 Group 4: Sequences: 5 Score:855 Alignment Score 2975 CLUSTAL-Alignment file created [C:\Users\BioGeek\Documents\temp\my_sequences.aln]

The my_sequences.dnd file ClustalW creates, is a standard Newick tree file and Bio.Phylo can parse these:

from Bio import Phylo newick_path = os.path.join(root_dir, '{}.dnd'.format(filename)) tree = Phylo.read(newick_path, "newick") Phylo.draw_ascii(tree)

Which prints:

Or, if you have matplotlib or pylab installed, you can create a graphic using the draw function:

tree.rooted = True Phylo.draw(tree, branch_labels=lambda c: c.branch_length)

which produces:

dendrogram

This dendrogram clearly illustrates what you observed: that species A and B are variants of the same organism and both diverge from a common clade from the root. Same goes with Species C and D, both diverge from another common clade from the root. Finally, Species E diverges from the main root because it is not related to Species A to D.

更多推荐

从基因组创建树状图(Create a Dendogram from Genome)

最满意答案

发布评论取消回复

最近发表

热门文章

标签列表