我想玩基因组数据:
Species_A = ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag Species_B = ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag Species_C = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag Species_D = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg Species_E = ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag我想根据上面的基因组序列,根据这些生物彼此之间的相关程度来创建一个树形图。 我先做的是计算每个物种的a,c,t和g的数量然后我创建了一个数组,然后绘制了一个树状图:
gen_size1 = len(Species_A) a1 = float(Species_A.count('a'))/float(gen_size1) c1 = float(Species_A.count('c'))/float(gen_size1) g1 = float(Species_A.count('g'))/float(gen_size1) t1 = float(Species_A.count('t'))/float(gen_size1) . . . gen_size5 = len(Species_E) a5 = float(Species_E.count('a'))/float(gen_size5) c5 = float(Species_E.count('c'))/float(gen_size5) g5 = float(Species_E.count('g'))/float(gen_size5) t5 = float(Species_E.count('t'))/float(gen_size5) my_genes = np.array([[a1,c1,g1,t1],[a2,c2,g2,t2],[a3,c3,g3,t3],[a4,c4,g4,t4],[a5,c5,g5,t5]]) plt.subplot(1,2,1) plt.title("Mononucleotide") linkage_matrix = linkage(my_genes, "single") print linkage_matrix dendrogram(linkage_matrix,truncate_mode='lastp', color_threshold=1, labels=[Species_A, Species_B, Species_C, Species_D, Species_E], show_leaf_counts=True) plt.show()物种A和B是同一生物的变体,我期望两者都应该与根的共同进化枝不同,物种C和D同样如此,它们应该从根的另一个共同进化枝发散,然后物种E偏离主根,因为它与物种A到D无关。不幸的是,树状图结果与物种A和E混合,从一个普通的进化枝发散,然后是另一个进化枝中的物种C,D和B(相当混乱)。
我已经阅读了基因组序列的层次聚类,但我发现它只适用于二维系统,不幸的是我有4个维度,即a,c,t和g。 还有其他策略吗? 谢谢您的帮助!
I wanted to play around with genomic data:
Species_A = ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag Species_B = ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag Species_C = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag Species_D = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg Species_E = ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacagI wanted to create a dendrogram based on how close these organisms are related to each other given the genome sequence above. What I did first was to count the number of a's, c's, t's and g's of each species then I created an array, then plotted a dendrogram:
gen_size1 = len(Species_A) a1 = float(Species_A.count('a'))/float(gen_size1) c1 = float(Species_A.count('c'))/float(gen_size1) g1 = float(Species_A.count('g'))/float(gen_size1) t1 = float(Species_A.count('t'))/float(gen_size1) . . . gen_size5 = len(Species_E) a5 = float(Species_E.count('a'))/float(gen_size5) c5 = float(Species_E.count('c'))/float(gen_size5) g5 = float(Species_E.count('g'))/float(gen_size5) t5 = float(Species_E.count('t'))/float(gen_size5) my_genes = np.array([[a1,c1,g1,t1],[a2,c2,g2,t2],[a3,c3,g3,t3],[a4,c4,g4,t4],[a5,c5,g5,t5]]) plt.subplot(1,2,1) plt.title("Mononucleotide") linkage_matrix = linkage(my_genes, "single") print linkage_matrix dendrogram(linkage_matrix,truncate_mode='lastp', color_threshold=1, labels=[Species_A, Species_B, Species_C, Species_D, Species_E], show_leaf_counts=True) plt.show()Species A and B are variants of the same organism and I am expecting that both should diverge from a common clade form the root, same goes with Species C and D which should diverge from another common clade from the root then with Species E diverging from the main root because it is not related to Species A to D. Unfortunately the dendrogram result was mixed up with Species A and E diverging from a common clade, then Species C, D and B in another clade (pretty messed up).
I have read about hierarchical clustering for genome sequence but I have observed that it only accommodates 2 dimensional system, unfortunately I have 4 dimensions which are a,c,t and g. Any other strategy for this? thanks for the help!
最满意答案
这是生物信息学中相当普遍的问题,因此您应该使用BioPython这样的生物信息库来构建此功能。
首先,使用序列创建一个多FASTA文件:
import os from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_dna sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag', 'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg', 'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag'] my_records = [SeqRecord(Seq(sequence, generic_dna), id='Species_{}'.format(letter), description='Species_{}'.format(letter)) for sequence, letter in zip(sequences, 'ABCDE')] root_dir = r"C:\Users\BioGeek\Documents\temp" filename = 'my_sequences' fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename)) SeqIO.write(my_records, fasta_path, "fasta")这将创建如下所示的文件C:\Users\BioGeek\Documents\temp\my_sequences.fasta :
>Species_A ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag >Species_B ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag >Species_C ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag >Species_D ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg >Species_E ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag接下来,使用命令行工具ClustalW执行多序列对齐:
from Bio.Align.Applications import ClustalwCommandline clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe" assert os.path.isfile(clustalw_exe), "Clustal W executable missing" clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path) stdout, stderr = clustalw_cline() print stdout这打印:
CLUSTAL 2.1 Multiple Sequence Alignments Sequence format is Pearson Sequence 1: Species_A 50 bp Sequence 2: Species_B 50 bp Sequence 3: Species_C 50 bp Sequence 4: Species_D 50 bp Sequence 5: Species_E 50 bp Start of Pairwise alignments Aligning... Sequences (1:2) Aligned. Score: 90 Sequences (1:3) Aligned. Score: 94 Sequences (1:4) Aligned. Score: 88 Sequences (1:5) Aligned. Score: 84 Sequences (2:3) Aligned. Score: 90 Sequences (2:4) Aligned. Score: 84 Sequences (2:5) Aligned. Score: 78 Sequences (3:4) Aligned. Score: 94 Sequences (3:5) Aligned. Score: 82 Sequences (4:5) Aligned. Score: 82 Guide tree file created: [C:\Users\BioGeek\Documents\temp\my_sequences.dnd] There are 4 groups Start of Multiple Alignment Aligning... Group 1: Sequences: 2 Score:912 Group 2: Sequences: 2 Score:921 Group 3: Sequences: 4 Score:865 Group 4: Sequences: 5 Score:855 Alignment Score 2975 CLUSTAL-Alignment file created [C:\Users\BioGeek\Documents\temp\my_sequences.aln]ClustalW创建的my_sequences.dnd文件是标准的Newick树文件 , Bio.Phylo可以解析这些:
from Bio import Phylo newick_path = os.path.join(root_dir, '{}.dnd'.format(filename)) tree = Phylo.read(newick_path, "newick") Phylo.draw_ascii(tree)哪个印刷品:
____________ Species_A ____| | |_____________________________________ Species_B | _| ____ Species_C |_________| | |_________________________ Species_D | |__________________________________________________________________ Species_E或者,如果安装了matplotlib或pylab ,则可以使用draw函数创建图形:
tree.rooted = True Phylo.draw(tree, branch_labels=lambda c: c.branch_length)产生:
这个树状图清楚地说明了你观察到的东西:物种A和B是同一生物的变种,并且都与根部的共同进化枝发散。 与物种C和D相同,两者都与根部的另一个共同分支不同。 最后,物种E偏离主根,因为它与物种A到D无关。
This is a fairly common problem in bioinformatics, so you should use a bioinformatics library like BioPython that has this functionality builtin.
First you create a multi FASTA file with your sequences:
import os from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_dna sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag', 'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag', 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg', 'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag'] my_records = [SeqRecord(Seq(sequence, generic_dna), id='Species_{}'.format(letter), description='Species_{}'.format(letter)) for sequence, letter in zip(sequences, 'ABCDE')] root_dir = r"C:\Users\BioGeek\Documents\temp" filename = 'my_sequences' fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename)) SeqIO.write(my_records, fasta_path, "fasta")This creates the file C:\Users\BioGeek\Documents\temp\my_sequences.fasta that looks like this:
>Species_A ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag >Species_B ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag >Species_C ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag >Species_D ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg >Species_E ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacagNext, use the command line tool ClustalW to do a multiple sequence alignment:
from Bio.Align.Applications import ClustalwCommandline clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe" assert os.path.isfile(clustalw_exe), "Clustal W executable missing" clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path) stdout, stderr = clustalw_cline() print stdoutThis prints:
CLUSTAL 2.1 Multiple Sequence Alignments Sequence format is Pearson Sequence 1: Species_A 50 bp Sequence 2: Species_B 50 bp Sequence 3: Species_C 50 bp Sequence 4: Species_D 50 bp Sequence 5: Species_E 50 bp Start of Pairwise alignments Aligning... Sequences (1:2) Aligned. Score: 90 Sequences (1:3) Aligned. Score: 94 Sequences (1:4) Aligned. Score: 88 Sequences (1:5) Aligned. Score: 84 Sequences (2:3) Aligned. Score: 90 Sequences (2:4) Aligned. Score: 84 Sequences (2:5) Aligned. Score: 78 Sequences (3:4) Aligned. Score: 94 Sequences (3:5) Aligned. Score: 82 Sequences (4:5) Aligned. Score: 82 Guide tree file created: [C:\Users\BioGeek\Documents\temp\my_sequences.dnd] There are 4 groups Start of Multiple Alignment Aligning... Group 1: Sequences: 2 Score:912 Group 2: Sequences: 2 Score:921 Group 3: Sequences: 4 Score:865 Group 4: Sequences: 5 Score:855 Alignment Score 2975 CLUSTAL-Alignment file created [C:\Users\BioGeek\Documents\temp\my_sequences.aln]The my_sequences.dnd file ClustalW creates, is a standard Newick tree file and Bio.Phylo can parse these:
from Bio import Phylo newick_path = os.path.join(root_dir, '{}.dnd'.format(filename)) tree = Phylo.read(newick_path, "newick") Phylo.draw_ascii(tree)Which prints:
____________ Species_A ____| | |_____________________________________ Species_B | _| ____ Species_C |_________| | |_________________________ Species_D | |__________________________________________________________________ Species_EOr, if you have matplotlib or pylab installed, you can create a graphic using the draw function:
tree.rooted = True Phylo.draw(tree, branch_labels=lambda c: c.branch_length)which produces:
This dendrogram clearly illustrates what you observed: that species A and B are variants of the same organism and both diverge from a common clade from the root. Same goes with Species C and D, both diverge from another common clade from the root. Finally, Species E diverges from the main root because it is not related to Species A to D.
更多推荐
发布评论