I have a FASTA file as follows:

    >scaf1
    AAAAAATGTGTGTGTGTGTGYAA
    AAAAACACGTGTGTGTG
    >scaf2
    ACGTGTGTGTGATGTGGY
    AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK
    >scaf3
    AAAGTGTGTTGTGAAACACACYAAW

I want to read it into a dictionary in such a way that multiple lines belonging to one sequence go under one key. The output would be:

    {'scaf1': 'AAAAAATGTGTGTGTGTGTGYAAAAAAACACGTGTGTGTG', 'scaf2': 'ACGTGTGTGTGATGTGGYAAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK', 'scaf3': 'AAAGTGTGTTGTGAAACACACYAAW'}

The script I have written is:

    import sys

    fastaseq = open(sys.argv[1], "r")

    def readfasta(fastaseq):
        fasta_dict = {}
        for line in fastaseq:
            if line.startswith('>'):
                header = line.strip('\n')[1:]
                sequence = ''
            else:
                sequence = sequence + line.strip('\n')
            fasta_dict[header] = sequence
        return fasta_dict

    fastadict = readfasta(fastaseq)
    print(fastadict)

It works correctly and quickly for a file like this, but when the file size grows (to about 1.5 GB) it becomes too slow. The step that takes the time is the concatenation of the sequence. Is there a faster way to concatenate the lines into a single string?
Best answer
Concatenating strings with + has to create a new string every time, because Python strings are immutable, and that is time-consuming.
Use str.join to concatenate them once all the lines of a record have been collected:

    import sys

    def read_fasta(filename):
        fasta_dict = {}
        lines = []
        header = None
        with open(filename, 'r') as f:
            for line in f:
                if line.startswith('>'):
                    # A new record starts: save the previous one first.
                    if header is not None:
                        fasta_dict[header] = ''.join(lines)
                        del lines[:]  # empty the list
                    header = line.strip().split('>')[1]
                else:
                    lines.append(line.strip())
        # Save the last record (skipped if the file contained no header).
        if header is not None:
            fasta_dict[header] = ''.join(lines)
        return fasta_dict

    fastadict = read_fasta(sys.argv[1])
    print(fastadict)
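To check the join-based approach against the sample data from the question without touching the disk, here is a minimal sketch that reads from any iterable of lines. The names `parse_fasta`, `sample`, and `result` are made up for this illustration and are not part of the original answer. The skipping of blank lines is an added assumption about the input.

```python
import io

def parse_fasta(handle):
    """Collect the lines of each record in a list and join once per record."""
    records = {}
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no sequence data
        if line.startswith('>'):
            # Store the finished record before starting the next one.
            if header is not None:
                records[header] = ''.join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # Store the final record, if the input had any header at all.
    if header is not None:
        records[header] = ''.join(chunks)
    return records

# In-memory copy of the example file from the question.
sample = io.StringIO(
    ">scaf1\n"
    "AAAAAATGTGTGTGTGTGTGYAA\n"
    "AAAAACACGTGTGTGTG\n"
    ">scaf2\n"
    "ACGTGTGTGTGATGTGGY\n"
    "AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK\n"
    ">scaf3\n"
    "AAAGTGTGTTGTGAAACACACYAAW\n"
)
result = parse_fasta(sample)
print(result)
```

Because the function only iterates over its argument, the same code should work with a real file opened via `with open(path) as f: parse_fasta(f)`.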