Concatenating lines to a string in Python

Updated: 2024-10-24 17:30:55

I have a fasta file as follows:

    >scaf1
    AAAAAATGTGTGTGTGTGTGYAA
    AAAAACACGTGTGTGTG
    >scaf2
    ACGTGTGTGTGATGTGGY
    AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK
    >scaf3
    AAAGTGTGTTGTGAAACACACYAAW

I want to read it into a dictionary in such a way that the multiple lines belonging to one sequence are stored under one key. The output would be:

{'scaf1': 'AAAAAATGTGTGTGTGTGTGYAAAAAAACACGTGTGTGTG', 'scaf2': 'ACGTGTGTGTGATGTGGYAAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK', 'scaf3': 'AAAGTGTGTTGTGAAACACACYAAW'}

The script I have written is:

    import sys
    from collections import defaultdict

    fastaseq = open(sys.argv[1], "r")

    def readfasta(fastaseq):
        fasta_dict = {}
        for line in fastaseq:
            if line.startswith('>'):
                header = line.strip('\n')[1:]
                sequence = ''
            else:
                sequence = sequence + line.strip('\n')
            fasta_dict[header] = sequence
        return fasta_dict

    fastadict = readfasta(fastaseq)
    print fastadict

It works correctly and quickly for a small file like this, but when the file size grows to about 1.5 GB it becomes too slow. The step that takes the time is the concatenation of the sequence lines. Is there a faster way to concatenate the lines into a single string?

Best answer

Concatenating strings with + has to create a new string each time, because Python strings are immutable. Copying the ever-growing sequence on every line is what makes it time-consuming.
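A rough way to see the difference is to time both approaches on synthetic data. The sketch below is only illustrative: the helper names `concat_with_plus` and `concat_with_join` and the test data are made up for this example, and the first helper stores the growing string in a dict on every iteration to mimic the loop in the question. Timings depend on the interpreter and the machine, so no figures are quoted here.

```python
import timeit


def concat_with_plus(lines):
    """Rebuild the string on every line, storing it in a dict each time
    (hypothetical helper that mimics the question's loop)."""
    store = {}
    sequence = ''
    for line in lines:
        sequence = sequence + line
        store['seq'] = sequence
    return store['seq']


def concat_with_join(lines):
    """Collect the pieces in a list and join them once at the end."""
    parts = []
    for line in lines:
        parts.append(line)
    return ''.join(parts)


# Hypothetical data: 5000 lines of 60 characters each.
lines = ['ACGT' * 15] * 5000

# Both approaches must produce the same string.
assert concat_with_plus(lines) == concat_with_join(lines)

for func in (concat_with_plus, concat_with_join):
    seconds = timeit.timeit(lambda: func(lines), number=3)
    print(func.__name__, round(seconds, 3))
```

Increasing the number of lines should make any gap more visible, since the cost of repeated copying grows with the length of the string built so far.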

Collect the lines in a list and use str.join to concatenate them once all the pieces of a record are ready:

    import sys


    def read_fasta(filename):
        fasta_dict = {}
        parts = []      # sequence lines of the current record
        header = None
        with open(filename, 'r') as f:
            for line in f:
                if line.startswith('>'):
                    # a new record: save the previous one to the dict
                    if header is not None:
                        fasta_dict[header] = ''.join(parts)
                        del parts[:]  # empty the list
                    header = line[1:].strip()
                else:
                    parts.append(line.strip())
        # save the last record (skipped if the file had no header line)
        if header is not None:
            fasta_dict[header] = ''.join(parts)
        return fasta_dict


    fastadict = read_fasta(sys.argv[1])
    print(fastadict)
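To try the approach without a file on disk, the same list-and-join logic can be run over an in-memory copy of the sample from the question. This is a minimal sketch, not the answerer's code: the function name `read_fasta_lines` and the `SAMPLE` constant are assumptions introduced for the example, and the function takes any iterable of lines instead of a filename.

```python
import io


def read_fasta_lines(lines):
    """Build {header: sequence} from an iterable of FASTA lines
    (hypothetical variant that accepts lines instead of a filename)."""
    fasta_dict = {}
    parts = []
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # a new record starts: store the previous one, if any
            if header is not None:
                fasta_dict[header] = ''.join(parts)
            header = line[1:]
            parts = []
        else:
            parts.append(line)
    # store the final record
    if header is not None:
        fasta_dict[header] = ''.join(parts)
    return fasta_dict


# The sample file from the question, held in memory.
SAMPLE = """>scaf1
AAAAAATGTGTGTGTGTGTGYAA
AAAAACACGTGTGTGTG
>scaf2
ACGTGTGTGTGATGTGGY
AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK
>scaf3
AAAGTGTGTTGTGAAACACACYAAW
"""

result = read_fasta_lines(io.StringIO(SAMPLE))
print(result)
```

Because an open file object is also an iterable of lines, the same function could be called as `read_fasta_lines(f)` inside a `with open(...) as f:` block.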

Published: 2023-07-26 03:51:00
Article link: https://www.elefans.com/category/jswz/34/1270717.html