Accessing pandas data a million times - need to improve efficiency

I am a biologist trying to validate an experiment. In my experiment, I have found 71 mutations after a particular treatment. To determine if these mutations are truly due to my treatment, I want to compare them to a set of randomly generated mutations. It was suggested to me that I might try to generate a million sets of 71 random mutations for statistical comparison.

To start with, I have a dataframe with the 7000 genes in the genome of interest. I know their start and end positions. The first five rows of the dataframe look like this:

  transcript_id  protein_id  start   end                               kogClass
0         g2.t1      695054      1  1999  Replication, recombination and repair
1         g3.t1      630170   2000  3056       General function prediction only
2         g5.t1      695056   3057  4087         Signal transduction mechanisms
3         g6.t1      671982   4088  5183                                    N/A
4         g7.t1      671985   5184  8001       Chromatin structure and dynamics

Now about the million sets of 71 random mutations: I have written a function that I call a million times and it seems not to be very efficient because after 4 hours it was only 1/10 of the way through. Here is my code. If anyone can suggest a way to speed things up I would owe you a beer! And my appreciation.

from random import randint

def get_71_random_genes(df, outfile):
    # how many nucleotides are there in all transcripts?
    end_pos_last_gene = df.iloc[-1, 3]
    # this loop will go 71 times
    for i in range(71):
        # generate a number from 1 to the end of all transcripts
        random_number = randint(1, end_pos_last_gene)
        # this is the boolean condition - checks which gene a random number falls within
        mask = (df['start'] <= random_number) & (df['end'] >= random_number)
        # collect the rows that match
        data = df.loc[mask]
        # write data to file.
        data.to_csv(outfile, sep='\t', index=False, header=False)

Accepted Answer

I'm pretty sure that all the following:

for i in range(71):
    # generate a number from 1 to the end of all transcripts
    random_number = randint(1, end_pos_last_gene)
    # this is the boolean condition - checks which gene a random number falls within
    mask = (df['start'] <= random_number) & (df['end'] >= random_number)
    # collect the rows that match
    data = df.loc[mask]
    # write data to file.
    data.to_csv(outfile, sep='\t', index=False, header=False)

does is select 71 random rows from the data-frame, with replacement (nothing stops the same gene from being picked twice). Note, this is taking forever because each time you do

(df['start'] <= random_number) & (df['end'] >= random_number)

You iterate through the entire data-frame three times, and then an additional time when you do:

data = df.loc[mask]

This is an incredibly inefficient way to sample rows. You could do this much more efficiently by randomly sampling 71 indices and then using those indices directly on the data-frame (which would not require even a single full pass over it). But you don't need to do that: pd.DataFrame objects already implement an efficient sample method, so observe:

In [12]: df = pd.DataFrame(np.random.randint(0, 20, (10, 10)), columns=["c%d"%d for d in range(10)])

In [13]: df
Out[13]:
   c0  c1  c2  c3  c4  c5  c6  c7  c8  c9
0  13   0  19   5   6  17   5  14   5  15
1   2   4   0  16  19  11  16   3  11   1
2  18   3   1  18  12   9  13   2  18  12
3   2   6  14  12   1   2  19  16   0  14
4  17   5   6  13   7  15  10  18  13   8
5   7  19  18   3   1  11  14   6  13  16
6  13   5  11   0   2  15   7  11   0   2
7   0  19  11   3  19   3   3   9   8  10
8   6   8   9   3  12  18  19   8  11   2
9   8  17  16   0   8   7  17  11  11   0

In [14]: df.sample(3, replace=True)
Out[14]:
   c0  c1  c2  c3  c4  c5  c6  c7  c8  c9
0  13   0  19   5   6  17   5  14   5  15
3   2   6  14  12   1   2  19  16   0  14
3   2   6  14  12   1   2  19  16   0  14

In [15]: df.sample(3, replace=True)
Out[15]:
   c0  c1  c2  c3  c4  c5  c6  c7  c8  c9
9   8  17  16   0   8   7  17  11  11   0
4  17   5   6  13   7  15  10  18  13   8
2  18   3   1  18  12   9  13   2  18  12

In [16]: df.sample(3, replace=True)
Out[16]:
   c0  c1  c2  c3  c4  c5  c6  c7  c8  c9
3   2   6  14  12   1   2  19  16   0  14
8   6   8   9   3  12  18  19   8  11   2
4  17   5   6  13   7  15  10  18  13   8

So just replace that loop with:

df.sample(71, replace=True).to_csv(outfile, sep='\t', index=False, header=False)

Note, this also cuts down on I/O overhead!

So, just to do a quick test:

In [4]: import time
   ...: start = time.time()
   ...: with open('test.csv', 'w') as f:
   ...:     for _ in range(1000):
   ...:         df.sample(71, replace=True).to_csv(f, header=None, index=False)
   ...: stop = time.time()
   ...:

In [5]: stop - start
Out[5]: 0.789172887802124

So, extrapolating linearly, I'd guesstimate 1,000,000 times would take about:

In [8]: (stop - start) * 1000
Out[8]: 789.172887802124

seconds, so a little over 13 minutes for the full million sets.

In [10]: !wc -l test.csv
71000 test.csv

Edit to add a more valid approach

So, to keep the original position-based sampling (where a gene's chance of being picked is proportional to its length, which plain df.sample does not preserve), create an array that maps each nucleotide position to an index in the data-frame:

import numpy as np

size = df.end.max()
# this could get out of hand without being careful of our size
nucleotide_array = np.zeros(size, dtype=np.int64)

# map every nucleotide position to the index of the gene containing it
for row in df.itertuples():  # might be a little slow, but it's a one-time upfront cost
    i = row.start - 1
    j = row.end
    nucleotide_array[i:j] = row.Index

# sampling scheme:
with open('test.csv', 'w') as f:
    for _ in range(1000):  # however many experiments
        snps = np.random.choice(nucleotide_array, 71, replace=True)
        df.loc[snps].to_csv(f, header=None, index=False)

Note, the above is a quick sketch and hasn't really been tested. It makes a few assumptions, but I think they hold, and in any case you could easily munge your df so that it works.
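If the per-nucleotide array is too large to be comfortable for the genome in question, a sketch of an alternative (not part of the original answer) is to binary-search the sorted gene end positions with np.searchsorted instead of materialising the big lookup array; it assumes, as in the example data, that the genes tile positions 1..end contiguously in ascending order, and gene_ends / total_len are names introduced here:

import numpy as np

gene_ends = df['end'].to_numpy()   # assumed sorted ascending and contiguous
total_len = gene_ends[-1]

with open('test.csv', 'w') as f:
    for _ in range(1000):  # however many experiments
        # draw 71 random nucleotide positions in 1..total_len
        positions = np.random.randint(1, total_len + 1, size=71)
        # leftmost gene whose end >= position is the gene containing that position
        gene_idx = np.searchsorted(gene_ends, positions, side='left')
        df.iloc[gene_idx].to_csv(f, header=None, index=False)

This trades the upfront memory cost of one entry per nucleotide for a logarithmic lookup per draw, while keeping the same length-weighted sampling.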
