Improve performance fetching data from MongoDB

earnings = self.collection.find({})  # returns 60k documents
----
data_dic = {'score': [], "reading_time": []}
for earning in earnings:
    data_dic['reading_time'].append(earning["reading_time"])
    data_dic['score'].append(earning["score"])
----
df = pd.DataFrame()
df['reading_time'] = data_dic["reading_time"]
df['score'] = data_dic["score"]

The code between the ---- markers takes 4 seconds to complete. How can I improve this function?

Best answer

The total time consists of several parts: the MongoDB query time, the time spent transferring the data, network round trips, and the Python list operations. You can optimize each of them.

First, reduce the amount of data to transfer. Since you only need reading_time and score, fetch only those fields with a projection. If your average document size is large, this approach is very effective.

earnings = self.collection.find({}, {'reading_time': True, 'score': True})
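As a small additional tweak (not mentioned in the original answer), the projection can also exclude the automatically returned _id field, which shrinks each transferred document a bit further, assuming the ids are not needed downstream:

earnings = self.collection.find(
    {},
    {'reading_time': True, 'score': True, '_id': False}  # also drop _id; assumes ids are not needed
)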

Second, MongoDB transfers data in batches of limited size. With roughly 60k documents it takes several round trips to transfer everything, so increasing the cursor batch size (cursor.batchSize in the shell, Cursor.batch_size() in PyMongo) reduces the number of round trips, as sketched below.
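A minimal PyMongo sketch of this, assuming the same projection as above and an illustrative batch size of 10000 (the right value depends on your document size):

earnings = self.collection.find(
    {}, {'reading_time': True, 'score': True}
).batch_size(10000)  # fewer, larger batches -> fewer network round trips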

Third, increase the network bandwidth if you can.

Fourth, you can speed this up with NumPy arrays. A NumPy array is a C-like array data structure that is faster than a Python list. Pre-allocate fixed-length arrays and assign values by index; this avoids the internal resizing that list.append triggers as the list grows.

import numpy as np
import pandas as pd

count = earnings.count()  # Cursor.count() exists in PyMongo 3.x; on PyMongo 4+ use collection.count_documents({})
score = np.empty((count,), dtype=float)
reading_time = np.empty((count,), dtype='datetime64[us]')

for i, earning in enumerate(earnings):
    score[i] = earning["score"]
    reading_time[i] = earning["reading_time"]

df = pd.DataFrame()
df['reading_time'] = reading_time
df['score'] = score
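Putting the pieces together, here is a sketch of the whole function under the assumptions above (the load_scores name is made up, collection stands for the question's self.collection, and sizing the arrays with count_documents assumes the collection is not modified between counting and scanning):

import numpy as np
import pandas as pd

def load_scores(collection):
    # Project only the needed fields and use larger batches to cut round trips.
    earnings = collection.find(
        {}, {'reading_time': True, 'score': True, '_id': False}
    ).batch_size(10000)

    # Pre-allocate NumPy arrays instead of appending to Python lists.
    count = collection.count_documents({})
    score = np.empty((count,), dtype=float)
    reading_time = np.empty((count,), dtype='datetime64[us]')

    for i, earning in enumerate(earnings):
        score[i] = earning["score"]
        reading_time[i] = earning["reading_time"]

    return pd.DataFrame({'reading_time': reading_time, 'score': score})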
