Fixing pymongo.errors.DocumentTooLarge: 'aggregate' command document too large
A note up front: this article does not describe a direct quick fix (e.g. adding a parameter or changing a config). It solves the problem by reframing the approach. If that's not what you're after, feel free to skip it!
I hit this problem recently while running aggregate statistics against mongo. After some digging, I learned that MongoDB (and pymongo) enforce a 16MB document size limit. I had only known that a single stored document is capped at 16MB; this time, the list inside the aggregation's $match condition also blew past the limit!
The requirement: there is a MySQL order database and a mongo order database, and the orders stored in mongo are a superset of those in MySQL. The goal is to sum the power consumption and count the orders that exist in mongo but not in MySQL.
The original code:
filter_pipe = {"StartType": 3, "EndTime": {"$lte": end_date, "$gte": start_date}, "OrderID": {"$nin": all_mysql_orders}}
ret = db.get_collection("table_name").aggregate([{"$match": filter_pipe},{"$group": {"_id": {}, "totalPower": {"$sum": "$TotalPower"}, "count": {"$sum": 1}}}])
all_mysql_orders was 24MB, which exceeded the limit and raised:
File "/usr/local/python3/lib/python3.7/site-packages/pymongo/message.py", line 1070, in _raise_document_too_large
raise DocumentTooLarge("%r command document too large" % (operation,))
pymongo.errors.DocumentTooLarge: 'aggregate' command document too large
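For context, you can roughly estimate how large the $nin filter becomes as BSON before ever sending it. This is a back-of-the-envelope sketch with fabricated order IDs, and the ~12 bytes of per-element BSON overhead is an approximation, not an exact figure:

```python
# fabricated 26-character order IDs, roughly the size the post describes
orders = ["ORD%023d" % i for i in range(900_000)]
# each BSON array element carries a type byte, an index key, a length
# prefix and NUL terminators on top of the string itself (~12 bytes here)
approx_mb = sum(len(s) + 12 for s in orders) / (1024 * 1024)
print("filter is roughly %.1f MB" % approx_mb)  # well past the 16MB cap
```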
A colleague suggested: "increase mongo's memory configuration and restart, since the database is memory-based; you'd also need to enlarge the buffer." But this is production data, and a reckless operation like that could bring everything down, so that was off the table!
So I tried a workaround: change the statistics program to find all OrderIDs that satisfy the condition and swap $nin for $in. But after rewriting and rerunning, the $in order list turned out to be even larger than 24MB, so that road was closed too!
Finally I tried paginated aggregation. First, a test method for the paging logic:
def test_method():
    aggregate_num = 5
    diff = [i for i in range(0, 11)]
    print(diff)
    # full pages
    page = len(diff) // aggregate_num
    # leftover items
    left = len(diff) % aggregate_num
    # evenly divisible
    if page and not left:
        print("page: {},left:{}".format(page, 0))
    # not evenly divisible
    else:
        print("page: {},left:{}".format(page, left))
        page += 1
    # walk the pages
    for i in range(page):
        begin_index = i * aggregate_num
        end_index = (i + 1) * aggregate_num
        print(begin_index, end_index, diff[begin_index: end_index])
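Incidentally, the page/left bookkeeping above can be collapsed into a single math.ceil. This is a minimal sketch, not from the original program, and chunks is a hypothetical helper name:

```python
import math

def chunks(seq, size):
    # yield successive size-sized slices; one slice per aggregation page
    for i in range(math.ceil(len(seq) / size)):
        yield seq[i * size:(i + 1) * size]

print(list(chunks(list(range(11)), 5)))  # → [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10]]
```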
The paging logic checked out, so I rewrote the statistics code to aggregate page by page. The modified code:
filter_pipes = {"StartType": 3, "EndTime": {"$lte": end_date, "$gte": start_date}}
ret0 = db.get_collection("table_name").find(filter_pipes, {"OrderID": 1})
# all order IDs in mongo that satisfy the conditions
all_in_mongo_ids = set([order_info['OrderID'] for order_info in ret0])
# 400,000 IDs is about 10.3MB (one order ID is 27 bytes), small enough to aggregate
aggregate_num = 400000
# the order IDs that still need to be aggregated
diff = all_in_mongo_ids.difference(set(all_mysql_ids))
# full pages
page = len(diff) // aggregate_num
# leftover items
left = len(diff) % aggregate_num
# evenly divisible
if page and not left:
    print("page: {},left:{}".format(page, 0))
# not evenly divisible
else:
    print("page: {},left:{}".format(page, left))
    page += 1
diff_list = list(diff)
total_power, order_num = 0, 0
for i in range(page):
    begin_index = i * aggregate_num
    end_index = (i + 1) * aggregate_num
    filter_pipes.update({"OrderID": {"$in": diff_list[begin_index: end_index]}})
    ret = db.get_collection("table_name").aggregate([
        {"$match": filter_pipes},
        {"$group": {"_id": {}, "totalPower": {"$sum": "$TotalPower"}, "count": {"$sum": 1}}}
    ])
    result = [info for info in ret]
    order_num += result[0]['count']
    total_power += result[0]['totalPower']
It ran fast. Done!
Another colleague suggested pulling the MySQL orders out as a dict and testing membership via hashing, which would make the lookups faster. The membership test would indeed be faster, but fetching all of mongo's data (3+ million rows) and looping over it to sum the power would still drag the overall speed down!
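For completeness, the colleague's hash-based idea looks roughly like this. It's a sketch against simulated documents, since the real version would have to stream all 3M+ rows from mongo:

```python
# simulated stand-ins for the mongo cursor and the MySQL order set
docs = [{"OrderID": i, "TotalPower": 10} for i in range(8)]
mysql_ids = {0, 2, 4}

total_power = order_num = 0
for doc in docs:
    if doc["OrderID"] not in mysql_ids:  # O(1) set-membership test
        order_num += 1
        total_power += doc["TotalPower"]
print(order_num, total_power)  # 5 50
```

The per-lookup cost really is constant, but the overall time is dominated by transferring and iterating millions of documents client-side, which is why the paged server-side aggregation won out.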
Note: I hope this article helps. If anything is unclear or wrong, leave a comment and let's discuss!