Fixing pymongo.errors.DocumentTooLarge: 'aggregate' command document too large
A note up front: this article does not describe a direct quick fix (e.g. adding a parameter or changing a config). It solves the problem by reframing the approach. If that's not what you're after, feel free to skip it!
I hit this problem recently while running aggregate statistics against mongo. After some digging, I learned that MongoDB (and pymongo) enforce a 16MB document size limit. I had only known that a single stored document is capped at 16MB; this time, the list inside the aggregation's $match condition also blew past the limit!
The requirement: there is a MySQL order database and a mongo order database, and the orders stored in mongo are a superset of those in MySQL. The goal is to sum the power consumption and count the orders that exist in mongo but not in MySQL.
The original code:
filter_pipe = {"StartType": 3, "EndTime": {"$lte": end_date, "$gte": start_date}, "OrderID": {"$nin": all_mysql_orders}}
ret = db.get_collection("table_name").aggregate([{"$match": filter_pipe},{"$group": {"_id": {}, "totalPower": {"$sum": "$TotalPower"}, "count": {"$sum": 1}}}])
all_mysql_orders was 24MB, which exceeded the limit and raised:
File "/usr/local/python3/lib/python3.7/site-packages/pymongo/message.py", line 1070, in _raise_document_too_large
raise DocumentTooLarge("%r command document too large" % (operation,))
pymongo.errors.DocumentTooLarge: 'aggregate' command document too large
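For context, you can roughly estimate how large the $nin filter becomes as BSON before ever sending it. This is a back-of-the-envelope sketch with fabricated order IDs, and the ~12 bytes of per-element BSON overhead is an approximation, not an exact figure:

```python
# fabricated 26-character order IDs, roughly the size the post describes
orders = ["ORD%023d" % i for i in range(900_000)]
# each BSON array element carries a type byte, an index key, a length
# prefix and NUL terminators on top of the string itself (~12 bytes here)
approx_mb = sum(len(s) + 12 for s in orders) / (1024 * 1024)
print("filter is roughly %.1f MB" % approx_mb)  # well past the 16MB cap
```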
A colleague suggested: "increase mongo's memory configuration and restart, since the database is memory-based; you'd also need to enlarge the buffer." But this is production data, and a reckless operation like that could bring everything down, so that was off the table!
So I tried a workaround: change the statistics program to find all OrderIDs that satisfy the condition and swap $nin for $in. But after rewriting and rerunning, the $in order list turned out to be even larger than 24MB, so that road was closed too!
Finally I tried paginated aggregation. First, a test method for the paging logic:
def test_method():
    aggregate_num = 5
    diff = [i for i in range(0, 11)]
    print(diff)
    # full pages
    page = len(diff) // aggregate_num
    # leftover items
    left = len(diff) % aggregate_num
    # evenly divisible
    if page and not left:
        print("page: {},left:{}".format(page, 0))
    # not evenly divisible
    else:
        print("page: {},left:{}".format(page, left))
        page += 1
    # walk the pages
    for i in range(page):
        begin_index = i * aggregate_num
        end_index = (i + 1) * aggregate_num
        print(begin_index, end_index, diff[begin_index: end_index])
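Incidentally, the page/left bookkeeping above can be collapsed into a single math.ceil. This is a minimal sketch, not from the original program, and chunks is a hypothetical helper name:

```python
import math

def chunks(seq, size):
    # yield successive size-sized slices; one slice per aggregation page
    for i in range(math.ceil(len(seq) / size)):
        yield seq[i * size:(i + 1) * size]

print(list(chunks(list(range(11)), 5)))  # → [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10]]
```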
The paging logic checked out, so I rewrote the statistics code to aggregate page by page. The modified code:
filter_pipes = {"StartType": 3, "EndTime": {"$lte": end_date, "$gte": start_date}}
ret0 = db.get_collection("table_name").find(filter_pipes, {"OrderID": 1})
# all order IDs in mongo that satisfy the conditions
all_in_mongo_ids = set([order_info['OrderID'] for order_info in ret0])
# 400,000 IDs is about 10.3MB (one order ID is 27 bytes), small enough to aggregate
aggregate_num = 400000
# the order IDs that still need to be aggregated
diff = all_in_mongo_ids.difference(set(all_mysql_ids))
# full pages
page = len(diff) // aggregate_num
# leftover items
left = len(diff) % aggregate_num
# evenly divisible
if page and not left:
    print("page: {},left:{}".format(page, 0))
# not evenly divisible
else:
    print("page: {},left:{}".format(page, left))
    page += 1
diff_list = list(diff)
total_power, order_num = 0, 0
for i in range(page):
    begin_index = i * aggregate_num
    end_index = (i + 1) * aggregate_num
    filter_pipes.update({"OrderID": {"$in": diff_list[begin_index: end_index]}})
    ret = db.get_collection("table_name").aggregate([
        {"$match": filter_pipes},
        {"$group": {"_id": {}, "totalPower": {"$sum": "$TotalPower"}, "count": {"$sum": 1}}}
    ])
    result = [info for info in ret]
    order_num += result[0]['count']
    total_power += result[0]['totalPower']
It ran fast. Done!
Another colleague suggested pulling the MySQL orders out as a dict and testing membership via hashing, which would make the lookups faster. The membership test would indeed be faster, but fetching all of mongo's data (3+ million rows) and looping over it to sum the power would still drag the overall speed down!
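For completeness, the colleague's hash-based idea looks roughly like this. It's a sketch against simulated documents, since the real version would have to stream all 3M+ rows from mongo:

```python
# simulated stand-ins for the mongo cursor and the MySQL order set
docs = [{"OrderID": i, "TotalPower": 10} for i in range(8)]
mysql_ids = {0, 2, 4}

total_power = order_num = 0
for doc in docs:
    if doc["OrderID"] not in mysql_ids:  # O(1) set-membership test
        order_num += 1
        total_power += doc["TotalPower"]
print(order_num, total_power)  # 5 50
```

The per-lookup cost really is constant, but the overall time is dominated by transferring and iterating millions of documents client-side, which is why the paged server-side aggregation won out.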
Note: I hope this article helps. If anything is unclear or wrong, leave a comment and let's discuss!