这是我第一次使用MongoDB聚合查询。 我的数据集如下:
{ // doc 1 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_3", "channel_2", "channel_1", "channel_4" ], "msd" : 25, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 2 "_id" : ObjectId("55f2481bc9b4cd123452345f"), "channels" : [ "channel_3", "channel_4" ], "msd" : 50, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 3 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_2" ], "msd" : 100, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 4 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_2" ], "msd" : 80, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" }我已经构建了一个复合索引:
userlog.create_index([('uid', ASCENDING), ('channels', ASCENDING)])现在, 给定一个用户和一个通道数组,我想检索其中至少有一个通道位于查询通道中的msd的平均值。 例如,查询是:
{"uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632", "channels" : ["channel_1", "channel_2"], }doc 1的频道包含“channel_1”和“channel_2”,doc 3和4的频道包含“channels_2”。 所以预期的回报值是(25 + 100 + 80)/ 3 = 68.33
======================试用1 =======================
码:
pipe=[ {"$unwind":'$channels'}, {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': '$channels', 'averageMSD':{'$avg':'$msd'}}} ] for res in db.aggregate(pipeline=pipe): print(res)结果:
{'_id': 'channel_1', 'averageMSD': 25.0} {'_id': 'channel_2', 'averageMSD': 68.33333333333333}似乎“$ unwind”使得doc 1意外地被计算两次。 另外,“$ unwind”非常慢。
======================试用2 =======================
码:
pipe=[ {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': '$channels', 'averageMSD':{'$avg':'$msd'}}} ] for res in db.aggregate(pipeline=pipe): print(res)结果:
{'averageMSD': 90.0, '_id': ['channel_2']} {'averageMSD': 25.0, '_id': ['channel_3', 'channel_2', 'channel_1', 'channel_4']}结果仍然不是我想要的。 似乎我不应该通过“渠道”对结果进行分组。 但我不知道如何解决它。
如何使用聚合有效地查询数据库?
This is my first time use MongoDB aggregation query. The dataset I have is as follow:
{ // doc 1 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_3", "channel_2", "channel_1", "channel_4" ], "msd" : 25, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 2 "_id" : ObjectId("55f2481bc9b4cd123452345f"), "channels" : [ "channel_3", "channel_4" ], "msd" : 50, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 3 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_2" ], "msd" : 100, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 4 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_2" ], "msd" : 80, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" }I have built a compound index:
userlog.create_index([('uid', ASCENDING), ('channels', ASCENDING)])Now, given a user and an array of channels, I want to retrieve the average of msd whose at least one channel is in the query channels. For instance, a query is:
{"uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632", "channels" : ["channel_1", "channel_2"], }The channels of doc 1 contains "channel_1" and "channel_2" and the channels of doc 3 and 4 contains "channels_2". So the expected return value is (25+100+80)/3 = 68.33
====================== TRIAL 1 =======================
CODE:
pipe=[ {"$unwind":'$channels'}, {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': '$channels', 'averageMSD':{'$avg':'$msd'}}} ] for res in db.aggregate(pipeline=pipe): print(res)RESULT:
{'_id': 'channel_1', 'averageMSD': 25.0} {'_id': 'channel_2', 'averageMSD': 68.33333333333333}It seems that "$unwind" makes the doc 1 is unexpectedly counted twice. Also, "$unwind" is quite slow.
====================== TRIAL 2 =======================
CODE:
pipe=[ {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': '$channels', 'averageMSD':{'$avg':'$msd'}}} ] for res in db.aggregate(pipeline=pipe): print(res)RESULT:
{'averageMSD': 90.0, '_id': ['channel_2']} {'averageMSD': 25.0, '_id': ['channel_3', 'channel_2', 'channel_1', 'channel_4']}The result is still not what I want. It seems that I should not group the result by "channels". But I do not how I can fix it.
How can I use aggregation to inquire the DB efficiently?
最满意答案
我想到了:
pipe=[ {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': None, 'averageMSD':{'$avg':'$msd'}}} ]I figured it out:
pipe=[ {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': None, 'averageMSD':{'$avg':'$msd'}}} ]更多推荐
发布评论