如何使用Pymongo Aggregation查询此数据集？(How can I use Pymongo Aggregation to inquiry this dataset?)

这是我第一次使用MongoDB聚合查询。我的数据集如下：

{ // doc 1 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_3", "channel_2", "channel_1", "channel_4" ], "msd" : 25, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 2 "_id" : ObjectId("55f2481bc9b4cd123452345f"), "channels" : [ "channel_3", "channel_4" ], "msd" : 50, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 3 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_2" ], "msd" : 100, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 4 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_2" ], "msd" : 80, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" }

我已经构建了一个复合索引：

userlog.create_index([('uid', ASCENDING), ('channels', ASCENDING)])

现在， 给定一个用户和一个通道数组，我想检索其中至少有一个通道位于查询通道中的msd的平均值。 例如，查询是：

{"uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632", "channels" : ["channel_1", "channel_2"], }

doc 1的频道包含“channel_1”和“channel_2”，doc 3和4的频道包含“channels_2”。所以预期的回报值是（25 + 100 + 80）/ 3 = 68.33

======================试用1 =======================

码：

pipe=[ {"$unwind":'$channels'}, {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': '$channels', 'averageMSD':{'$avg':'$msd'}}} ] for res in db.aggregate(pipeline=pipe): print(res)

结果：

{'_id': 'channel_1', 'averageMSD': 25.0} {'_id': 'channel_2', 'averageMSD': 68.33333333333333}

似乎“$ unwind”使得doc 1意外地被计算两次。另外，“$ unwind”非常慢。

======================试用2 =======================

码：

pipe=[ {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': '$channels', 'averageMSD':{'$avg':'$msd'}}} ] for res in db.aggregate(pipeline=pipe): print(res)

结果：

{'averageMSD': 90.0, '_id': ['channel_2']} {'averageMSD': 25.0, '_id': ['channel_3', 'channel_2', 'channel_1', 'channel_4']}

结果仍然不是我想要的。似乎我不应该通过“渠道”对结果进行分组。但我不知道如何解决它。

如何使用聚合有效地查询数据库？

This is my first time use MongoDB aggregation query. The dataset I have is as follow:

{ // doc 1 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_3", "channel_2", "channel_1", "channel_4" ], "msd" : 25, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 2 "_id" : ObjectId("55f2481bc9b4cd123452345f"), "channels" : [ "channel_3", "channel_4" ], "msd" : 50, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 3 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_2" ], "msd" : 100, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" } { // doc 4 "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"), "channels" : [ "channel_2" ], "msd" : 80, "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632" }

I have built a compound index:

userlog.create_index([('uid', ASCENDING), ('channels', ASCENDING)])

Now, given a user and an array of channels, I want to retrieve the average of msd whose at least one channel is in the query channels. For instance, a query is:

{"uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632", "channels" : ["channel_1", "channel_2"], }

The channels of doc 1 contains "channel_1" and "channel_2" and the channels of doc 3 and 4 contains "channels_2". So the expected return value is (25+100+80)/3 = 68.33

====================== TRIAL 1 =======================

CODE:

pipe=[ {"$unwind":'$channels'}, {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': '$channels', 'averageMSD':{'$avg':'$msd'}}} ] for res in db.aggregate(pipeline=pipe): print(res)

RESULT:

{'_id': 'channel_1', 'averageMSD': 25.0} {'_id': 'channel_2', 'averageMSD': 68.33333333333333}

It seems that "$unwind" makes the doc 1 is unexpectedly counted twice. Also, "$unwind" is quite slow.

====================== TRIAL 2 =======================

CODE:

pipe=[ {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': '$channels', 'averageMSD':{'$avg':'$msd'}}} ] for res in db.aggregate(pipeline=pipe): print(res)

RESULT:

{'averageMSD': 90.0, '_id': ['channel_2']} {'averageMSD': 25.0, '_id': ['channel_3', 'channel_2', 'channel_1', 'channel_4']}

The result is still not what I want. It seems that I should not group the result by "channels". But I do not how I can fix it.

How can I use aggregation to inquire the DB efficiently?

最满意答案

我想到了：

pipe=[ {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': None, 'averageMSD':{'$avg':'$msd'}}} ]

I figured it out:

pipe=[ {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}}, {"$group":{'_id': None, 'averageMSD':{'$avg':'$msd'}}} ]

更多推荐

如何使用Pymongo Aggregation查询此数据集？(How can I use Pymongo Aggregation to inquiry this dataset?)

最满意答案

发布评论取消回复

最近发表

热门文章

标签列表