简体   繁体   English

在MongoDB上执行聚合/设置交集

[英]Perform Aggregation/Set intersection on MongoDB

I have a query, consider the following example as a intermediate data after performing some aggregation on a sample dataset; 我有一个查询,对示例数据集进行一些汇总后,将以下示例视为中间数据;

fileid field contains the id of a file, and the user array containing array of users, who made some changes to the respective file fileid字段包含文件的ID,user数组包含用户的用户数组,这些用户对相应文件进行了一些更改

{
   “_id” : {  “fileid” : 12  },
   “_user” : [ “a”,”b”,”c”,”d” ]
}
{
   “_id” : {  “fileid” : 13  },
   “_user” : [ “f”,”e”,”a”,”b” ]
}
{
   “_id” : {  “fileid” : 14  },
   “_user” : [ “g”,”h”,”m”,”n” ]
}
{
   “_id” : {  “fileid” : 15  },
   “_user” : [ “o”,”r”,”s”,”v” ]
}
{
   “_id” : {  “fileid” : 16  },
   “_user” : [ “x”,”y”,”z”,”a” ]
}
{
   “_id” : {  “fileid” : 17  },
   “_user” : [ “g”,”r”,”s”,”n” ]
}

I need to find solution for this -> any two users that did some changes to atleast two of the same file. 我需要为此找到解决方案->任何两个对同一文件中的至少两个进行了一些更改的用户。 So the output-result should be 所以输出结果应该是

{
   “_id” : {  “fileid” : [12,13]  },
   “_user” : [ “a”,”b”]
}
{
   “_id” : {  "fileid” : [14,17]  },
   “_user” : [ “g”,”n” ]
}
{
   “_id” : {  "fileid” : [15,17]  },
   “_user” : [ “r”,”s” ]
}

Your inputs are highly appreciated. 非常感谢您的投入。

This is a somewhat involved solution. 这是一个有点涉及的解决方案。 The idea is to first use the DB to get the population of possible pairs, then turn around and ask the DB to find the pairs in the _user field. 想法是首先使用数据库来获取可能的对的总数,然后转过身,要求数据库在_user字段中找到对。 Beware that 1000s of users will create a pretty darn big pairing list. 请注意,成千上万的用户将创建一个相当大的配对列表。 We use $addFields just in case there's more to the input records than we see in the example, but if not, for efficiency replace with $project to cut down the amount of material flowing through the pipe. 我们使用$addFields只是为了防止输入记录比示例中看到的更多,但是如果没有,为了提高效率,请替换为$project以减少流经管道的材料量。

//
// Stage 1:  Get unique set of username pairs.
//
c=db.foo.aggregate([
{$unwind: "$_user"}

// Create single deduped list of users:
,{$group: {_id:null, u: {$addToSet: "$_user"} }}

// Nice little double map here creates the pairs, effectively doing this:
//    for index in range(0, len(list)):
//      first = list[index]
//      for p2 in range(index+1, len(list)):
//        pairs.append([first,list[p2]])
// 
,{$addFields: {u: 
  {$map: {
    input: {$range:[0,{$size:"$u"}]},
    as: "z",
    in: {
        $map: {
            input: {$range:[{$add:[1,"$$z"]},{$size:"$u"}]},
            as: "z2",
            in: [
            {$arrayElemAt:["$u","$$z"]},
            {$arrayElemAt:["$u","$$z2"]}
            ]
        }
    }
    }}
}}

// Turn the array of array of pairs in to a nice single array of pairs:
,{$addFields: {u: {$reduce:{
        input: "$u",
        initialValue:[],
        in:{$concatArrays: [ "$$value", "$$this"]}
        }}
    }}
          ]);


// Stage 2:  Find pairs and tally up the fileids

doc = c.next(); // Get single output from Stage 1 above.                       

u = doc['u'];

c2=db.foo.aggregate([
{$addFields: {_x: {$map: {
                input: u,
                as: "z",
                in: {
                    n: "$$z",
                    q: {$setIsSubset: [ "$$z", "$_user" ]}
                }
            }
        }
    }}
,{$unwind: "$_x"}
,{$match: {"_x.q": true}}
//  Nice use of grouping by an ARRAY here:
,{$group: {_id: "$_x.n", v: {$push: "$_id.fileid"}, n: {$sum:1} }}
,{$match: {"n": {"$gt":1}}}
                     ]);

show(c2);

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM