简体   繁体   中英

Perform Aggregation/Set intersection on MongoDB

I have a query, consider the following example as a intermediate data after performing some aggregation on a sample dataset;

fileid field contains the id of a file, and the user array containing array of users, who made some changes to the respective file

{
   “_id” : {  “fileid” : 12  },
   “_user” : [ “a”,”b”,”c”,”d” ]
}
{
   “_id” : {  “fileid” : 13  },
   “_user” : [ “f”,”e”,”a”,”b” ]
}
{
   “_id” : {  “fileid” : 14  },
   “_user” : [ “g”,”h”,”m”,”n” ]
}
{
   “_id” : {  “fileid” : 15  },
   “_user” : [ “o”,”r”,”s”,”v” ]
}
{
   “_id” : {  “fileid” : 16  },
   “_user” : [ “x”,”y”,”z”,”a” ]
}
{
   “_id” : {  “fileid” : 17  },
   “_user” : [ “g”,”r”,”s”,”n” ]
}

I need to find solution for this -> any two users that did some changes to atleast two of the same file. So the output-result should be

{
   “_id” : {  “fileid” : [12,13]  },
   “_user” : [ “a”,”b”]
}
{
   “_id” : {  "fileid” : [14,17]  },
   “_user” : [ “g”,”n” ]
}
{
   “_id” : {  "fileid” : [15,17]  },
   “_user” : [ “r”,”s” ]
}

Your inputs are highly appreciated.

This is a somewhat involved solution. The idea is to first use the DB to get the population of possible pairs, then turn around and ask the DB to find the pairs in the _user field. Beware that 1000s of users will create a pretty darn big pairing list. We use $addFields just in case there's more to the input records than we see in the example, but if not, for efficiency replace with $project to cut down the amount of material flowing through the pipe.

//
// Stage 1:  Get unique set of username pairs.
//
c=db.foo.aggregate([
{$unwind: "$_user"}

// Create single deduped list of users:
,{$group: {_id:null, u: {$addToSet: "$_user"} }}

// Nice little double map here creates the pairs, effectively doing this:
//    for index in range(0, len(list)):
//      first = list[index]
//      for p2 in range(index+1, len(list)):
//        pairs.append([first,list[p2]])
// 
,{$addFields: {u: 
  {$map: {
    input: {$range:[0,{$size:"$u"}]},
    as: "z",
    in: {
        $map: {
            input: {$range:[{$add:[1,"$$z"]},{$size:"$u"}]},
            as: "z2",
            in: [
            {$arrayElemAt:["$u","$$z"]},
            {$arrayElemAt:["$u","$$z2"]}
            ]
        }
    }
    }}
}}

// Turn the array of array of pairs in to a nice single array of pairs:
,{$addFields: {u: {$reduce:{
        input: "$u",
        initialValue:[],
        in:{$concatArrays: [ "$$value", "$$this"]}
        }}
    }}
          ]);


// Stage 2:  Find pairs and tally up the fileids

doc = c.next(); // Get single output from Stage 1 above.                       

u = doc['u'];

c2=db.foo.aggregate([
{$addFields: {_x: {$map: {
                input: u,
                as: "z",
                in: {
                    n: "$$z",
                    q: {$setIsSubset: [ "$$z", "$_user" ]}
                }
            }
        }
    }}
,{$unwind: "$_x"}
,{$match: {"_x.q": true}}
//  Nice use of grouping by an ARRAY here:
,{$group: {_id: "$_x.n", v: {$push: "$_id.fileid"}, n: {$sum:1} }}
,{$match: {"n": {"$gt":1}}}
                     ]);

show(c2);

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM