
MongoDB Aggregation - $lookup performance

I'm using MongoDB 3.6 aggregation with $lookup to join two collections (users and suscriptionusers).

var UserSchema = mongoose.Schema({
  email:{
    type: String,
    trim: true,
    unique: true,
  },
  name: {
    type: String,
    required: true,
    trim: true,
  },
  password: String,
  gender: { type: String, enum: ['male', 'female', 'unknown'], default: 'unknown'},
  age_range: { type: String, enum: [12, 16, 18], default: 18},
  country: {type:String, default:'co'}
});

var SuscriptionUsersSchema = mongoose.Schema({
  user_id: {
    ref: 'Users',
    type: mongoose.Schema.ObjectId
  },
  channel_id: {
    ref: 'Channels',
    type: mongoose.Schema.ObjectId
  },
  subscribed: {type: Boolean, default:false},
  unsubscribed_at: Date,
  subscribed_at: Date
});

My goal is to query suscriptionusers and join the users collection, filtering by a start and end date, to get subscription analytics such as the country, age range, and gender of subscribed users, and to show the data in a line chart. This is what I'm doing:

db.getCollection('suscriptionusers').aggregate([
{$match: {
    'channel_id': ObjectId('......'),
    'subscribed_at': {
            $gte: new Date('2018-01-01'),
            $lte: new Date('2019-01-01'),
    },
    'subscribed': true
}},     
{
    $lookup:{
        from: "users",      
        localField: "user_id", 
        foreignField: "_id",
        as: "users"        
    }
},
/*  Using this form instead of the one above makes the process even slower :(
 {$lookup:
 {
   from: "users",
   let: { user_id: "$user_id" },
   pipeline: [
      { $match:
          { $expr:
             {$eq: [ "$_id",  "$$user_id" ]}
          }
      },
      { $project: { age_range:1, country: 1, gender:1 } }
   ],
   as: "users"
 }
},*/
{$unwind: {
    path: "$users",
    preserveNullAndEmptyArrays: false
}},
{$project: {
    'users.age_range': 1, 
    'users.country': 1, 
    'users.gender': 1, 
    '_id': 1, 
    'subscribed_at': { $dateToString: { format: "%Y-%m", date: "$subscribed_at" } },
    'unsubscribed_at': { $dateToString: { format: "%Y-%m", date: "$unsubscribed_at" } }
}},
])

The main concern is performance. For example, with about 150,000 subscribers the query takes around 7 to 8 seconds to return the data, and I'm worried about what would happen with millions of subscribers: even if I limit the range (for example, retrieving only data between two months), there could still be hundreds of subscribers in that period.

I have already tried creating an index on the user_id field of the suscriptionusers collection, but there was no improvement.

db.getCollection('suscriptionusers').createIndex({user_id: 1});
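
One possible reason the user_id index made no difference: with localField/foreignField, $lookup searches the users collection by _id, which is always indexed, so an index on the local user_id field is not used by the join. The $match stage, on the other hand, filters on channel_id, subscribed and subscribed_at, so a compound index over those fields may help. The following is only a sketch: the helper name ensureAnalyticsIndex is hypothetical, and `collection` is assumed to be any object that exposes createIndex. Whether the index actually helps should be checked with explain() on real data.

```javascript
// Hypothetical index spec: the equality fields first, then the range field.
const analyticsIndex = { channel_id: 1, subscribed: 1, subscribed_at: 1 };

// `collection` is assumed to expose createIndex(keys), as the Node.js driver's
// Collection and the mongo shell's collection objects do.
function ensureAnalyticsIndex(collection) {
  return collection.createIndex(analyticsIndex);
}
```

In the mongo shell the equivalent call would be db.getCollection('suscriptionusers').createIndex({ channel_id: 1, subscribed: 1, subscribed_at: 1 }).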

My question is: should I also store the fields (country, age_range, and gender) in the suscriptionusers collection? If I query without the $lookup on the users collection, the process is fast enough.

Or is there a better way to improve performance with my current schema?

Thanks a lot :)

Edit: Note that a user can be subscribed to multiple channels, which is why the subscription is not stored inside the users collection.

Well, maybe it's not the best method, but I simply copied the fields I need from the UserSchema into the SuscriptionUsersSchema. This is notably faster for analytics. I also realized that an analytics record should not change over time, so that the data stays as it was when it was generated. Stored this way, the data remains unchanged even if the user updates their information or deletes the account. If you have any advice, please feel free to share it :)
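
With the fields stored on the subscription documents, the analytics can be computed without $lookup, and grouping on the server avoids returning one document per subscriber. Below is a minimal sketch of such a pipeline. The function name buildSubscriptionStatsPipeline and the grouping by month are assumptions made for illustration, not part of the original setup.

```javascript
// Builds an aggregation pipeline over the denormalized subscription documents.
// channelId is assumed to already be an ObjectId; start and end are Date objects.
function buildSubscriptionStatsPipeline(channelId, start, end) {
  return [
    // Same filter as the original query.
    { $match: {
        channel_id: channelId,
        subscribed: true,
        subscribed_at: { $gte: start, $lte: end }
    } },
    // Count subscriptions per month, country, gender and age range.
    { $group: {
        _id: {
          month: { $dateToString: { format: '%Y-%m', date: '$subscribed_at' } },
          country: '$country',
          gender: '$gender',
          age_range: '$age_range'
        },
        count: { $sum: 1 }
    } },
    // Order the buckets chronologically for the line chart.
    { $sort: { '_id.month': 1 } }
  ];
}
```

The result can be passed to aggregate(), for example db.getCollection('suscriptionusers').aggregate(buildSubscriptionStatsPipeline(ObjectId('......'), new Date('2018-01-01'), new Date('2019-01-01'))).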

Just for reference, my SuscriptionUsersSchema now looks like:

var SuscriptionUsersSchema = mongoose.Schema({
  user_id: {
    ref: 'Users',
    type: mongoose.Schema.ObjectId
  },
  channel_id: {
    ref: 'Channels',
    type: mongoose.Schema.ObjectId
  },
  subscribed: {type: Boolean, default:false},
  gender: { type: String, enum: ['male', 'female', 'unknown'], default: 'unknown'},
  age_range: { type: String, enum: [12, 16, 18], default: 18},
  country: {type:String, default:'co'},
  unsubscribed_at: Date,
  subscribed_at: Date
});
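
Since the copied fields are meant to be a snapshot taken when the subscription is created, one way to fill them is a small helper like the one below. This is only a sketch: buildSubscriptionSnapshot is a hypothetical name, and `user` is assumed to be a plain object or document that has the UserSchema fields.

```javascript
// Copies the analytics fields from a user into a new subscription document,
// so that later profile changes do not alter past analytics.
function buildSubscriptionSnapshot(user, channelId, now = new Date()) {
  return {
    user_id: user._id,
    channel_id: channelId,
    subscribed: true,
    subscribed_at: now,
    // Snapshot of the user's data at subscription time.
    gender: user.gender,
    age_range: user.age_range,
    country: user.country
  };
}
```

The returned object could then be saved through the Mongoose model built from SuscriptionUsersSchema, for example with the model's create() method.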

