Pymongo slow with aggregate on date

Perhaps this is my ignorance showing, but I have a query that appears fast when my time frame is small, yet as soon as I run it with an older date it grinds to a halt. It appears as though matching on a date (or timestamp) field isn't very efficient even though the field is indexed - or I'm just doing it wrong.

Here's the data format:

alarm_data = {
  "alarm_global_id": int,
  "alarm_severity": int,
  "alarm_date": float,
  "created": float,
  "new_status": bool,
  "exp_day_status": False,
  "exp_week_status": False,
  "exp_month_status": False,
  "exp_months_status": False,
  "time_in_alarm": float,
}
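
For context, a representative insert might look like the following. This is only a sketch: the connection string, field values, and the db handle are my own assumptions (the database and collection are both named events, per the explain output further down).

import time
from pymongo import MongoClient

# hypothetical connection details - adjust to the real deployment
db = MongoClient("mongodb://localhost:27017")["events"]

now = time.time()
db.events.insert_one({
    "alarm_global_id": 42,      # hypothetical alarm identifier
    "alarm_severity": 3,
    "alarm_date": now,          # float epoch timestamp
    "created": now,
    "new_status": True,         # picked up by the "new" aggregation below
    "exp_day_status": False,    # flipped to True once counted for the day window
    "exp_week_status": False,
    "exp_month_status": False,
    "exp_months_status": False,
    "time_in_alarm": 12.5,      # hypothetical duration in seconds
})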

I have the following indexes:

db.events.create_index("alarm_global_id", name="alarm_global_id")
db.events.create_index([("new_status", ASCENDING)], name="new")
db.events.create_index([("alarm_date",DESCENDING), ("exp_day_status",DESCENDING)], name="exp_day")
db.events.create_index([("alarm_date",DESCENDING), ("exp_week_status",DESCENDING)], name="exp_week")
db.events.create_index([("alarm_date",DESCENDING), ("exp_month_status",DESCENDING)], name="exp_month")
db.events.create_index([("alarm_date",DESCENDING), ("exp_months_status",DESCENDING)], name="exp_months")

The alarm_date field is a Unix timestamp, so effectively a float. Perhaps a real datetime object would index and sort better?
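
(For what it's worth, PyMongo will store a timezone-aware Python datetime as a native BSON date, which can be indexed and range-matched the same way as the float. Whether it would actually behave any differently here is untested; this is just a sketch of how the field and the expiry cutoff could be expressed instead:)

from datetime import datetime, timezone, timedelta

# storing alarm_date as a timezone-aware datetime makes PyMongo save it
# as a BSON date rather than a double
alarm_date = datetime.now(timezone.utc)

# the expiry match then compares against a datetime cutoff instead of a float
day_cutoff = datetime.now(timezone.utc) - timedelta(days=1)
day_match = {"exp_day_status": False, "alarm_date": {"$lt": day_cutoff}}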

In any case, the idea is to use this to calculate long-term aggregates roughly once per minute without performing a giant full collection scan. The approach is to treat all new data as an increment and all data that has passed its expiry time as a decrement; the sum of the two is the net change over the last n seconds. The pipeline for handling new data is:

db.events.aggregate([
  { "$match": {"new_status":True}},
  { "$group": {"_id": "$alarm_global_id", "time":{"$sum":"$time_in_alarm"}, "count": {"$sum":1} } }
])

Then we set new_status to False so those documents aren't found next time.
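
A minimal sketch of that follow-up update (the exact $set shape is an assumption; as noted further down, the update_many() re-uses the same filter as the aggregation):

# flip the flag on everything the "new" aggregation just counted
db.events.update_many(
    {"new_status": True},
    {"$set": {"new_status": False}},
)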

For calculating the day aggregates, we simply match the documents whose exp_day_status is still False and whose alarm_date is less than now minus the expiry period:

db.events.aggregate([
  { "$match": {"exp_day_status":False, "alarm_date":{"$lt":time.time()-(60*60*24)}} },
  { "$group": {"_id": "$alarm_global_id", "time":{"$sum":"$time_in_alarm"}, "count": {"$sum":1} } }
])

Again, we update exp_day_status to True when done so the document isn't included in that calculation again. The same process repeats for the week, month, and months versions, just with longer expiry times.
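
Again a sketch, assuming the update simply re-uses the same filter as the aggregation (which is what the update_many() in my current setup does):

import time

DAY = 60 * 60 * 24

# mark the day-expired documents so the next pass skips them
db.events.update_many(
    {"exp_day_status": False, "alarm_date": {"$lt": time.time() - DAY}},
    {"$set": {"exp_day_status": True}},
)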

When I ran a test writing 60 documents/s with the ranges set to 10s, 30s, 60s, and 120s (instead of the day, week, month, and months values), the whole calculate-and-update cycle was very quick, ~20-30ms, and there was no apparent slowdown even as the collection reached 3.5M documents. But as soon as I changed the windows to 1min, 5min, 10min, and 20min, things fell apart. Interestingly, at least to me, the calc time for each window stayed small until the elapsed time ticked over that window's threshold, and then that particular calculation became very slow.

Here's the calculation times for each stage:

  DB SIZE:  5237
  --------------------
  calculation times (ms):
  new:      3.04
  day:      3.41
  week:     1.00
  month:    1.00
  months:   0.96
  update:   13.05
  total:    24.05


  
  DB SIZE:  28590
  --------------------
  calculation times (ms):
  new:      4.00
  day:      46.02
  week:     39.00
  month:    39.00
  months:   39.01
  update:   203.00
  total:    370.03

Note here that the result for every calculation from day through months is 0 - there are no new results and no documents meet the criteria - so why is it so slow? Yet if I keep the times at 10s, 20s, 30s, and 60s, it stays fast.

As soon as 1 minute ticks over, the day calculation jumps to ~30ms. The same happens when we tick past the week period - its calculation time jumps to ~60ms. If I push the windows out to an hour, the times become so large that performing this once a second ends up taking longer than a second. The interesting thing for me is that each calculation stays very quick (1-2ms) until its window ticks over, and then it suddenly behaves as if it were scanning the whole collection - if the query could accurately limit the number of documents read, it would surely stay nicely performant.

I do understand that query times grow as the collection grows, but I (perhaps naively) assumed that a tight query returning only ~50 results wouldn't blow out to many seconds for a 30-minute window, since the index should quickly reject anything newer than the expiry time or anything without the appropriate boolean on the expired field.

If this is purely expected behavior for this set up, please let me know that I'm just asking too much of any system to perform this task.

Update: Here's the output of the explain method on the aggregation:

{'explainVersion': '1',
 'stages': [{'$cursor': {'queryPlanner': {'namespace': 'events.events',
     'indexFilterSet': False,
     'parsedQuery': {'$and': [{'exp_week_status': {'$eq': False}},
       {'alarm_date': {'$lt': 1636715435.6099443}}]},
     'queryHash': '6B9D5528',
     'planCacheKey': '21EBBA73',
     'maxIndexedOrSolutionsReached': False,
     'maxIndexedAndSolutionsReached': False,
     'maxScansToExplodeReached': False,
     'winningPlan': {'stage': 'PROJECTION_SIMPLE',
      'transformBy': {'alarm_global_id': 1, 'time_in_alarm': 1, '_id': 0},
      'inputStage': {'stage': 'FETCH',
       'filter': {'exp_week_status': {'$eq': False}},
       'inputStage': {'stage': 'IXSCAN',
        'keyPattern': {'alarm_date': 1, 'exp_day_status': -1},
        'indexName': 'exp_day',
        'isMultiKey': False,
        'multiKeyPaths': {'alarm_date': [], 'exp_day_status': []},
        'isUnique': False,
        'isSparse': False,
        'isPartial': False,
        'indexVersion': 2,
        'direction': 'forward',
        'indexBounds': {'alarm_date': ['[-inf.0, 1636715435.609944)'],
         'exp_day_status': ['[MaxKey, MinKey]']}}}},
     'rejectedPlans': [{'stage': 'PROJECTION_SIMPLE',
       'transformBy': {'alarm_global_id': 1, 'time_in_alarm': 1, '_id': 0},
       'inputStage': {'stage': 'FETCH',
        'inputStage': {'stage': 'IXSCAN',
         'keyPattern': {'alarm_date': 1, 'exp_week_status': -1},
         'indexName': 'exp_week',
         'isMultiKey': False,
         'multiKeyPaths': {'alarm_date': [], 'exp_week_status': []},
         'isUnique': False,
         'isSparse': False,
         'isPartial': False,
         'indexVersion': 2,
         'direction': 'forward',
         'indexBounds': {'alarm_date': ['[-inf.0, 1636715435.609944)'],
          'exp_week_status': ['[false, false]']}}}},
      {'stage': 'PROJECTION_SIMPLE',
       'transformBy': {'alarm_global_id': 1, 'time_in_alarm': 1, '_id': 0},
       'inputStage': {'stage': 'FETCH',
        'filter': {'exp_week_status': {'$eq': False}},
        'inputStage': {'stage': 'IXSCAN',
         'keyPattern': {'alarm_date': 1, 'exp_month_status': -1},
         'indexName': 'exp_month',
         'isMultiKey': False,
         'multiKeyPaths': {'alarm_date': [], 'exp_month_status': []},
         'isUnique': False,
         'isSparse': False,
         'isPartial': False,
         'indexVersion': 2,
         'direction': 'forward',
         'indexBounds': {'alarm_date': ['[-inf.0, 1636715435.609944)'],
          'exp_month_status': ['[MaxKey, MinKey]']}}}},
      {'stage': 'PROJECTION_SIMPLE',
       'transformBy': {'alarm_global_id': 1, 'time_in_alarm': 1, '_id': 0},
       'inputStage': {'stage': 'FETCH',
        'filter': {'exp_week_status': {'$eq': False}},
        'inputStage': {'stage': 'IXSCAN',
         'keyPattern': {'alarm_date': 1, 'exp_months_status': -1},
         'indexName': 'exp_months',
         'isMultiKey': False,
         'multiKeyPaths': {'alarm_date': [], 'exp_months_status': []},
         'isUnique': False,
         'isSparse': False,
         'isPartial': False,
         'indexVersion': 2,
         'direction': 'forward',
         'indexBounds': {'alarm_date': ['[-inf.0, 1636715435.609944)'],
          'exp_months_status': ['[MaxKey, MinKey]']}}}}]}}},
  {'$group': {'_id': '$alarm_global_id',
    'time': {'$sum': '$time_in_alarm'},
    'count': {'$sum': {'$const': 1}}}}],
 'serverInfo': {'host': 'PC0V4SFH',
  'port': 27017,
  'version': '5.0.3',
  'gitVersion': '657fea5a61a74d7a79df7aff8e4bcf0bc742b748'},
 'serverParameters': {'internalQueryFacetBufferSizeBytes': 104857600,
  'internalQueryFacetMaxOutputDocSizeBytes': 104857600,
  'internalLookupStageIntermediateDocumentMaxSizeBytes': 104857600,
  'internalDocumentSourceGroupMaxMemoryBytes': 104857600,
  'internalQueryMaxBlockingSortMemoryUsageBytes': 104857600,
  'internalQueryProhibitBlockingMergeOnMongoS': 0,
  'internalQueryMaxAddToSetBytes': 104857600,
  'internalDocumentSourceSetWindowFieldsMaxMemoryBytes': 104857600},
 'command': {'aggregate': 'events',
  'pipeline': [{'$match': {'alarm_date': {'$lt': 1636715435.6099443},
     'exp_week_status': False}},
   {'$group': {'_id': '$alarm_global_id',
     'time': {'$sum': '$time_in_alarm'},
     'count': {'$sum': 1}}}],
  'explain': True,
  'lsid': {'id': UUID('9fbae31f-dbc5-4b52-85f2-5bc7eba82bfc')},
  '$db': 'events',
  '$readPreference': {'mode': 'primaryPreferred'}},
 'ok': 1.0}
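
(For reference, explain output like the above can be produced from PyMongo by running the aggregation as a raw command with explain enabled; the exact invocation below is a sketch rather than the precise call I used:)

import time

pipeline = [
    {"$match": {"exp_week_status": False,
                "alarm_date": {"$lt": time.time() - (60 * 60 * 24 * 7)}}},
    {"$group": {"_id": "$alarm_global_id",
                "time": {"$sum": "$time_in_alarm"},
                "count": {"$sum": 1}}},
]

# explain=True makes the server return the query plan instead of the results
plan = db.command("aggregate", "events", pipeline=pipeline, explain=True)
print(plan["stages"][0]["$cursor"]["queryPlanner"]["winningPlan"])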

So the solution was strange to me (there must be a reason for it): the order of the fields in the compound index matters a great deal. The new index puts the status flag first and the $lt date second:

db.events.create_index([("exp_day_status",DESCENDING), ("alarm_date",DESCENDING)], name="exp_day")], name="exp_day")

Simply changing the order so that exp_day_status comes first and alarm_date second changed everything. We're nearly 6 hours in now with more than 1.2M documents, and there's been no change in the calculation times since we started (apart from the tiny increase in each window as it comes of age):


  DB SIZE:  1244217
  --------------------
  FIELD    TIME(ms)    COUNT
  new:      2.01        65
  30mins:   1.05        64
  1hr:      1.46        64
  2hr:      0.37        63
  months:   1.01        0
  update:   14.03       256
  total:    19.93
  --------------------
  avg:      29.27

The follow-up question is whether there's a better way to structure the whole process. We currently run the aggregate method to get the sums, then perform an update_many(), which runs the same query again. The new question is whether performing a basic find() first and then using the _ids from that result to drive an update_one() or findOneAndUpdate() would be quicker. Or is there a way I haven't come across in Mongo to pipe the output of a find directly into an update?
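
A sketch of the _id-based variation I have in mind (not a recommendation, just the shape of it; whether it beats repeating the filter in update_many() is exactly the open question):

import time

DAY = 60 * 60 * 24
day_filter = {"exp_day_status": False, "alarm_date": {"$lt": time.time() - DAY}}

# one pass: group for the sums and collect the matching _ids at the same time
groups = list(db.events.aggregate([
    {"$match": day_filter},
    {"$group": {"_id": "$alarm_global_id",
                "time": {"$sum": "$time_in_alarm"},
                "count": {"$sum": 1},
                "ids": {"$push": "$_id"}}},
]))

# flip the flag by _id instead of re-running the date/flag filter
matched_ids = [doc_id for g in groups for doc_id in g["ids"]]
if matched_ids:
    db.events.update_many({"_id": {"$in": matched_ids}},
                          {"$set": {"exp_day_status": True}})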
