
filter numpy array of datetimes by frequency of occurrence

I have an array of over 2 million records; each record has a 10-minute-resolution timestamp in datetime.datetime format, as well as several other values in other columns.

I only want to retain the records whose timestamps occur 20 or more times in the array. What's the fastest way to do this? I've got plenty of RAM, so I'm looking for processing speed.

I've tried [].count() in a list comprehension, but started to lose the will to live waiting for it to finish. I've also tried numpy.bincount(), but tragically it doesn't like datetime.datetime.

Any suggestions would be much appreciated. Thanks!

I'm editing this to include the timings using np.unique, based on the suggestion below. This is by far the best solution.

In [10]: import pandas as pd
         import numpy as np
         from collections import Counter

         # create a fake data set
         dates = pd.date_range("2012-01-01", "2015-01-01", freq="10min")
         dates = np.random.choice(dates, 2000000, replace=True)

Based on the suggestion below, the following is the fastest by far:

In [32]: %%timeit
         values, counts = np.unique(dates, return_counts=True)
         filtered_dates = values[counts>20]
         10 loops, best of 3: 150 ms per loop

Using Counter, you can create a dictionary of the counts of each item and then convert it to a pd.Series in order to do the filtering:

In [11]: %%timeit
         foo = pd.Series(Counter(dates))
         filtered_dates = np.array(foo[foo > 20].index)
         1 loop, best of 3: 12.3 s per loop

This isn't too bad for an array with 2 million items, versus the following:

In [12]: dates = list(dates)
         filtered_dates = [e for e in set(dates) if dates.count(e) > 20]

I'm not going to wait for the list comprehension version to finish...

Actually, you might try np.unique. In numpy v1.9+, unique can return some extras, like unique_indices, unique_inverse, and unique_counts.
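For completeness, return_inverse also lets you filter the original records rather than just the unique timestamps, which is what the question actually asks for. A minimal sketch, assuming dates is the array built above:

    import numpy as np

    # inverse maps each element of dates to its position in values,
    # so counts[inverse] is the occurrence count for every record
    values, inverse, counts = np.unique(dates, return_inverse=True,
                                        return_counts=True)
    mask = counts[inverse] >= 20  # "20 or more times", per the question
    filtered_records = dates[mask]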

If you want to use pandas, it would be quite simple and probably quite fast. You could use a groupby filter. Something like:

out = df.groupby('timestamp').filter(lambda x: len(x) > 20)

Numpy is slower than pandas on these types of operations, because np.unique sorts, while the machinery in pandas doesn't need to. Furthermore, this is much more idiomatic.
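Note that df here is assumed to be a DataFrame holding the records, with the timestamps in a 'timestamp' column. A minimal sketch of that setup (column names hypothetical):

    import numpy as np
    import pandas as pd

    # hypothetical records: the timestamps plus one extra value column
    df = pd.DataFrame({'timestamp': dates,
                       'value': np.random.randn(len(dates))})

    # keep only the rows whose timestamp occurs more than 20 times
    out = df.groupby('timestamp').filter(lambda x: len(x) > 20)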

Pandas

In [22]: %%timeit
   ....: i = pd.Index(dates)
   ....: i[i.value_counts()>20]
   ....: 
10 loops, best of 3: 78.2 ms per loop

In [23]: i = pd.Index(dates)

In [24]: i[i.value_counts()>20]
Out[24]: 
DatetimeIndex(['2013-06-16 20:40:00', '2013-05-28 03:00:00', '2013-10-31 19:50:00', '2014-06-20 13:00:00', '2013-07-08 21:40:00', '2012-02-26 17:00:00', '2013-01-02 15:40:00', '2012-08-24 02:00:00',
               '2014-10-17 08:20:00', '2012-07-27 20:10:00',
               ...
               '2014-08-07 05:10:00', '2014-05-21 08:10:00', '2014-03-09 12:50:00', '2013-05-10 02:30:00', '2013-04-15 20:20:00', '2012-06-23 05:20:00', '2012-07-06 16:10:00', '2013-02-14 12:20:00',
               '2014-10-27 03:10:00', '2013-09-04 12:00:00'],
              dtype='datetime64[ns]', length=2978, freq=None)

In [25]: len(i[i.value_counts()>20])
Out[25]: 2978

Numpy (from the other solution)

In [26]: %%timeit
         values, counts = np.unique(dates, return_counts=True)
         filtered_dates = values[counts>20]
   ....: 
10 loops, best of 3: 145 ms per loop

In [27]: filtered_dates = values[counts>20]

In [28]: len(filtered_dates)
Out[28]: 2978
  1. Sort your array
  2. Count contiguous occurrences by going through it once, & filter for frequency >= 20

The running time is O(n log n), whereas your list comprehension was probably O(n**2)... that makes quite a difference on 2 million entries.

Depending on how your data is structured, you might be able to sort only the axis and data you need from the numpy array that holds it.
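A minimal numpy sketch of this approach, vectorizing the single pass as a run-length count over the sorted array (assuming dates is the array from the question):

    import numpy as np

    # sort once: O(n log n)
    s = np.sort(dates)

    # a True marks the start of each run of equal timestamps
    starts = np.concatenate(([True], s[1:] != s[:-1]))
    run_starts = np.flatnonzero(starts)
    run_lengths = np.diff(np.append(run_starts, len(s)))

    # timestamps whose run is 20 or longer
    frequent = s[run_starts[run_lengths >= 20]]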

Thanks for all of your suggestions.

I ended up doing something completely different with dictionaries and found it much faster for the processing that I required.

I created a dictionary with the unique timestamps as keys and empty lists as values, then looped once through the unordered list (or array), populating the value lists with the values that I wanted to count.
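A minimal sketch of that dictionary approach, with hypothetical parallel sequences timestamps and record_values standing in for the real columns:

    # dictionary with the unique timestamps as keys and empty lists as values
    groups = {ts: [] for ts in set(timestamps)}

    # single pass through the unordered records
    for ts, val in zip(timestamps, record_values):
        groups[ts].append(val)

    # retain only the timestamps that occur 20 or more times
    frequent = {ts: vals for ts, vals in groups.items() if len(vals) >= 20}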

Thanks again!
