I have an array of over 2 million records; each record has a timestamp at 10-minute resolution in datetime.datetime format, as well as several other values in other columns.
I only want to retain the records whose timestamps occur 20 or more times in the array. What's the fastest way to do this? I've got plenty of RAM, so I'm optimizing for processing speed.
I've tried [].count() in a list comprehension but started to lose the will to live waiting for it to finish. I've also tried numpy.bincount(), but unfortunately it doesn't accept datetime.datetime values.
Any suggestions would be much appreciated. Thanks!
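Regarding the bincount problem: np.bincount only counts nonnegative integers, but you can map the timestamps to small integer codes first (np.unique with return_inverse gives those codes directly). A minimal sketch with made-up timestamps, using a threshold of 2 instead of 20; note that counts[inverse] lets you filter the original records, not just the unique timestamps:

```python
import numpy as np

# Toy data: three occurrences of 00:00, one each of 00:10 and 00:20.
dates = np.array(['2020-01-01T00:00', '2020-01-01T00:10',
                  '2020-01-01T00:00', '2020-01-01T00:20',
                  '2020-01-01T00:00'], dtype='datetime64[m]')

# inverse[i] is the integer code of dates[i] in `values`
values, inverse = np.unique(dates, return_inverse=True)
counts = np.bincount(inverse)      # counts[k] = occurrences of values[k]

# Keep only records whose timestamp occurs at least 2 times (20 in the real data).
mask = counts[inverse] >= 2
filtered = dates[mask]
```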
Edit: I'm including the timings using np.unique, based on the suggestion below. This is by far the best solution.
In [10]: import pandas as pd
import numpy as np
from collections import Counter
#create a fake data set
dates = pd.date_range("2012-01-01", "2015-01-01", freq="10min")
dates = np.random.choice(dates, 2000000, replace=True)
Based on the suggestion below the following would be the fastest by far:
In [32]: %%timeit
values, counts = np.unique(dates, return_counts=True)
filtered_dates = values[counts>20]
10 loops, best of 3: 150 ms per loop
Using Counter you can build a dictionary of the counts of each item and then convert it to a pd.Series in order to do the filtering:
In [11]: %%timeit
foo = pd.Series(Counter(dates))
filtered_dates = np.array(foo[foo > 20].index)
1 loop, best of 3: 12.3 s per loop
That isn't too bad for an array with 2 million items, vs. the following:
In [12]: dates = list(dates)
filtered_dates = [e for e in set(dates) if dates.count(e) > 20]
I'm not going to wait for the list comprehension version to finish...
Actually, you might try np.unique. In numpy v1.9+, unique can return some extras, like unique_indices, unique_inverse, and unique_counts.
If you want to use pandas, it would be quite simple and probably quite fast. You could use a groupby filter. Something like:
out = df.groupby('timestamp').filter(lambda x: len(x) > 20)
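A self-contained sketch of that one-liner, with a hypothetical DataFrame (column names and sizes are made up for illustration):

```python
import numpy as np
import pandas as pd

# Fake data: 2000 rows drawn from 100 possible 10-minute timestamps,
# so each timestamp occurs about 20 times on average.
rng = np.random.default_rng(0)
stamps = pd.date_range("2012-01-01", periods=100, freq="10min")
df = pd.DataFrame({
    "timestamp": rng.choice(stamps, 2000),
    "value": rng.random(2000),
})

# Keep only rows whose timestamp occurs more than 20 times.
out = df.groupby("timestamp").filter(lambda g: len(g) > 20)
```

Unlike the np.unique approach, this returns the surviving rows themselves (with all their columns), not just the unique timestamps.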
NumPy is slower than pandas on these kinds of operations because np.unique sorts, while the machinery in pandas doesn't need to. Further, this is much more idiomatic.
Pandas
In [22]: %%timeit
....: i = pd.Index(dates)
....: i[i.value_counts()>20]
....:
10 loops, best of 3: 78.2 ms per loop
In [23]: i = pd.Index(dates)
In [24]: i[i.value_counts()>20]
Out[24]:
DatetimeIndex(['2013-06-16 20:40:00', '2013-05-28 03:00:00', '2013-10-31 19:50:00', '2014-06-20 13:00:00', '2013-07-08 21:40:00', '2012-02-26 17:00:00', '2013-01-02 15:40:00', '2012-08-24 02:00:00',
'2014-10-17 08:20:00', '2012-07-27 20:10:00',
...
'2014-08-07 05:10:00', '2014-05-21 08:10:00', '2014-03-09 12:50:00', '2013-05-10 02:30:00', '2013-04-15 20:20:00', '2012-06-23 05:20:00', '2012-07-06 16:10:00', '2013-02-14 12:20:00',
'2014-10-27 03:10:00', '2013-09-04 12:00:00'],
dtype='datetime64[ns]', length=2978, freq=None)
In [25]: len(i[i.value_counts()>20])
Out[25]: 2978
NumPy (from the other solution)
In [26]: %%timeit
values, counts = np.unique(dates, return_counts=True)
filtered_dates = values[counts>20]
....:
10 loops, best of 3: 145 ms per loop
In [27]: filtered_dates = values[counts>20]
In [28]: len(filtered_dates)
Out[28]: 2978
Sort your array, then scan through it and keep only the timestamps that occur with frequency >= 20.
The running time is O(n log n), whereas your list comprehension was probably O(n**2)... that makes quite a difference on 2 million entries.
Depending on how your data is structured, you might be able to sort only the axis and data you need from the numpy array that holds it.
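A sketch of the sort-then-scan idea (using small integers in place of timestamps, and a threshold of 2 instead of 20): after sorting, equal items are adjacent, so run boundaries fall where np.diff is nonzero and run lengths follow from np.diff on those boundaries.

```python
import numpy as np

a = np.sort(np.array([3, 1, 2, 1, 3, 3, 1, 1], dtype='int64'))

# Indices where a new run of equal values starts, plus both ends.
boundaries = np.concatenate(([0], np.flatnonzero(np.diff(a)) + 1, [len(a)]))
run_values = a[boundaries[:-1]]     # one representative per run
run_lengths = np.diff(boundaries)   # length of each run

keep = run_values[run_lengths >= 2]  # threshold would be 20 in the real problem
```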
Thanks for all of your suggestions.
I ended up doing something completely different with dictionaries and found it much faster for the processing that I required.
I created a dictionary with the unique set of timestamps as keys and empty lists as values, then looped once through the unordered list (or array), populating the value lists with the values I wanted to count.
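The approach described above can be sketched like this (toy records and a threshold of 2 instead of 20; defaultdict avoids pre-building the keys):

```python
from collections import defaultdict
import datetime

# Toy records: (timestamp, value) pairs in no particular order.
records = [
    (datetime.datetime(2014, 1, 1, 0, 0), 1.0),
    (datetime.datetime(2014, 1, 1, 0, 0), 2.0),
    (datetime.datetime(2014, 1, 1, 0, 10), 3.0),
]

# Single pass: bucket each record's value under its timestamp.
buckets = defaultdict(list)
for ts, value in records:
    buckets[ts].append(value)

# Keep only timestamps seen at least 2 times (20 in the real data).
kept = {ts: vals for ts, vals in buckets.items() if len(vals) >= 2}
```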
Thanks again!