Fastest way to calculate average of datetime rows using pandas
I have 122,864 rows of data stored in an HDF5 file, and I am using pandas for data processing. For each unique ID in the records there is an associated timestamp indicating when the user opened the app. I want to get the average duration between two consecutive hits of the app.
1283 2015-04-01 08:07:44.131768
1284 2015-04-01 08:08:02.752611
1285 2015-04-01 08:08:02.793380
1286 2015-04-01 08:07:53.910469
1287 2015-04-01 08:08:03.305893
1288 2015-04-01 08:07:44.843050
1289 2015-04-01 08:07:54.767203
1290 2015-04-01 08:08:03.965367
1291 2015-04-01 08:07:45.924854
1292 2015-04-01 08:07:55.408593
1293 2015-04-01 08:07:46.365128
import os
import sys
import pprint as pp

import pandas as pd

# module-level dict: device_id -> average seconds between hits
avgtime = {}

class User(object):
    '''
    Properties and functions related to each user object.
    attributes:
        datetime: a list of hit timestamps for each user object
        deviceid: unique device id
    '''
    def __init__(self, User, device_id):
        self.datetime = pd.to_datetime(list(User['datetime']))
        self.deviceid = device_id
        self.avrgtime = 0.0
        avgtime.setdefault(self.deviceid, 1)

    def avg_duration(self):
        '''
        average duration b/w hits for each user.
        '''
        for i, time in enumerate(self.datetime[:-1]):
            self.avrgtime += abs(self.datetime[i+1] - time).total_seconds()
        avgtime[self.deviceid] = self.avrgtime/len(self.datetime)
        pp.pprint(avgtime)
        #avgtime[] = datetime.strptime(time, '%Y-%m-%d %H:%M:%S.%f')

def eachdevice(gstore):
    count = 0
    for did in list(gstore['data'].drop_duplicates('device_id')['device_id']):
        # one HDF5 query per device id
        auser = gstore.select('data', where="device_id == did")
        gamer = User(auser, did)
        gamer.avg_duration()
        count += 1
    print(count)

# main workhorse
if __name__ == '__main__':
    try:
        path = os.path.abspath(sys.argv[1])
        with pd.HDFStore('Gamer.h5') as gstore:
            eachdevice(gstore)
    except IndexError:
        print('\nPass path of the HDF5 file to be analyzed...EXITING\n')
What I am doing so far is looping through each unique ID and using a pandas select query to fetch the datetimes for that ID. This returns a DataFrame of datetime objects. I convert it to a list and then loop over it to calculate the average difference between consecutive timestamps. This approach takes a lot of time. Is there a way to do this using pandas? Please help.
EDIT: After commenting out the whole calculation part, I ran the code again. I think the line auser = gstore.select('data', where="device_id == did") is taking all the time. How can I improve this? Is there an alternative or better way? %timeit result: 1 loops, best of 3: 13.3 s per loop for 1000 iterations.
Edit: Sample data:
device_id datetime
0 c4be7e55d98914647c51329edc2ab734 2015-03-30 22:00:05.922317
1 05fed9f8e07c3cac457723286d36f621 2015-03-30 22:00:07.895672
2 783faeed9fe35a3f45b521b3a6667a2d 2015-03-30 22:00:05.529631
3 c2022ad838cec35bdb12fc3a6e2cf452 2015-03-30 21:59:59.043905
4 a8a04268ee0c22b26af59e053390cf6f 2015-03-30 22:00:14.248542
5 4e5ed16b44b9cd38c408859d1d241e2d 2015-03-30 22:00:02.391719
6 c0bfd3f9046855ffaaec4d99c367fd8c 2015-03-30 22:00:18.649193
7 95f1182c6e4d601ba0b20f5204168ecb 2015-03-30 22:00:13.629728
8 a85caa7e0a4a7d57e6330c083daff326 2015-03-30 22:00:08.340469
9 46cdbee963814cdb4e6a0ac0049b8fc6 2015-03-30 22:00:23.152820
10 3c8bf70679cd9c6f18aa52d06e0e181d 2015-03-30 22:00:17.619251
11 52bc4e3d9dc373d89ec31effe10e6f30 2015-03-30 22:00:11.591954
12 3477eb25e26b6bff0bfc6c3ee59a5f40 2015-03-30 22:00:25.745083
13 e7bf8ae864f2148831628a6f2e8e406e 2015-03-30 22:00:20.911568
14 a15af8faffd655a3e80f85840bbf3c2a 2015-03-30 22:00:19.017887
15 9d9f71f080c0cf478ec4117e78ff89ee 2015-03-30 22:00:28.435585
16 1633d88738316e3602890499b1f778b1 2015-03-30 22:00:24.108234
17 3362daf99f11541acbf45e70fdaf5f49 2015-03-30 22:00:24.512366
18 96c3c005eaaaa8d6af3f2443ca8f73df 2015-03-30 22:00:29.713550
19 002642b9ed495f84318fcb42557f53e1 2015-03-30 22:00:37.936647
Let's create a dummy dataset with 150000 rows similar to yours.
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'datetime': pd.to_datetime(np.random.randint(1429449000, 1429649000, 150000), unit='s'),
...     'device_id': np.random.randint(0, 100, 150000),
... }).sort_values('datetime')
>>> data.head()
datetime device_id
113719 2015-04-19 13:10:00 34
120323 2015-04-19 13:10:01 22
91342 2015-04-19 13:10:04 9
61170 2015-04-19 13:10:08 27
103748 2015-04-19 13:10:11 65
You can use .groupby to pre-compute groups. This lets you easily identify all datetimes for a given device_id.
>>> groups = data.groupby('device_id')
>>> data.loc[groups.groups.get(34)].head() # Get the data for device_id = 34
datetime device_id
113719 2015-04-19 13:10:00 34
105761 2015-04-19 13:11:30 34
85903 2015-04-19 13:18:40 34
36395 2015-04-19 13:19:55 34
108850 2015-04-19 13:20:06 34
From here, it's quick enough to identify the average differences.
>>> def mean_diff(device_id):
...     return data['datetime'][groups.groups.get(device_id)].diff().mean()
...
>>> mean_diff(34)
Timedelta('0 days 00:02:14.470746')
Since the .groupby pre-computes the groups, every successive lookup is quite fast. This step takes about 2 milliseconds on the 150000 rows.
In [68]: %timeit mean_diff(34)
100 loops, best of 3: 2.03 ms per loop
You can also compute this for all device_id values like this:
>>> time_diff = groups.apply(lambda df: df.datetime.diff().mean())
>>> time_diff.head()
device_id
0 00:02:12.871504
1 00:02:10.464099
2 00:02:09.550000
3 00:02:15.845003
4 00:02:14.642375
dtype: timedelta64[ns]
This is pretty fast too. For these 150,000 rows, it takes under 50 ms. Of course, your mileage may vary.
In [79]: %timeit groups.apply(lambda df: df.datetime.diff().mean())
10 loops, best of 3: 46.6 ms per loop
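As a possible variation on the groups.apply call above, the gaps can be computed with a grouped diff() and then averaged per device, which avoids calling a Python lambda for every group. This is only a sketch on made-up toy data (the device ids 'a' and 'b' and the timestamps are hypothetical, not taken from the question), and it reports the result in seconds rather than as a Timedelta:

```python
import pandas as pd

# Hypothetical toy data: two devices with a handful of hits each.
data = pd.DataFrame({
    'device_id': ['a', 'b', 'a', 'b', 'a'],
    'datetime': pd.to_datetime([
        '2015-04-01 08:00:00',
        '2015-04-01 08:00:05',
        '2015-04-01 08:00:10',
        '2015-04-01 08:00:35',
        '2015-04-01 08:00:40',
    ]),
})

# Sort once so that diff() inside each group compares consecutive hits.
data = data.sort_values('datetime')

# Gap to the previous hit of the same device, in seconds
# (NaN for the first hit of each device).
gap_seconds = data.groupby('device_id')['datetime'].diff().dt.total_seconds()

# Average the gaps per device; mean() skips the NaN entries.
mean_gap_seconds = gap_seconds.groupby(data['device_id']).mean()
print(mean_gap_seconds)
```

Whether this is actually faster than the apply version depends on the number of devices and the pandas version, so it is worth timing both on the real data.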
To get a dictionary of the average difference between timestamps for each unique device ID:
device_ids = df.device_id.unique()
device_tdelta = {device: df.loc[df.device_id == device, 'datetime'].diff().mean()
                 for device in device_ids}
You then need to convert these timedeltas to seconds:
import pandas as pd

# pd.isnull() detects NaT, which is what a device with a single hit produces
device_seconds = {device: ts.total_seconds()
                  if not pd.isnull(ts)
                  else pd.NaT
                  for device, ts in device_tdelta.items()}
If the datetime column holds strings, it first needs to be converted to pandas Timestamps.
df['datetime'] = pd.to_datetime(df['datetime'])
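The three steps of this answer (parse the strings, average the gaps per device, convert to seconds) could also be combined into one groupby pass. The following is a minimal sketch under assumptions: the toy frame and the helper name mean_gap_seconds are hypothetical, and a device with a single hit ends up as NaN instead of NaT.

```python
import pandas as pd

# Hypothetical toy data with string timestamps; device 'y' has a single hit.
df = pd.DataFrame({
    'device_id': ['x', 'x', 'x', 'y'],
    'datetime': ['2015-03-30 22:00:00', '2015-03-30 22:00:04',
                 '2015-03-30 22:00:10', '2015-03-30 22:00:07'],
})

# Parse the strings into pandas Timestamps in one vectorized call.
df['datetime'] = pd.to_datetime(df['datetime'])


def mean_gap_seconds(timestamps):
    """Hypothetical helper: mean gap between consecutive hits, in seconds."""
    return timestamps.sort_values().diff().dt.total_seconds().mean()


# One entry per device; a device with only one hit has no gaps, so it maps to NaN.
device_seconds = (df.groupby('device_id')['datetime']
                    .apply(mean_gap_seconds)
                    .to_dict())
print(device_seconds)
```

Working in float seconds from the start means no separate NaT check is needed when building the dictionary.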