
Fastest way to calculate average of datetime rows using pandas

I have 122864 rows of data, stored in an HDF5 file, and I am using pandas for data processing. For each unique id in the record there is a timestamp indicating when the user opened an app. I want to get the average duration between two consecutive hits of the app.

1283    2015-04-01 08:07:44.131768
1284    2015-04-01 08:08:02.752611
1285    2015-04-01 08:08:02.793380
1286    2015-04-01 08:07:53.910469
1287    2015-04-01 08:08:03.305893
1288    2015-04-01 08:07:44.843050
1289    2015-04-01 08:07:54.767203
1290    2015-04-01 08:08:03.965367
1291    2015-04-01 08:07:45.924854
1292    2015-04-01 08:07:55.408593
1293    2015-04-01 08:07:46.365128

import os
import pprint
import sys

import pandas as pd

avgtime = {}                      # device_id -> average seconds between hits
pp = pprint.PrettyPrinter()


class User(object):

    '''
    Properties and functions related to each user object.

    attributes:

        datetime: a list of hit timestamps for each user object
        deviceid: unique device id
    '''

    def __init__(self, frame, device_id):
        # `frame` is the per-device DataFrame selected from the HDF5 store
        self.datetime = pd.to_datetime(list(frame['datetime']))
        self.deviceid = device_id
        self.avrgtime = 0.0
        avgtime.setdefault(self.deviceid, 1)

    def avg_duration(self):
        '''
        Average duration (in seconds) between hits for each user.
        '''
        for i, time in enumerate(self.datetime[:-1]):
            self.avrgtime += abs(self.datetime[i + 1] - time).total_seconds()
        # n timestamps give n - 1 intervals
        avgtime[self.deviceid] = self.avrgtime / max(len(self.datetime) - 1, 1)
        pp.pprint(avgtime)


def eachdevice(gstore):
    count = 0
    for did in list(gstore['data'].drop_duplicates('device_id')['device_id']):
        auser = gstore.select('data', where="device_id == did")
        gamer = User(auser, did)
        gamer.avg_duration()
        count += 1
        print(count)



# main workhorse
if __name__ == '__main__':

    try:
        path = os.path.abspath(sys.argv[1])
        with pd.HDFStore(path) as gstore:
            eachdevice(gstore)

    except IndexError:
        print('\nPass the path of the HDF5 file to be analyzed...EXITING\n')

What I am doing so far is looping through each unique id and, using an HDFStore select, querying the datetimes for that id. This returns a DataFrame of datetime objects, which I convert to a list and then loop over to calculate the average difference between consecutive timestamps. This approach takes a lot of time. Is there any way to do this using pandas?

Please help.

EDIT: After commenting out all the calculation code and running again, I think the line auser = gstore.select('data', where="device_id == did") is taking all the time. How can I improve this? Is there an alternative or better way? %timeit result: 1 loops, best of 3: 13.3 s per loop, for 1000 iterations.
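A hedged sketch of one way around the per-device select: read the table from the store once and do the grouping in memory, assuming the data fits in RAM (122864 rows easily does). The DataFrame below is a hypothetical stand-in for a single `gstore.select('data')` call:

```python
import pandas as pd

# Hypothetical stand-in for the HDF5 table; in the real script this would come
# from ONE gstore.select('data') instead of one select per device_id.
data = pd.DataFrame({
    'device_id': ['a', 'a', 'a', 'b', 'b'],
    'datetime': pd.to_datetime([
        '2015-04-01 08:00:00', '2015-04-01 08:00:10', '2015-04-01 08:00:30',
        '2015-04-01 08:00:00', '2015-04-01 08:01:00',
    ]),
})

# One in-memory groupby replaces all the per-device disk queries:
# per device, diff() gives the gaps between consecutive hits, mean() averages them.
avg_seconds = (data.sort_values('datetime')
                   .groupby('device_id')['datetime']
                   .apply(lambda s: s.diff().mean().total_seconds()))
```

This trades one large read for many small indexed reads, which is usually the right trade when the whole table fits in memory.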

Edit: Sample data:

                           device_id                    datetime
0   c4be7e55d98914647c51329edc2ab734  2015-03-30 22:00:05.922317
1   05fed9f8e07c3cac457723286d36f621  2015-03-30 22:00:07.895672
2   783faeed9fe35a3f45b521b3a6667a2d  2015-03-30 22:00:05.529631
3   c2022ad838cec35bdb12fc3a6e2cf452  2015-03-30 21:59:59.043905
4   a8a04268ee0c22b26af59e053390cf6f  2015-03-30 22:00:14.248542
5   4e5ed16b44b9cd38c408859d1d241e2d  2015-03-30 22:00:02.391719
6   c0bfd3f9046855ffaaec4d99c367fd8c  2015-03-30 22:00:18.649193
7   95f1182c6e4d601ba0b20f5204168ecb  2015-03-30 22:00:13.629728
8   a85caa7e0a4a7d57e6330c083daff326  2015-03-30 22:00:08.340469
9   46cdbee963814cdb4e6a0ac0049b8fc6  2015-03-30 22:00:23.152820
10  3c8bf70679cd9c6f18aa52d06e0e181d  2015-03-30 22:00:17.619251
11  52bc4e3d9dc373d89ec31effe10e6f30  2015-03-30 22:00:11.591954
12  3477eb25e26b6bff0bfc6c3ee59a5f40  2015-03-30 22:00:25.745083
13  e7bf8ae864f2148831628a6f2e8e406e  2015-03-30 22:00:20.911568
14  a15af8faffd655a3e80f85840bbf3c2a  2015-03-30 22:00:19.017887
15  9d9f71f080c0cf478ec4117e78ff89ee  2015-03-30 22:00:28.435585
16  1633d88738316e3602890499b1f778b1  2015-03-30 22:00:24.108234
17  3362daf99f11541acbf45e70fdaf5f49  2015-03-30 22:00:24.512366
18  96c3c005eaaaa8d6af3f2443ca8f73df  2015-03-30 22:00:29.713550
19  002642b9ed495f84318fcb42557f53e1  2015-03-30 22:00:37.936647

Let's create a dummy dataset with 150000 rows similar to yours.

>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'device_id': np.random.randint(0, 100, 150000),
...     'datetime': pd.Series(np.random.randint(1429449000, 1429649000, 150000) * 1E9).astype('datetime64[ns]')
... }).sort_values('datetime')
>>> data.head()
                  datetime  device_id
113719 2015-04-19 13:10:00         34
120323 2015-04-19 13:10:01         22
91342  2015-04-19 13:10:04          9
61170  2015-04-19 13:10:08         27
103748 2015-04-19 13:10:11         65

You can use .groupby to pre-compute the groups. This lets you easily identify all datetimes for a given device_id.

>>> groups = data.groupby('device_id')
>>> data.loc[groups.groups.get(34)].head()   # Get the data for device_id = 34
                  datetime  device_id
113719 2015-04-19 13:10:00         34
105761 2015-04-19 13:11:30         34
85903  2015-04-19 13:18:40         34
36395  2015-04-19 13:19:55         34
108850 2015-04-19 13:20:06         34

From here, it's quick to compute the average differences.

>>> def mean_diff(device_id):
...     return data['datetime'].loc[groups.groups.get(device_id)].diff().mean()
...
>>> mean_diff(34)
Timedelta('0 days 00:02:14.470746')

Since .groupby pre-computes the groups, every subsequent lookup is quite fast. This step takes about 2 milliseconds on the 150000 rows.

In [68]: %timeit mean_diff(34)
100 loops, best of 3: 2.03 ms per loop

You can also compute this for all device_ids at once, like this:

>>> time_diff = groups.apply(lambda df: df.datetime.diff().mean())
>>> time_diff.head()
device_id
0   00:02:12.871504
1   00:02:10.464099
2   00:02:09.550000
3   00:02:15.845003
4   00:02:14.642375
dtype: timedelta64[ns]

This is pretty fast too. For these 150,000 rows, it takes under 50 ms. Of course, your mileage may vary.

In [79]: %timeit groups.apply(lambda df: df.datetime.diff().mean())
10 loops, best of 3: 46.6 ms per loop

To get a dictionary of the average difference between timestamps for each unique user ID:

device_ids = df.device_id.unique()
device_tdelta = {device: df.loc[df.device_id == device, 'datetime'].diff().mean()
                 for device in device_ids}

You then need to convert these timedeltas to seconds:

import pandas as pd

device_seconds = {device: ts.total_seconds() if pd.notna(ts) else pd.NaT
                  for device, ts in device_tdelta.items()}
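As an alternative sketch: if you keep the result as a timedelta Series (like the per-device groupby result shown earlier), the .dt.total_seconds() accessor converts the whole Series at once and turns NaT into NaN automatically, so no per-item check is needed. The Series below is illustrative data, not real output:

```python
import pandas as pd

# Illustrative stand-in for a per-device timedelta Series
time_diff = pd.Series(pd.to_timedelta(['00:02:12', pd.NaT, '00:02:09']))

# Vectorized conversion to seconds; NaT propagates as NaN
seconds = time_diff.dt.total_seconds()
```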

If the datetime column is in the form of strings, it first needs to be converted to pandas Timestamps.

df['datetime'] = pd.to_datetime(df['datetime'])  # vectorized; faster than a per-row list comprehension

