
Improve code efficiency when iterating through each row: Pandas Dataframe

The code below calculates the duration and distance between rows of two dataframes; if both the duration and the distance are less than a specific amount, the row is appended to a new dataframe.

The code is computationally expensive, especially for a large dataframe.

import pandas as pd
from datetime import datetime, date

Linked_df = pd.DataFrame()
# for each unique date
for unq_date in R_Unique_Dates:
    # obtain the rows of Mi and Ri for this specific date
    M = Mi.loc[pd.to_datetime(Mi['EventDate']) == unq_date]
    R = Ri.loc[pd.to_datetime(Ri['EventDate']) == unq_date]
    # check whether the unique date exists in M
    if not M.empty:
        for indexR, rowR in R.iterrows():
            for indexM, rowM in M.iterrows():
                # absolute time difference between the two events
                duration = (datetime.combine(date.today(), rowR['EventTime'])
                            - datetime.combine(date.today(), rowM['EventTime']))
                if duration.days < 0:
                    duration = (datetime.combine(date.today(), rowM['EventTime'])
                                - datetime.combine(date.today(), rowR['EventTime']))
                hours, remainder = divmod(duration.seconds, 3600)
                minutes, seconds = divmod(remainder, 60)
                # keep pairs at most 10 seconds apart ...
                if (hours == 0) and (minutes == 0) and (seconds < 11):
                    # ... and less than 15 metres apart
                    range_15m = dist_TwoPoints_LatLong(rowR['lat_t'], rowR['lon_t'],
                                                       rowM['lat'], rowM['long'])
                    if range_15m < 15:
                        # append the matched M row, tagged with R's y value
                        rowM['y'] = rowR['y']
                        row1 = pd.DataFrame(rowM).transpose()
                        Linked_df = pd.concat([Linked_df, row1], ignore_index=True)
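
The helper dist_TwoPoints_LatLong is not shown in the question. For completeness, a minimal sketch of what it might look like, assuming it computes a haversine great-circle distance in metres (the name and signature come from the code above; the body is an assumption):

import math

def dist_TwoPoints_LatLong(lat1, lon1, lat2, lon2):
    # assumed implementation: haversine great-circle distance in metres
    R_earth = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * R_earth * math.asin(math.sqrt(a))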

Suppose the data in Mi and Ri are the following:

Ri Dataset

lat_t   lon_t   y   speed_t sprung_weight_t duration_capture    EventDate   EventTime
-27.7816    22.9939 4   27.1    442.0   2.819999933242798   2017/11/01  12:09:15
-27.7814    22.9939 3   27.3    447.6   2.8359999656677246  2017/11/01  12:09:18
-27.7812    22.9939 3   25.4    412.2   2.884000062942505   2017/11/01  12:09:21
-27.7809    22.994  3   26.1    413.6   2.9670000076293945  2017/11/01  12:09:23
-27.7807    22.9941 3   25.4    395.0   2.938999891281128   2017/11/01  12:09:26
-27.7805    22.9941 3   21.7    451.9   3.2829999923706055  2017/11/01  12:09:29
-27.7803    22.9942 3   20.2    441.7   3.6730000972747803  2017/11/01  12:09:33
-27.7801    22.9942 4   16.7    443.3   4.25                2017/11/01  12:09:36
-27.7798    22.9942 3   15.4    438.2   4.819000005722046   2017/11/01  12:09:41
-27.7796    22.9942 3   15.4    436.1   5.0309998989105225  2017/11/01  12:09:45
-27.7794    22.9942 4   15.8    451.6   5.232000112533569   2017/11/01  12:09:50
-27.7793    22.9941 3   18.2    439.4   4.513000011444092   2017/11/01  12:09:56
-27.7791    22.9941 3   21.4    413.7   3.8450000286102295  2017/11/01  12:10:00
-27.7788    22.994  3   23.1    430.8   3.485999822616577   2017/11/01  12:10:04

Mi Dataset

lat        lon      EventDate   EventTime
-27.7786    22.9939 2017/11/01  12:10:07
-27.7784    22.9939 2017/11/01  12:10:10
-27.7782    22.9939 2017/11/02  12:10:14
-27.778     22.9938 2017/11/02  12:10:17
-27.7777    22.9938 2017/11/02  12:10:21

Linked_df

lat     lon     y   EventDate   EventTime
-27.7786    22.9939 3   2017/11/01  12:10:07
-27.7784    22.9939 3   2017/11/01  12:10:10

How can the code be optimized?

NB: Open to dask dataframe solutions as well. Some dates are repeated. Note that the real dataset is much larger than the example above and is taking over a week to complete its run. The most important conditions are that the distance must be less than 15 metres and the time difference 10 seconds or less. It is not actually required to compute the duration, since it is not stored; there may be alternative ways to determine whether the duration is 10 seconds or less that take less computational time.

If you want speed, do not use iterrows() if you can avoid it. Vectorization can give you a 50- or 100-fold improvement in speed.

This is an example of how to use vectorization on your code:

from datetime import date
import pandas as pd

for unq_date in R_Unique_Dates:
    # .copy() avoids SettingWithCopyWarning when adding columns below
    M = Mi.loc[pd.to_datetime(Mi['EventDate']) == unq_date].copy()
    R = Ri.loc[pd.to_datetime(Ri['EventDate']) == unq_date].copy()

    # build full datetimes for the whole column at once
    M['date'] = pd.to_datetime(str(date.today()) + ' ' + M['EventTime'].astype(str))
    R['date'] = pd.to_datetime(str(date.today()) + ' ' + R['EventTime'].astype(str))

    # column-wise subtraction replaces the inner iterrows() loops
    M['duration'] = M['date'] - R['date']
    M.loc[M['duration'] < pd.Timedelta(0), 'duration'] = R['date'] - M['date']
    ...

This way you avoid using iterrows().

This code might not work out of the box, given that we don't have the data you are using, but you should follow the idea: perform the operation on the whole dataframe at once (vectorization) rather than iterating over it (iterrows()). Loops are bad for performance. This article is great at explaining this concept.
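
As a rough, self-contained illustration of that gap (a toy benchmark, not from the original answer; absolute timings are machine-dependent):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(100_000),
                   'b': np.random.rand(100_000)})

t0 = time.perf_counter()
total = 0.0
for _, row in df.iterrows():      # row-by-row: Python-level loop
    total += row['a'] - row['b']
t1 = time.perf_counter()

vec = (df['a'] - df['b']).sum()   # whole column at once: compiled code
t2 = time.perf_counter()

print(f"iterrows: {t1 - t0:.2f}s  vectorized: {t2 - t1:.4f}s")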

The outer loop, for unq_date in R_Unique_Dates:, can be expressed as a groupby, but I would recommend starting with the above. Using groupby can be a bit confusing when you are starting out.
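
For reference, a hedged end-to-end sketch of where this leads. It pairs same-date rows with a merge on EventDate (equivalent in effect to the outer loop, or to a groupby) and applies both filters column-wise; column names are taken from the question, and the vectorized haversine is an assumed stand-in for dist_TwoPoints_LatLong:

import numpy as np
import pandas as pd

def link_frames(Mi, Ri):
    # every M row paired with every R row sharing the same date
    pairs = Mi.merge(Ri, on='EventDate', suffixes=('', '_r'))

    # absolute time difference in seconds, for all pairs at once
    t_m = pd.to_datetime(pairs['EventDate'] + ' ' + pairs['EventTime'].astype(str))
    t_r = pd.to_datetime(pairs['EventDate'] + ' ' + pairs['EventTime_r'].astype(str))
    dt = (t_m - t_r).dt.total_seconds().abs()

    # vectorized haversine in metres (assumed equivalent of dist_TwoPoints_LatLong)
    lat1, lon1 = np.radians(pairs['lat_t']), np.radians(pairs['lon_t'])
    lat2, lon2 = np.radians(pairs['lat']), np.radians(pairs['long'])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist = 2 * 6371000.0 * np.arcsin(np.sqrt(a))

    # the question's conditions: 10 seconds or less apart, under 15 metres apart
    mask = (dt <= 10) & (dist < 15)
    return pairs.loc[mask, ['lat', 'long', 'y', 'EventDate', 'EventTime']]

Note that the merge materialises every same-date pair, so for very large frames it may be worth chunking by date or using pd.merge_asof to bound the time window before computing distances.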
