Improve code efficiency when iterating through each row: Pandas Dataframe
The code below calculates the duration and distance between rows of two dataframes and, if both the duration and the distance are below a given threshold, appends a row to a new dataframe. It is computationally expensive, especially for a large dataframe.
Linked_df = pd.DataFrame()
# for each unique date
for unq_date in R_Unique_Dates:
    # obtain the rows of Mi and Ri for this specific date
    M = Mi.loc[pd.to_datetime(Mi['EventDate']) == unq_date]
    R = Ri.loc[pd.to_datetime(Ri['EventDate']) == unq_date]
    # check whether the unique date exists in M
    if not M.empty:
        for indexR, rowR in R.iterrows():
            for indexM, rowM in M.iterrows():
                # get the (positive) duration between the two event times
                duration = datetime.combine(date.today(), rowR['EventTime']) - datetime.combine(date.today(), rowM['EventTime'])
                if duration.days < 0:
                    duration = datetime.combine(date.today(), rowM['EventTime']) - datetime.combine(date.today(), rowR['EventTime'])
                hours, remainder = divmod(duration.seconds, 3600)
                minutes, seconds = divmod(remainder, 60)
                if (hours == 0) & (minutes == 0) & (seconds < 11):
                    range_15m = dist_TwoPoints_LatLong(rowR['lat_t'], rowR['lon_t'], rowM['lat'], rowM['long'])
                    if range_15m < 15:
                        # append to the new dataframe
                        rowM['y'] = rowR['y']
                        row1 = pd.DataFrame(rowM).transpose()
                        Linked_df = pd.concat([Linked_df, row1], ignore_index=True)
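The helper dist_TwoPoints_LatLong is not defined in the question; given that its result is compared against 15 (metres), it is presumably a haversine great-circle distance in metres. A minimal sketch, under that assumption, might look like:

```python
from math import radians, sin, cos, asin, sqrt

def dist_TwoPoints_LatLong(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in metres; this is an assumed
    # stand-in for the helper not shown in the question.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))  # mean Earth radius 6371 km
```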
Suppose the data in Mi and Ri are the following:
Ri Dataset
lat_t lon_t y speed_t sprung_weight_t duration_capture EventDate EventTime
-27.7816 22.9939 4 27.1 442.0 2.819999933242798 2017/11/01 12:09:15
-27.7814 22.9939 3 27.3 447.6 2.8359999656677246 2017/11/01 12:09:18
-27.7812 22.9939 3 25.4 412.2 2.884000062942505 2017/11/01 12:09:21
-27.7809 22.994 3 26.1 413.6 2.9670000076293945 2017/11/01 12:09:23
-27.7807 22.9941 3 25.4 395.0 2.938999891281128 2017/11/01 12:09:26
-27.7805 22.9941 3 21.7 451.9 3.2829999923706055 2017/11/01 12:09:29
-27.7803 22.9942 3 20.2 441.7 3.6730000972747803 2017/11/01 12:09:33
-27.7801 22.9942 4 16.7 443.3 4.25 2017/11/01 12:09:36
-27.7798 22.9942 3 15.4 438.2 4.819000005722046 2017/11/01 12:09:41
-27.7796 22.9942 3 15.4 436.1 5.0309998989105225 2017/11/01 12:09:45
-27.7794 22.9942 4 15.8 451.6 5.232000112533569 2017/11/01 12:09:50
-27.7793 22.9941 3 18.2 439.4 4.513000011444092 2017/11/01 12:09:56
-27.7791 22.9941 3 21.4 413.7 3.8450000286102295 2017/11/01 12:10:00
-27.7788 22.994 3 23.1 430.8 3.485999822616577 2017/11/01 12:10:04
Mi Dataset
lat lon EventDate EventTime
-27.7786 22.9939 2017/11/01 12:10:07
-27.7784 22.9939 2017/11/01 12:10:10
-27.7782 22.9939 2017/11/02 12:10:14
-27.778 22.9938 2017/11/02 12:10:17
-27.7777 22.9938 2017/11/02 12:10:21
Linked_df
lat_t lon_t y EventDate EventTime
-27.7786 22.9939 3 2017/11/01 12:10:07
-27.7784 22.9939 3 2017/11/01 12:10:10
How can the code be optimized?
NB: Open to dask dataframe solutions as well. Some dates are the same. Note that the real dataset is much larger than the example above and is taking over a week to complete its run. The most important conditions are that the distance must be less than 15 metres and the time difference 10 seconds or less. The duration itself does not need to be computed, since it is not stored; there may be alternative ways to check whether the time difference is under 10 seconds that take less computational time.
If you want speed, avoid iterrows() whenever you can. Vectorization can give you a 50-fold or 100-fold improvement in speed. This is an example of how to vectorize your code:
for unq_date in R_Unique_Dates:
    M = Mi.loc[pd.to_datetime(Mi['EventDate']) == unq_date]
    R = Ri.loc[pd.to_datetime(Ri['EventDate']) == unq_date]
    # build full timestamps; today's date is only a dummy anchor
    M['date'] = pd.to_datetime(str(date.today()) + ' ' + M['EventTime'].astype(str))
    R['date'] = pd.to_datetime(str(date.today()) + ' ' + R['EventTime'].astype(str))
    # absolute time difference (note: this subtraction aligns rows by index)
    M['duration'] = (M['date'] - R['date']).abs()
...
This way you avoid iterrows(). This code might not work out of the box, given that we don't have the data you are using, but you should follow the idea: perform the operation on the whole dataframe at once (vectorization) rather than iterating over it (iterrows()). Loops are bad for performance. This article is great at explaining the concept.
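As a more complete sketch of that idea (assumptions: EventDate/EventTime are strings as in the sample data, Mi's longitude column is named 'lon' as in the sample header rather than 'long' as in the question's code, and a vectorized haversine stands in for the unshown dist_TwoPoints_LatLong), the two nested iterrows() loops can be replaced by one merge per date plus vectorized filters:

```python
import numpy as np
import pandas as pd

def link_rows(Mi, Ri, max_seconds=10, max_metres=15):
    # Rename so the two EventTime columns stay distinguishable after the merge.
    M = Mi.rename(columns={'EventTime': 'EventTime_M'})
    R = Ri.rename(columns={'EventTime': 'EventTime_R'})
    M['ts_M'] = pd.to_datetime(M['EventDate'] + ' ' + M['EventTime_M'])
    R['ts_R'] = pd.to_datetime(R['EventDate'] + ' ' + R['EventTime_R'])
    # Merging on EventDate cross-joins the rows within each date,
    # replacing both the per-date loop and the nested iterrows() loops.
    pairs = M.merge(R, on='EventDate')
    # Vectorized time filter: keep pairs at most max_seconds apart.
    pairs = pairs[(pairs['ts_M'] - pairs['ts_R']).abs() <= pd.Timedelta(seconds=max_seconds)]
    # Vectorized haversine distance in metres (assumed equivalent to
    # the question's dist_TwoPoints_LatLong helper).
    lat1, lon1 = np.radians(pairs['lat']), np.radians(pairs['lon'])
    lat2, lon2 = np.radians(pairs['lat_t']), np.radians(pairs['lon_t'])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    dist_m = 2 * 6371000 * np.arcsin(np.sqrt(a))
    out = pairs.loc[dist_m < max_metres, ['lat', 'lon', 'y', 'EventDate', 'EventTime_M']]
    return out.rename(columns={'EventTime_M': 'EventTime'}).reset_index(drop=True)
```

The cross-join costs memory proportional to the product of the per-date group sizes, so for very large per-date groups a merge_asof-style nearest-time join may be a better fit; but even the plain merge moves all the work into vectorized column operations.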
The outer loop, for unq_date in R_Unique_Dates:, can be expressed as a groupby, but I would recommend starting with the above; groupby can be a bit confusing when you are starting out.
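For reference, a minimal sketch of that groupby form (using column names from the question's sample data):

```python
import pandas as pd

Mi = pd.DataFrame({'EventDate': ['2017/11/01', '2017/11/01', '2017/11/02'],
                   'lat': [-27.7786, -27.7784, -27.7782]})

# groupby yields (date, sub-frame) pairs, replacing the manual
# "for unq_date in R_Unique_Dates: Mi.loc[...]" filtering, which
# rescans the whole frame once per date.
groups = {unq_date: M for unq_date, M in Mi.groupby('EventDate')}
```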