
Iterate pandas.DataFrame efficiently while accessing more than one index row at a time

I have already read answers and blog entries about how to iterate over a pandas.DataFrame efficiently ( https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6 ), but I still have one question left.

Currently, my DataFrame represents a GPS trajectory with the columns time, longitude and latitude. Now I want to calculate a feature called distance-to-next-point. So I not only have to iterate through the rows and operate on single rows, but also have to access subsequent rows within a single iteration.

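# Assumes df has the default integer index (0, 1, 2, ...)
for i, row in df.iterrows():
    if i < len(df) - 1:
        distance = calculate_distance(
            [row['latitude'], row['longitude']],
            [df.loc[i + 1, 'latitude'], df.loc[i + 1, 'longitude']],
        )
        # Write to df itself; assigning to row would only change a copy
        df.loc[i, 'distance'] = distance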

Besides this problem, I have the same issue when calculating speed, applying smoothing or other similar methods.

Another example: I want to find data points with speed == 0 m/s and, starting from each of these points, collect all subsequent data points into an array until the speed reaches 10 m/s (to find segments of acceleration from 0 m/s to 10 m/s).

Do you have any suggestions on how to code things like this as efficiently as possible?

You can use pd.DataFrame.shift to add shifted series to your dataframe, then feed each row into your function via apply:

def calculate_distance(row):
    # your function goes here, trivial function used for demonstration
    return sum(row[i] for i in df.columns)

df[['next_latitude', 'next_longitude']] = df[['latitude', 'longitude']].shift(-1)
df.loc[df.index[:-1], 'distance'] = df.iloc[:-1].apply(calculate_distance, axis=1)

print(df)

   latitude  longitude  next_latitude  next_longitude  distance
0         1          5            2.0             6.0      14.0
1         2          6            3.0             7.0      18.0
2         3          7            4.0             8.0      22.0
3         4          8            NaN             NaN       NaN

This works for an arbitrary function calculate_distance, but chances are your algorithm is vectorisable, in which case you should use column-wise Pandas / NumPy methods instead of apply.
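As an illustration of the vectorised route, here is a minimal sketch that assumes distance-to-next-point means the great-circle (haversine) distance in metres and that latitude and longitude are in decimal degrees. The function name haversine_to_next, the constant EARTH_RADIUS_M and the sample coordinates are made up for this example and are not from the question.

```python
import numpy as np
import pandas as pd

# Assumption: spherical Earth with the mean radius in metres
EARTH_RADIUS_M = 6_371_000

def haversine_to_next(df):
    """Great-circle distance in metres from each row to the following row."""
    # Convert whole columns to radians at once (no Python-level loop)
    lat1 = np.radians(df['latitude'])
    lon1 = np.radians(df['longitude'])
    # shift(-1) aligns every row with the coordinates of the next row
    lat2 = lat1.shift(-1)
    lon2 = lon1.shift(-1)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Made-up sample trajectory, one degree between consecutive points
df = pd.DataFrame({
    'latitude': [0.0, 0.0, 1.0],
    'longitude': [0.0, 1.0, 1.0],
})
df['distance'] = haversine_to_next(df)
print(df)
```

The last row has no successor, so its distance is NaN, just as in the apply version. The same shift pattern should carry over to speed: divide the distance column by the difference between the shifted and unshifted time columns, provided time is stored as a numeric or datetime type.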


 