
Best way to get last entries from Pandas data frame

I recently needed to get the last recorded status for a set of items, each labeled with an id. I found this answer: Python: How can I get rows which have the max value of the group to which they belong?

To my surprise, it was fairly slow even on a dataset with only ~2e6 rows. However, I do not need every row carrying the max value — only the last one per id.

import time

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": np.random.randint(1, 1000, size=5000),
    "status": np.random.randint(1, 10, size=5000),
    "date": [
        time.strftime("%Y-%m-%d", time.localtime(time.time() - x))
        for x in np.random.randint(int(-5e7), int(5e7), size=5000)
    ],
})

%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
1 loops, best of 3: 576 ms per loop

%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
100 loops, best of 3: 4.82 ms per loop

The first one is the solution from the linked answer; it is the more general approach and allows more complex operations.

For my problem, however, sorting, dropping duplicates, and reindexing performs a lot better, especially on larger datasets.
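The same sort-then-deduplicate idea can also be written with `drop_duplicates(subset=...)`, which skips the separate `reindex` step. A minimal sketch on toy data (not benchmarked; with ISO-formatted date strings, lexicographic sort order matches chronological order):

```python
import pandas as pd

df = pd.DataFrame({
    "id":     [1, 1, 2, 2, 3],
    "status": [5, 7, 2, 9, 4],
    "date":   ["2020-01-01", "2020-03-01", "2020-02-01", "2020-01-15", "2020-06-01"],
})

# Sort newest-first, then keep the first (i.e. latest) row per id.
last = (
    df.sort_values("date", ascending=False)
      .drop_duplicates(subset="id", keep="first")
      .sort_values("id")
)
print(last)
```

Note that if several rows of an id share the same maximal date, this keeps exactly one of them, whereas the `groupby`/`max` approach keeps all ties.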

My question: are there other ways to achieve this, possibly with even better performance?

Another way to approach this is to use an aggregation on the groupby, followed by a selection on the full dataframe.

df.iloc[df.groupby('id')['date'].idxmax()]

This appears to be about a factor of 5-10 faster than the solutions you proposed (timings below). Note that it only works if the 'date' column is a datetime (or otherwise numerical) type rather than strings, and that this conversion also speeds up your sorting-based solution:

# Timing your original solutions:
%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
# 1 loops, best of 3: 826 ms per loop
%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
# 100 loops, best of 3: 5.1 ms per loop

# convert the date
df['date'] = pd.to_datetime(df['date'])

# new times on your solutions
%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
# 1 loops, best of 3: 815 ms per loop
%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
# 1000 loops, best of 3: 1.99 ms per loop

# my aggregation solution
%timeit df.iloc[df.groupby('id')['date'].idxmax()]
# 10 loops, best of 3: 135 ms per loop
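A closely related variant (my own addition, not from the answer above) is to sort once and take the last row of each group with `tail(1)`; on toy data it selects the same rows as the `idxmax` approach. Since `idxmax` returns index *labels*, `.loc` is the safer accessor when the index is not a default `RangeIndex`:

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 1, 2],
    "date": pd.to_datetime(["2020-01-01", "2020-05-01", "2020-04-01", "2020-02-01"]),
})

# Sort chronologically, then take the last (i.e. latest) row in each id group.
latest = df.sort_values("date").groupby("id").tail(1).sort_values("id")

# Equivalent selection via idxmax, using label-based .loc:
latest_idxmax = df.loc[df.groupby("id")["date"].idxmax()].sort_values("id")
```

Which of the two is faster likely depends on the number of groups and rows, so it is worth timing both on your real data.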
