My program needs to fetch a row from a huge Pandas DataFrame based on the value in a column, and response time is critical. I use the most common way to do it, for example:
df.loc[df['id'] == 500000, :]
Per timeit on my Mac, the above operation takes about 4 ms on a dataframe with 1 million rows, but my goal is to reduce that to around 0.4 ms. I considered converting the dataframe to a set, but a set is not ordered and does not natively support indexing or slicing. Any suggestions?
Let's set this up:
import pandas as pd
import numpy as np
df = pd.DataFrame({"id": np.random.randint(100,size=(1000000,))})
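As an aside, the timings below can be reproduced with a small self-contained harness along these lines (absolute numbers will of course vary by machine):

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild the benchmark frame: 1 million rows, ids drawn from [0, 100)
df = pd.DataFrame({"id": np.random.randint(100, size=(1_000_000,))})

# Time the single-equality lookup; 100 runs keeps the harness quick,
# scale number up for more stable measurements
elapsed = timeit.timeit(lambda: df.loc[df["id"] == 50, :], number=100)
print(f"boolean mask + .loc: {elapsed:.3f} s for 100 lookups")
```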
Then let's benchmark some options. Your current boolean mask + .loc:
>>> timeit.timeit("df.loc[df['id'] == 50, :]", setup = "from __main__ import df", number=1000)
2.566220869999597
The query engine:
>>> timeit.timeit("df.query('id == 50')", setup = "from __main__ import df", number=1000)
14.591400260000228
Using the index as a separate lookup:
>>> idx = pd.Index(df['id'])
>>> timeit.timeit("df.loc[idx == 50, :]", setup = "from __main__ import df, idx", number=1000)
2.2155187300013495
Using the dataframe index for lookup:
>>> df.index = df["id"]
>>> timeit.timeit("df.loc[50, :]", setup = "from __main__ import df", number=1000)
2.625610274999417
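A variation not timed above that may be worth checking on your data: sorting the index makes it monotonic, which lets pandas locate the matching block by binary search instead of scanning. A minimal sketch (`df_sorted` is an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": np.random.randint(100, size=(1_000_000,))})

# drop=False keeps 'id' as a regular column as well as the index
df_sorted = df.set_index("id", drop=False).sort_index()

# With a sorted (monotonic) index, .loc can binary-search for the slice
rows = df_sorted.loc[50]
```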
And the .isin() idea someone suggested in the comments:
>>> timeit.timeit("df.loc[df['id'].isin([50]), :]", setup = "from __main__ import df", number=1000)
9.542700138999862
Looks like, with the exception of the query engine being slow for a simple equality (as expected), you're not going to do much better than the lookup time you already have.
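One more micro-optimization you could try if you stick with the boolean mask: comparing against the underlying NumPy array skips the Series index machinery, which often trims some per-lookup overhead (though nowhere near dict speed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": np.random.randint(100, size=(1_000_000,))})

# .to_numpy() exposes the raw ndarray; the comparison then runs in
# NumPy without building an intermediate indexed Series
mask = df["id"].to_numpy() == 50
result = df[mask]
```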
Let's see how a unique ID might help:
df_unique = pd.DataFrame({'id': range(1000000)})
>>> timeit.timeit("df_unique.loc[df_unique['id'] == 50, :]", setup = "from __main__ import df_unique", number=1000)
1.9672015519990964
Then convert it to a dict:
>>> df_unique.index = df_unique['id']
>>> df_dict = df_unique.to_dict(orient='index')
>>> timeit.timeit("df_dict[50]", setup = "from __main__ import df_dict", number=1000)
6.247700002859347e-05
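For reference, `to_dict(orient='index')` maps each index value to a `{column: value}` dict, so each lookup is a single hash probe:

```python
import pandas as pd

# Unique ids, indexed by id, as above
df_unique = pd.DataFrame({"id": range(1_000_000)})
df_unique.index = df_unique["id"]

# Result shape: {index_value: {column: value, ...}}
df_dict = df_unique.to_dict(orient="index")

row = df_dict[50]   # O(1) hash lookup instead of an O(n) column scan
print(row)          # {'id': 50}
```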
Well, looks like this is a clear winner.
>>> timeit.timeit("pd.Series(df_dict[50])", setup = "from __main__ import df_dict, pd", number=1000)
0.2747819870000967
Even if you have to cast it back to a Series for something, this is still an order of magnitude faster than before. (You could also pre-build the Series and store them in the dict, keeping the speed of a dict lookup with no per-lookup overhead.)
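If consumers need a Series rather than a plain dict, one way to do the pre-building just mentioned is to pay the Series-construction cost once per row of interest, so each lookup stays a pure dict probe (`hot_ids` and `df_series` are illustrative names):

```python
import pandas as pd

df_unique = pd.DataFrame({"id": range(1_000_000)})
df_unique.index = df_unique["id"]
df_dict = df_unique.to_dict(orient="index")

# Build the Series once, up front, for the rows you expect to fetch,
# instead of converting on every lookup
hot_ids = [50, 500_000]
df_series = {k: pd.Series(df_dict[k], name=k) for k in hot_ids}

row = df_series[50]   # still a plain O(1) dict lookup
```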
Check how fast df.query('id == 500000') runs.