
Optimizing Pandas DataFrame Filtering

I have a pandas DataFrame in which, at every iteration of a loop, I need to find a specific row based on a condition unique to that iteration. I do this with:

condition = ((all_together["Chr"] == chrom) &
             (all_together["Start"] <= true_center) &
             (all_together["End"] >= true_center))
this_bin = all_together[condition]

where all_together is the name of my DataFrame, and chrom and true_center are parameters unique to each loop iteration.

Based on %prun and %lprun profiling, the vast majority of the time is spent evaluating the "Start" and "End" parts of the condition. I optimized the "Chr" lookup by converting that column to a categorical dtype, but the "Start" and "End" columns hold numbers that need to be compared to true_center, another number.

Does anyone have any ideas on how to speed this up? The data is effectively "sorted" numerically, but I can't find a good way to use that to my advantage here. Any other approaches are welcome too; thanks for any help!

For a couple of conditions, you may find np.logical_and more efficient:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame({'val': np.random.randint(0, 100, 10000000)})

x = np.logical_and(df['val'] >= 20, df['val'] <= 60)
y = df['val'].between(20, 60)
z = (df['val'] >= 20) & (df['val'] <= 60)

assert (x==y).all() and (y==z).all()

%timeit np.logical_and(df['val'] >= 20, df['val'] <= 60)  # 36.8 ms
%timeit df['val'].between(20, 60)                         # 59.7 ms
%timeit (df['val'] >= 20) & (df['val'] <= 60)             # 60.4 ms
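The question also notes that the data is effectively sorted. Assuming all_together is sorted by "Start" (an assumption; the asker would need to sort, or sort within each chromosome group, first), np.searchsorted can replace the full-column "Start" comparison with a binary search, so only the rows that can possibly match are scanned for the "End" condition. A sketch on synthetic data, with hypothetical column names matching the question:

```python
import numpy as np
import pandas as pd

# Hypothetical data mimicking the question's layout:
# intervals sorted ascending by "Start".
np.random.seed(0)
starts = np.sort(np.random.randint(0, 1_000_000, 100_000))
df = pd.DataFrame({'Start': starts, 'End': starts + 500})

true_center = 123_456

# Because "Start" is sorted, a binary search finds the first row whose
# Start exceeds true_center in O(log n) instead of comparing every row.
hi = np.searchsorted(df['Start'].to_numpy(), true_center, side='right')

# Only the rows before that cutoff can satisfy Start <= true_center,
# so the "End" comparison runs on a much smaller slice.
candidates = df.iloc[:hi]
this_bin = candidates[candidates['End'].to_numpy() >= true_center]
```

Working on the raw arrays via .to_numpy() also sidesteps pandas' index-alignment overhead in the comparison itself. Whether this beats the vectorized mask depends on how far into the frame true_center typically falls.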
