
Optimizing Pandas DataFrame Filtering

I have a pandas DataFrame in which, at every iteration of a loop, I need to find a specific row based on a condition unique to that iteration. I do this with:

condition = ((all_together["Chr"] == chrom) &
             (all_together["Start"] <= true_center) &
             (all_together["End"] >= true_center))
this_bin = all_together[condition]

where all_together is the name of my DataFrame, and chrom and true_center are parameters unique to each loop iteration.

Based on %prun and %lprun profiling, the vast majority of the time is spent evaluating the "Start" and "End" parts of the condition. I optimized the "Chr" lookup by converting that column to a categorical dtype, but the "Start" and "End" columns hold numbers that need to be compared to true_center, another number.

Does anyone have any ideas on how to speed this up? The data is effectively "sorted" numerically, but I can't find a good way to use that to my advantage here. Any other approaches are welcome too; thanks for any help!

For a couple of conditions, you may find np.logical_and more efficient:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame({'val': np.random.randint(0, 100, 10000000)})

x = np.logical_and(df['val'] >= 20, df['val'] <= 60)
y = df['val'].between(20, 60)
z = (df['val'] >= 20) & (df['val'] <= 60)

assert (x==y).all() and (y==z).all()

%timeit np.logical_and(df['val'] >= 20, df['val'] <= 60)  # 36.8 ms
%timeit df['val'].between(20, 60)                         # 59.7 ms
%timeit (df['val'] >= 20) & (df['val'] <= 60)             # 60.4 ms
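The question also notes that the data is effectively sorted. Assuming all_together is sorted by "Start" (an assumption; the asker would need to sort, or sort within each chromosome group, first), np.searchsorted can replace the full-column "Start" comparison with a binary search, so only the rows that can possibly match are scanned for the "End" condition. A sketch on synthetic data, with hypothetical column names matching the question:

```python
import numpy as np
import pandas as pd

# Hypothetical data mimicking the question's layout:
# intervals sorted ascending by "Start".
np.random.seed(0)
starts = np.sort(np.random.randint(0, 1_000_000, 100_000))
df = pd.DataFrame({'Start': starts, 'End': starts + 500})

true_center = 123_456

# Because "Start" is sorted, a binary search finds the first row whose
# Start exceeds true_center in O(log n) instead of comparing every row.
hi = np.searchsorted(df['Start'].to_numpy(), true_center, side='right')

# Only the rows before that cutoff can satisfy Start <= true_center,
# so the "End" comparison runs on a much smaller slice.
candidates = df.iloc[:hi]
this_bin = candidates[candidates['End'].to_numpy() >= true_center]
```

Working on the raw arrays via .to_numpy() also sidesteps pandas' index-alignment overhead in the comparison itself. Whether this beats the vectorized mask depends on how far into the frame true_center typically falls.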
