
Efficiently selecting rows from pandas dataframe using sorted column

I have a large-ish pandas DataFrame with multiple columns (c1 ... c8) and ~32 million rows. The DataFrame is already sorted by c1. I want to grab the values of other columns from the rows that share a particular value of c1.

Something like:

import numpy as np

keys = big_df['c1'].unique()          # all distinct values of c1
red = np.zeros(len(keys))             # one reduced value per key
for i, key in enumerate(keys):
    inds = (big_df['c1'] == key)      # boolean mask over all ~32M rows
    v1 = np.array(big_df.loc[inds]['c2'])
    v2 = np.array(big_df.loc[inds]['c6'])
    red[i] = reduce_fun(v1, v2)

However, this turns out to be very slow, I think because each big_df['c1'] == key comparison scans the entire column (even though perhaps only 10 rows out of 32 million are relevant). Since big_df is sorted by c1 and keys is just the list of all unique values of c1, is there a fast way to build the red[] array? In other words, the first row for the next key is the row right after the last row of the previous key, and the last row for a key is the last row that matches it, since all subsequent rows are guaranteed not to match.
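Since c1 is sorted, every key's rows form one contiguous block, so the block boundaries can be found once with binary search instead of scanning the whole column per key. Below is a minimal sketch of that idea (not code from the post): the small synthetic big_df and the dot-product reduce_fun are placeholders for the real data and reduction.

import numpy as np
import pandas as pd

# Stand-in data: a small frame sorted by c1, mimicking the structure described above.
rng = np.random.default_rng(0)
big_df = pd.DataFrame({
    'c1': np.sort(rng.integers(0, 5, size=100)),
    'c2': rng.random(100),
    'c6': rng.random(100),
})

def reduce_fun(v1, v2):            # placeholder reduction
    return float(np.dot(v1, v2))

# Each key occupies a contiguous block of rows, so one binary search per side
# yields the block boundaries without any boolean masks.
c1 = big_df['c1'].to_numpy()
keys = np.unique(c1)                              # sorted unique keys
starts = np.searchsorted(c1, keys, side='left')   # first row of each block
ends = np.searchsorted(c1, keys, side='right')    # one past the last row

c2 = big_df['c2'].to_numpy()
c6 = big_df['c6'].to_numpy()

red = np.zeros(len(keys))
for i, (lo, hi) in enumerate(zip(starts, ends)):
    red[i] = reduce_fun(c2[lo:hi], c6[lo:hi])

Each key then costs O(log n) for the two lookups plus the slice itself, rather than a full pass over the 32 million rows.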

Thanks,

Ilya

Edit: I am not sure what order the unique() method produces, but basically I want a value of reduce_fun() for every key in keys; I don't particularly care what order they come in (presumably the easiest order is the one c1 is already sorted in).

Edit 2: I slightly restructured the code. Basically: is there an efficient way of constructing inds? The big_df['c1'] == key comparison takes 75.8% of the total time on my data, while creating v1 and v2 takes 21.6%, according to the line profiler.
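A one-pass variant of the same idea, again only a sketch (it assumes big_df and reduce_fun as in the question): np.unique with return_index=True returns every block's starting row directly, so the boolean inds mask is never built at all.

import numpy as np

# The sorted column means each key's rows are contiguous; return_index gives
# the first row of every block in a single pass.
c1 = big_df['c1'].to_numpy()
keys, starts = np.unique(c1, return_index=True)

# Split the value columns at the block boundaries (starts[0] is always 0).
c2_blocks = np.split(big_df['c2'].to_numpy(), starts[1:])
c6_blocks = np.split(big_df['c6'].to_numpy(), starts[1:])

red = np.array([reduce_fun(v1, v2) for v1, v2 in zip(c2_blocks, c6_blocks)])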

Rather than a list, I chose a dictionary to hold the reduced values, keyed on each item in c1.

red = {key: reduce_fun(frame['c2'].values, frame['c6'].values)
       for key, frame in big_df.groupby('c1')}
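A minimal toy run (my own example with a made-up frame and a placeholder reduce_fun), just to show the shape of the result:

import numpy as np
import pandas as pd

def reduce_fun(v1, v2):            # placeholder reduction
    return float(np.dot(v1, v2))

big_df = pd.DataFrame({
    'c1': [1, 1, 2, 2, 2, 3],
    'c2': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    'c6': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

red = {key: reduce_fun(frame['c2'].values, frame['c6'].values)
       for key, frame in big_df.groupby('c1')}
# red is roughly {1: 0.5, 2: 5.0, 3: 3.6} (up to float rounding)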

How about a groupby statement in a list comprehension? This should be especially efficient given that the DataFrame is already sorted by c1:

Edit: Forgot that groupby returns a tuple. Oops!

red = [reduce_fun(g['c2'].values, g['c6'].values) for i, g in big_df.groupby('c1', sort=False)]  # i is the group key, g is that key's sub-frame

Seems to chug through pretty quickly for me (~2 seconds for 30 million random rows and a trivial reduce_fun).
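For reference, a rough timing harness (my own sketch with synthetic data and a trivial placeholder reduce_fun; the actual number depends heavily on how many distinct keys there are and on hardware):

import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 30_000_000
big_df = pd.DataFrame({
    'c1': np.sort(rng.integers(0, n // 10, size=n)),  # ~10 rows per key, pre-sorted
    'c2': rng.random(n),
    'c6': rng.random(n),
})

def reduce_fun(v1, v2):            # trivial placeholder reduction
    return v1.sum() + v2.sum()

start = time.perf_counter()
red = [reduce_fun(g['c2'].values, g['c6'].values)
       for _, g in big_df.groupby('c1', sort=False)]
print(f"groupby comprehension: {time.perf_counter() - start:.1f} s")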
