Fastest way to split a pandas dataframe into a list of subdataframes

Question

I have a large dataframe df, together with a list indices that holds every unique element of df.index. I now want to create a list of all the subdataframes selected by the elements of indices; specifically:

list_df = [df.loc[x] for x in indices]

Running this command is taking ages, though (df has about 3e6 rows and 3e3 unique index values). Is this a reasonable way to perform this operation? I would be very happy to receive any comments or suggestions that could improve the performance of this and related problems.

Thanks in advance!

Answer 1

You can use a list comprehension over a groupby object that groups by the index (level=0). Passing sort=False turns off the default sorting of the group keys, which makes the solution faster:

L = [x for i, x in df.groupby(level=0, sort=False)]
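To make the behaviour of that one-liner concrete, here is a minimal sketch on a small made-up frame (the data and the names keys, parts and same are only for illustration, not from the question). It suggests that each group holds the same rows that df.loc would select for that label, and that with sort=False the groups come out in order of first appearance rather than sorted order.

```python
import pandas as pd

# Hypothetical toy frame with a repeated, unsorted index (not the asker's data).
df = pd.DataFrame({"A": list("abcdef"), "B": range(6)},
                  index=[2, 1, 2, 3, 1, 2])

# One pass over the data: each group is the sub-DataFrame for one index label.
keys, parts = [], []
for key, part in df.groupby(level=0, sort=False):
    keys.append(int(key))
    parts.append(part)

# With sort=False the groups follow the order in which labels first appear.
print(keys)

# Each piece holds the same rows as df.loc[[key]]. The list form [[key]] is
# used so the result is a DataFrame even for a label that occurs only once.
same = all(part.equals(df.loc[[key]]) for key, part in zip(keys, parts))
print(same)
```

One difference from the loop in the question: df.loc[x] with a scalar label that occurs only once returns a Series, while the groupby pieces are always DataFrames.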

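import numpy as np
import pandas as pd

# Sample data: 1000 rows spread over at most 100 distinct index labels
np.random.seed(123)
N = 1000
letters = list('abcdefghijklmno')
df = pd.DataFrame({'A': np.random.choice(letters, N),
                   'B': np.random.randint(10, size=N)}, index=np.random.randint(100, size=N))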

In [273]: %timeit [x for i, x in df.groupby(level=0, sort=False)]
100 loops, best of 3: 9.91 ms per loop

In [274]: %timeit [df.loc[x] for x in df.index]
1 loop, best of 3: 417 ms per loop
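Note that the timed loop above walks over df.index, duplicates included, whereas the question iterates over the unique labels only. As a rough sketch for rerunning the comparison outside IPython on the question's terms, the following uses the standard timeit module on the same kind of sample data. The helper names by_groupby and by_loc and the repeat count of 20 are arbitrary choices for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000
letters = list("abcdefghijklmno")
df = pd.DataFrame({"A": np.random.choice(letters, N),
                   "B": np.random.randint(10, size=N)},
                  index=np.random.randint(100, size=N))

# Unique labels, in order of first appearance (plays the role of `indices`).
indices = df.index.unique()

def by_groupby():
    # Single grouping pass over the frame.
    return [x for _, x in df.groupby(level=0, sort=False)]

def by_loc():
    # One label lookup per unique label; [[x]] always yields a DataFrame.
    return [df.loc[[x]] for x in indices]

# Both approaches should produce one piece per unique label.
assert len(by_groupby()) == len(by_loc()) == len(indices)

t_groupby = timeit.timeit(by_groupby, number=20)
t_loc = timeit.timeit(by_loc, number=20)
print(f"groupby: {t_groupby:.4f} s, loc: {t_loc:.4f} s (20 runs each)")
```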

1 answer

solution1
4 ACCEPTED 2017-10-10 13:27:49
