
Fastest way to split a pandas dataframe into a list of subdataframes

I have a large dataframe df, and a list indices containing every unique value in df.index. I now want to create a list of the sub-dataframes selected by each element of indices; specifically:

list_df = [df.loc[x] for x in indices]

Running this command is taking ages, though (df has about 3e6 rows and 3e3 unique index values). Is this a reasonable way to perform this operation? I would be glad to receive any comments or suggestions that could improve the performance of this and related problems.

Thanks in advance!

You can use a list comprehension over a groupby object grouped by the index (level=0). Passing sort=False skips the default sorting of the group keys, which makes it faster:

L = [x for i, x in df.groupby(level=0, sort=False)]
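Note that with sort=False the groups come out in order of first appearance in df.index, which need not match the order of indices. A minimal sketch, assuming a small made-up frame and a hypothetical indices list, that keys the groups by label and then rebuilds the list in the order of indices:

```python
import pandas as pd

# Small illustrative frame (made-up data); the index has repeated labels.
df = pd.DataFrame({'A': list('abcdef'), 'B': range(6)},
                  index=[2, 0, 2, 1, 0, 2])

# Hypothetical list of unique labels, in the order the caller wants.
indices = [0, 1, 2]

# One pass over the frame: map each index label to its sub-dataframe.
groups = {key: sub for key, sub in df.groupby(level=0, sort=False)}

# Rebuild the list in the order given by `indices`.
list_df = [groups[x] for x in indices]
```

The dictionary lookup is cheap, so the cost stays dominated by the single groupby pass rather than by one df.loc call per label.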

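# Benchmark setup: 1000 rows, index labels drawn from 0..99
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000
letters = list('abcdefghijklmno')  # renamed so it does not shadow the result list L
df = pd.DataFrame({'A': np.random.choice(letters, N),
                   'B': np.random.randint(10, size=N)},
                  index=np.random.randint(100, size=N))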

In [273]: %timeit [x for i, x in df.groupby(level=0, sort=False)]
100 loops, best of 3: 9.91 ms per loop

In [274]: %timeit [df.loc[x] for x in df.index]
1 loop, best of 3: 417 ms per loop
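
One difference worth checking before swapping the two approaches: df.loc[x] returns a Series rather than a DataFrame when the label x occurs only once, whereas groupby always yields DataFrames. A minimal sketch on made-up data that compares the groupby result against df.loc[[x]] (a list indexer, which always returns a DataFrame):

```python
import pandas as pd

# Made-up data: label 7 occurs exactly once, the other labels repeat.
df = pd.DataFrame({'B': [10, 11, 12, 13, 14]}, index=[3, 7, 3, 5, 5])

by_group = {key: sub for key, sub in df.groupby(level=0, sort=False)}

# A scalar label lookup collapses a single matching row to a Series ...
single = df.loc[7]
# ... while a list indexer always returns a DataFrame.
single_frame = df.loc[[7]]

# Each group should equal the corresponding df.loc[[label]] selection.
all_equal = all(sub.equals(df.loc[[key]]) for key, sub in by_group.items())
```

If downstream code expects every element of the list to be a DataFrame, the groupby version is the more uniform of the two.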
