Convert pandas df to list of lists with varying length

I have a dataframe just like this and I need to convert it to a list of lists according to the "desired output":

    d = {'0': ['A','A','A','B','B','C','C','C','C'],
         '1': [4.3,3.2,2.1,9.1,2.0,2.8,1.7,0.8,0.2]}
    
    df = pd.DataFrame(d)
    
           0    1
        0  A  4.3
        1  A  3.2
        2  A  2.1
        3  B  9.1
        4  B  2.0
        5  C  2.8
        6  C  1.7
        7  C  0.8
        8  C  0.2
    
    # Desired output
[[4.3, 3.2, 2.1], [9.1, 2.0], [2.8, 1.7, 0.8, 0.2]]

I wrote the following to do it and it gets the job done:

d_tuples = list(zip(df['0'], df['1']))  # columns are the strings '0' and '1'
keys = df['0'].unique()
list_of_lists = []
for key in keys:
    # rescans every tuple once per key -- O(rows * unique keys)
    list_of_lists += [[tup[1] for tup in d_tuples if tup[0] == key]]
list_of_lists  # [[4.3, 3.2, 2.1], [9.1, 2.0], [2.8, 1.7, 0.8, 0.2]]

However, the original database is about 25,000,000 rows long and it's taking some time; I was wondering if there's a more efficient way to write it.

EDIT: "desired output" means a list_of_lists where each inner list contains the values in column "1" for one of the unique values in column "0".

EDIT2: Added timeit results.

A groupby object's .groups attribute is a dict; you can use it to avoid .agg and speed things up further:

In [229]: [v.tolist() for v in df.set_index('1').groupby('0').groups.values()]
Out[229]: [[4.3, 3.2, 2.1], [9.1, 2.0], [2.8, 1.7, 0.8, 0.2]]
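
For illustration, here is what .groups holds on the small df above (a minimal sketch; the exact Index repr varies by pandas version): after set_index('1'), it maps each key from column '0' to an Index of that group's '1' values, which is why a plain .tolist() per value is enough.

import pandas as pd

d = {'0': ['A','A','A','B','B','C','C','C','C'],
     '1': [4.3, 3.2, 2.1, 9.1, 2.0, 2.8, 1.7, 0.8, 0.2]}
df = pd.DataFrame(d)

print(df.set_index('1').groupby('0').groups)
# {'A': Index([4.3, 3.2, 2.1], dtype='float64', name='1'),
#  'B': Index([9.1, 2.0], dtype='float64', name='1'),
#  'C': Index([2.8, 1.7, 0.8, 0.2], dtype='float64', name='1')}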

Timing on 90K rows:

df = pd.concat([df] * 10000)

%timeit [v.tolist() for v in df.set_index('1').groupby('0').groups.values()]
15.2 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby('0')['1'].agg(list).tolist()
32.8 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [236]: %%timeit
     ...: d_tuples = [*list(zip(df['0'],df['1']))]
     ...: keys = df['0'].unique()
     ...: list_of_lists = []
     ...: for key in keys:
     ...:     list_of_lists+=[[tup[1] for tup in d_tuples if tup[0] == key]]
     ...:
69.4 ms ± 754 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Use .groupby with .agg(list), plus another .tolist() call:

df = pd.DataFrame(d)
df.groupby('0')['1'].agg(list).tolist()
# output:
[[4.3, 3.2, 2.1], [9.1, 2.0], [2.8, 1.7, 0.8, 0.2]]

Maybe use apply():

list(df.groupby('0')['1'].apply(list))

#[[4.3, 3.2, 2.1], [9.1, 2.0], [2.8, 1.7, 0.8, 0.2]]

There probably isn't a faster way to do it with default Python lists, unfortunately. Depending on the data you have, you might use numpy arrays instead (they are memory efficient, so that will give you a speed-up), i.e. list(df.groupby('0')['1'].apply(np.array)). Depending on the number of unique keys, the speed-up can be anywhere from 10% to 100% (according to local tests on my machine).
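
Going one step further in the same direction, you can bypass groupby entirely: factorize the keys, sort once, and split the values array at the group boundaries. This is a sketch of that pure-numpy trick under the same assumptions (arrays as output, groups ordered by first appearance); it is not part of the answer above, and the variable names are just illustrative.

import numpy as np
import pandas as pd

codes, uniques = pd.factorize(df['0'])    # integer id per key, in order of first appearance
order = np.argsort(codes, kind='stable')  # stable sort preserves within-group order of '1'
vals = df['1'].to_numpy()[order]

# indices where the sorted key id changes mark the group boundaries
boundaries = np.flatnonzero(np.diff(codes[order])) + 1
list_of_arrays = np.split(vals, boundaries)
# on the small df: [array([4.3, 3.2, 2.1]), array([9.1, 2. ]), array([2.8, 1.7, 0.8, 0.2])]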

PS By the way, don't test with small dataframes. Create a bigger one like this:

import numpy as np
import pandas as pd

N = 500
keys = np.arange(0, N)

d = {
    '0': keys[np.random.randint(0, N, size=int(1e6))],
    '1': np.random.rand(int(1e6))
}

df = pd.DataFrame(d)
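
With that frame in hand, the candidates can also be compared outside IPython; a minimal sketch using time.perf_counter instead of %timeit (which needs an IPython session):

import time

t0 = time.perf_counter()
via_groups = [v.tolist() for v in df.set_index('1').groupby('0').groups.values()]
t1 = time.perf_counter()
via_agg = df.groupby('0')['1'].agg(list).tolist()
t2 = time.perf_counter()

print(f"groups-based: {t1 - t0:.3f}s, agg-based: {t2 - t1:.3f}s ({len(via_groups)} groups)")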
