
Efficiently split pandas dataframe based on combinations of one column values

Let's say I have a dataframe with one column, and that column has 3 unique values.


import pandas as pd
df = pd.DataFrame(['a', 'b', 'c'], columns = ['string'])
df

I want to split this dataframe into smaller dataframes such that each one contains exactly 2 of the unique values. In the case above I need 3 dataframes, since 3C2 (nCr) = 3: df1 - [a, b], df2 - [a, c], df3 - [b, c]. My current implementation is shown below.


import itertools
for i in itertools.combinations(df.string.values, 2):
    print(df[df.string.isin(i)], '\n')

I am looking for something like groupby in pandas, because subsetting the data inside a loop is time consuming. In one sample case I have 609 unique values, and the loop took around 3 minutes to complete. I am therefore looking for an optimized way to perform the same operation, as the number of unique values may reach the thousands in real scenarios.

It will be slow because you're creating about 185k dataframes (609 choose 2 = 185,136). If all of them are supposed to hold only two values, why do they need to be dataframes?
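To put a number on how many objects the loop has to build, here is a quick check using the standard library's `math.comb`. The variable names are illustrative only; 609 is the figure quoted in the question. It prints both the unordered count (what `itertools.combinations` yields) and the ordered count for comparison.

```python
import math

n_unique = 609  # number of unique values quoted in the question

# Unordered pairs: this is how many tuples itertools.combinations(values, 2) yields.
n_pairs = math.comb(n_unique, 2)

# Ordered pairs (a, b) with a != b, shown only for comparison.
n_ordered = n_unique * (n_unique - 1)

print(n_pairs, n_ordered)

# The pair count grows quadratically, so 1000 unique values is far more work.
print(math.comb(1000, 2))
```

Because the count grows quadratically with the number of unique values, the per-pair cost matters much more than the cost of enumerating the pairs.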

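import pandas as pd

df = pd.DataFrame({'x': range(100)})
df['key'] = 1  # constant key, so the self-merge below acts as a cross join
# orient is spelled out as 'records'; the 'r' shorthand is rejected by recent pandas versions
records = df.merge(df, on='key').drop('key', axis=1).to_dict('records')
[pd.Series(x) for x in records]  # the slow step: one pandas object per pair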

You will see that records is computed quite fast, but it then takes a few minutes to generate all of these Series objects.
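If dataframes really are needed, one possible way to reduce the per-pair cost is to group the row positions once and reuse them for every pair, instead of scanning the whole column with `isin` each time. The sketch below is not part of the original answer: `iter_pair_frames` is a hypothetical helper name, and it assumes the rows of each pair should come back in their original order. It is written as a generator so the frames are produced one at a time rather than all held in memory.

```python
import itertools

import numpy as np
import pandas as pd


def iter_pair_frames(df, column):
    """Yield ((a, b), sub_frame) for every unordered pair of unique values in `column`."""
    # Map each unique value to the integer positions of its rows, computed once.
    positions = df.groupby(column, sort=False).indices
    for a, b in itertools.combinations(positions, 2):
        # Merge the two precomputed position arrays and restore original row order.
        rows = np.sort(np.concatenate([positions[a], positions[b]]))
        yield (a, b), df.iloc[rows]


# Small example with a repeated value, so one group has more than one row.
df = pd.DataFrame({'string': ['a', 'b', 'c', 'a']})
for pair, sub in iter_pair_frames(df, 'string'):
    print(pair, sub['string'].tolist())
```

This still creates one dataframe per pair, so the quadratic growth remains; it only avoids repeating the full-column membership test for every pair.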
