How to merge rows with combination of values in a DataFrame

I have a DataFrame (df1) as shown below:

    Hair  Feathers  Legs  Type  Count
R1     1       NaN     0     1      1
R2     1         0   NaN     1     32
R3     1         0     2     1      4
R4     1       NaN     4     1     27
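For anyone who wants to reproduce this, here is a minimal sketch of how df1 could be built. The R1..R4 index labels and the use of np.nan for the missing entries are assumptions about how the data is actually stored.

```python
import numpy as np
import pandas as pd

# Assumed construction of df1: missing entries are np.nan,
# rows are labelled R1..R4 as in the table above.
df1 = pd.DataFrame(
    {
        "Hair": [1, 1, 1, 1],
        "Feathers": [np.nan, 0, 0, np.nan],
        "Legs": [0, np.nan, 2, 4],
        "Type": [1, 1, 1, 1],
        "Count": [1, 32, 4, 27],
    },
    index=["R1", "R2", "R3", "R4"],
)

# Number of missing values per column
missing_per_column = df1.isna().sum().to_dict()
print(missing_per_column)
```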

I want to merge rows based on the different combinations of values in each column, and add up the Count values of the rows that are merged. The resulting DataFrame (df2) will look like this:

    Hair  Feathers  Legs  Type  Count
 R1   1      0        0     1     33
 R2   1      0        2     1     36
 R3   1      0        4     1     59

The merging is performed in such a way that any NaN value can be merged with an actual value such as 0 or 1. In df2, R1 is calculated by merging the NaN value of Feathers (df1, R1) with the 0 value of Feathers (df1, R2). Similarly, the 0 value of Legs (df1, R1) is merged with the NaN value of Legs (df1, R2). Then the counts of R1 (1) and R2 (32) are added. In the same manner, R2 and R3 are merged because the Feathers value in R2 (df1) is the same as in R3 (df1), and the NaN value of Legs in R2 is merged with the 2 in R3 (df1); the counts of R2 (32) and R3 (4) are then added.

I hope the explanation makes sense. Any help will be highly appreciated.

A possible way to do it is to replicate each row that contains NaN and fill the copies with the possible values for that column.

First, we need to get the unique non-null values of each column:

# Non-null unique values of every column except the last one (Count)
unique_values = {
    c: df[c].dropna().unique().tolist() for c in df.columns[:-1]
}

> unique_values
{'Hair': [1.0], 'Feathers': [0.0], 'Legs': [0.0, 2.0, 4.0], 'Type': [1.0]}

Then iterate through the rows of the dataframe and replace each NaN with the possible values for its column. We can do this using pandas.DataFrame.iterrows:

import pandas as pd

# Rows with at least one NaN in the value columns (everything but Count)
mask = df.iloc[:, :-1].isnull().any(axis=1)

# Keep the rows without NaN as they are, then append the filled copies
list_of_df = [r for i, r in df[~mask].iterrows()]

for row_index, row in df[mask].iterrows():
    # For each NaN column of the row, add one copy of the row
    # per possible value of that column
    for c in row[row.isnull()].index:
        for v in unique_values[c]:
            list_of_df.append(row.copy().fillna({c: v}))

# Each Series becomes a column, so transpose to get one row per Series
df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T

The result is a dataframe in which every NaN has been filled with the possible values for its column:

> df_res

   Hair  Feathers  Legs  Type  Count
0   1.0       0.0   2.0   1.0    4.0
1   1.0       0.0   0.0   1.0    1.0
2   1.0       0.0   0.0   1.0   32.0
3   1.0       0.0   2.0   1.0   32.0
4   1.0       0.0   4.0   1.0   32.0
5   1.0       0.0   4.0   1.0   27.0

To get the final Count for each possible combination of ['Hair', 'Feathers', 'Legs', 'Type'], we just need to group by those columns and sum:

> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()  

   Hair  Feathers  Legs  Type  Count
0   1.0       0.0   0.0   1.0   33.0
1   1.0       0.0   2.0   1.0   36.0
2   1.0       0.0   4.0   1.0   59.0

Hope this helps.

UPDATE

If more than one element in a row is missing, the procedure has to consider all possible combinations of the missing values at the same time. Let us add a new row with two elements missing:

> df

   Hair  Feathers  Legs  Type  Count
0   1.0       NaN   0.0   1.0    1.0
1   1.0       0.0   NaN   1.0   32.0
2   1.0       0.0   2.0   1.0    4.0
3   1.0       NaN   4.0   1.0   27.0
4   1.0       NaN   NaN   1.0   32.0
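To see why the one-column-at-a-time loop from the first version is not enough for row 4, here is a small sketch that applies it to that row alone. The row and the unique_values dict are typed in by hand from the tables above, so treat them as illustrative.

```python
import numpy as np
import pandas as pd

# Row 4 of the new df, with both Feathers and Legs missing
row = pd.Series(
    {"Hair": 1.0, "Feathers": np.nan, "Legs": np.nan, "Type": 1.0, "Count": 32.0}
)
unique_values = {"Feathers": [0.0], "Legs": [0.0, 2.0, 4.0]}

# Same logic as the first loop: fill one NaN column per copy
partially_filled = []
for c in row[row.isnull()].index:
    for v in unique_values[c]:
        partially_filled.append(row.copy().fillna({c: v}))

# Every copy still has one NaN left in the column that was not filled
remaining_nans = [int(r.isnull().sum()) for r in partially_filled]
print(remaining_nans)
```

Each copy keeps a NaN in one of the key columns, and groupby drops NaN keys by default, so these copies would never reach the final sums.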

We will proceed in a similar way, but the replacement combinations will be obtained using itertools.product:

import itertools

import pandas as pd

# Non-null unique values of every column except the last one (Count)
unique_values = {
    c: df[c].dropna().unique().tolist() for c in df.columns[:-1]
}

mask = df.iloc[:, :-1].isnull().any(axis=1)

# Rows without NaN are kept as they are
list_of_df = [r for i, r in df[~mask].iterrows()]

for row_index, row in df[mask].iterrows():
    # Columns that are missing in this row
    cols = row[row.isnull()].index.tolist()
    # One copy of the row per combination of possible values
    for p in itertools.product(*[unique_values[c] for c in cols]):
        list_of_df.append(row.copy().fillna(dict(zip(cols, p))))

df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
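As a quick illustration of what itertools.product contributes here, the sketch below lists the replacement dicts generated for a row where both Feathers and Legs are missing. The unique_values dict is copied by hand from the output shown earlier.

```python
import itertools

# Possible values per column, as computed earlier
unique_values = {"Feathers": [0.0], "Legs": [0.0, 2.0, 4.0]}

# Columns missing in the row being expanded
cols = ["Feathers", "Legs"]

# One dict per combination: 1 Feathers value x 3 Legs values = 3 dicts
combos = [
    dict(zip(cols, p))
    for p in itertools.product(*(unique_values[c] for c in cols))
]

for combo in combos:
    print(combo)
```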


> df_res.sort_values(['Hair', 'Feathers', 'Legs', 'Type'])

   Hair  Feathers  Legs  Type  Count
1   1.0       0.0   0.0   1.0    1.0
2   1.0       0.0   0.0   1.0   32.0
6   1.0       0.0   0.0   1.0   32.0
0   1.0       0.0   2.0   1.0    4.0
3   1.0       0.0   2.0   1.0   32.0
7   1.0       0.0   2.0   1.0   32.0
4   1.0       0.0   4.0   1.0   32.0
5   1.0       0.0   4.0   1.0   27.0
8   1.0       0.0   4.0   1.0   32.0

> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()

   Hair  Feathers  Legs  Type  Count
0   1.0       0.0   0.0   1.0   65.0
1   1.0       0.0   2.0   1.0   68.0
2   1.0       0.0   4.0   1.0   91.0
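If iterating over rows becomes slow on a larger frame, one possible alternative (a different technique from the loop above, not a drop-in from the original answer) is to replace each NaN with the list of candidate values and let DataFrame.explode do the replication. This is only a sketch: the frame is rebuilt by hand from the table above, and the ignore_index argument assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# The five-row frame from the update, rebuilt by hand
df = pd.DataFrame(
    {
        "Hair": [1, 1, 1, 1, 1],
        "Feathers": [np.nan, 0, 0, np.nan, np.nan],
        "Legs": [0, np.nan, 2, 4, np.nan],
        "Type": [1, 1, 1, 1, 1],
        "Count": [1, 32, 4, 27, 32],
    }
)
keys = ["Hair", "Feathers", "Legs", "Type"]

expanded = df.copy()
for c in keys:
    # Candidate values come from the original, unexpanded column
    candidates = df[c].dropna().unique().tolist()
    # NaN -> list of all candidates, known value -> one-element list
    expanded[c] = expanded[c].map(
        lambda v: candidates if pd.isna(v) else [v]
    )
    # One row per list element; exploding column after column
    # yields every combination for rows with several NaNs
    expanded = expanded.explode(c, ignore_index=True)

# explode leaves object dtype behind, so convert back before grouping
result = (
    expanded.astype(float)
    .groupby(keys, as_index=False)["Count"]
    .sum()
)
print(result)
```

On this small example it gives the same three sums as the loop-based version; it has not been benchmarked here.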

