Merge two columns if their values are the same in a third column (pandas)

Question

I have a pandas dataframe:

import pandas as pd
df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'], 
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'D': ['x', 'y', 'y', 'x'],})

I want to merge all columns that carry the same information (here B and C). The values in A are unique.

output = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                       'BC': ['bd', 'bd', 'ce', 'ce'],
                       'D': ['x', 'y', 'y', 'x'],})

It would be best to have a solution that works independently of the column names B and C (there may be more columns with this "redundant information"). The name of column A is known.

If my initial dataframe is instead:

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'], 
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'd', 'e'],
                   'D': ['x', 'y', 'y', 'x'],})

the desired output is the initial df (no change):

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'], 
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'd', 'e'],
                   'D': ['x', 'y', 'y', 'x'],})

Many thanks!

Full solution (thanks to Robby the Belgian):

import pandas as pd
df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'D': ['x', 'y', 'y', 'x']})

print(df)

def is_redundant(df, A, B):
    # B adds no information to A if grouping by both creates no extra groups
    return len(df.groupby(A)) == len(df.groupby([A, B]))

def drop_redundant(df, redundant_groups):
    # Replace every group of redundant columns with one concatenated column.
    # Modifies df in place and also returns it.
    for group in redundant_groups:
        df[''.join(group)] = df[group].sum(axis=1)
        df.drop(group, axis=1, inplace=True)
    return df

cols = [c for c in df.columns if c != 'A']
redundant_groups = []
idx_left = 0
while idx_left < len(cols)-1:
    new_group = []
    idx_right = idx_left+1
    while idx_right < len(cols):
        if is_redundant(df, cols[idx_left], cols[idx_right]):
            new_group.append(cols.pop(idx_right))
        else:
            idx_right += 1
    if new_group:
        redundant_groups.append(new_group + [cols[idx_left]])
    idx_left += 1

print(redundant_groups)

drop_redundant(df, redundant_groups)

print(df)

Output:

    A  B  C  D
0  x1  b  d  x
1  x2  b  d  y
2  x3  c  e  y
3  x4  c  e  x
[['C', 'B']]
    A  D  CB
0  x1  x  db
1  x2  y  db
2  x3  y  ec
3  x4  x  ec

Answer 1

To check whether columns 'B' and 'C' are "redundant":

len(df.groupby('B')) == len(df.groupby(['B', 'C']))

This checks whether adding 'C' to the grouping labels creates more groups than grouping by 'B' alone. If the two counts are equal, every value of 'B' always appears with the same value of 'C'.
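As a quick sanity check of this idea, here is a minimal sketch that runs the comparison on the two example frames from the question. The helper name `same_group_count` is made up for illustration, and it uses the `ngroups` attribute instead of `len(...)`, which should give the same count.

```python
import pandas as pd

def same_group_count(df, c1, c2):
    """True if grouping by (c1, c2) creates no more groups than grouping by c1 alone."""
    return df.groupby(c1).ngroups == df.groupby([c1, c2]).ngroups

# First example from the question: B and C always change together.
df_redundant = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                             'B': ['b', 'b', 'c', 'c'],
                             'C': ['d', 'd', 'e', 'e'],
                             'D': ['x', 'y', 'y', 'x']})

# Second example from the question: C differs within the B == 'c' rows.
df_distinct = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                            'B': ['b', 'b', 'c', 'c'],
                            'C': ['d', 'd', 'd', 'e'],
                            'D': ['x', 'y', 'y', 'x']})

print(same_group_count(df_redundant, 'B', 'C'))  # True
print(same_group_count(df_distinct, 'B', 'C'))   # False
```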

You can then run this on all pairs of labels in df.columns (making sure not to include 'A').
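One way to enumerate the pairs is `itertools.combinations`. This is only a sketch: the variable names `key`, `candidates` and `redundant_pairs` are my own, and it just lists the pairs without merging anything yet.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'D': ['x', 'y', 'y', 'x']})

key = 'A'  # the known key column, excluded from the comparison
candidates = [c for c in df.columns if c != key]

# Keep every pair for which adding the second column creates no new groups.
redundant_pairs = [
    (c1, c2)
    for c1, c2 in combinations(candidates, 2)
    if df.groupby(c1).ngroups == df.groupby([c1, c2]).ngroups
]

print(redundant_pairs)  # [('B', 'C')]
```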

If you find that two columns have redundant information, you can use:

df['B' + 'C'] = df[['B', 'C']].sum(axis=1)
df.drop(['B', 'C'], axis=1, inplace=True)

to replace them with the combined information.
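The `sum(axis=1)` call works here because adding strings concatenates them. If you prefer to spell out the concatenation, a possible alternative (assuming both columns hold strings) is `Series.str.cat`, combined with a non-inplace `drop`:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'D': ['x', 'y', 'y', 'x']})

# Build the result without mutating the original frame.
merged = df.drop(columns=['B', 'C'])
merged['BC'] = df['B'].str.cat(df['C'])  # element-wise string concatenation

print(merged)
```

This leaves `df` untouched and puts the new 'BC' column after 'A' and 'D'.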

If you want to use this in a double loop (checking all pairs of columns), you'll have to be careful: you might have 3 columns that all contain the same information (say, B, C, and F), and after dealing with B and C you would try to compare B and F, but column B no longer exists.

To deal with this, I might try first constructing a list of all redundant columns, collected into groups. Let's assume we have an is_redundant(df, c1, c2) function (which uses the comparison above).

cols = [c for c in df.columns if c != 'A']
redundant_groups = []
idx_left = 0
while idx_left < len(cols)-1:
    new_group = []
    idx_right = idx_left+1
    while idx_right < len(cols):    
        if is_redundant(df, cols[idx_left], cols[idx_right]):
            new_group.append(cols.pop(idx_right))
        else:
            idx_right += 1
    if new_group:
        redundant_groups.append(new_group + [cols[idx_left]])
    idx_left += 1

This creates groups of columns that are all mutually redundant.
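One caveat, as far as I can tell: the group-count comparison is one-directional. It tells you that the second column is determined by the first, not the other way round, so a constant column would be reported as redundant with everything. If that matters for your data, a symmetric check along these lines might be safer. The names `determines` and `mutually_redundant` and the column 'K' are hypothetical, used only for this sketch.

```python
import pandas as pd

def determines(df, c1, c2):
    """True if each value of c1 always appears with a single value of c2."""
    return df.groupby(c1).ngroups == df.groupby([c1, c2]).ngroups

def mutually_redundant(df, c1, c2):
    """True only if c1 determines c2 and c2 determines c1."""
    return determines(df, c1, c2) and determines(df, c2, c1)

# 'K' is constant, so it is determined by 'B' but does not determine 'B'.
df = pd.DataFrame({'B': ['b', 'b', 'c', 'c'],
                   'K': ['k', 'k', 'k', 'k']})

print(determines(df, 'B', 'K'))          # True
print(determines(df, 'K', 'B'))          # False
print(mutually_redundant(df, 'B', 'K'))  # False
```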

After that, you can easily modify the above combination code to deal with multiple columns at once.
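A rough sketch of what that modification could look like, assuming string-valued columns and string column names. The function name `merge_groups` and the extra column 'F' are invented for the example; the group order `['C', 'F', 'B']` mimics what the loop above would produce for three redundant columns.

```python
import pandas as pd

def merge_groups(df, redundant_groups):
    """Return a copy of df where each group of columns is replaced by one joined column."""
    out = df.copy()
    for group in redundant_groups:
        # Concatenate the row values of all columns in the group.
        out[''.join(group)] = out[group].apply(''.join, axis=1)
        out = out.drop(columns=group)
    return out

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'F': ['p', 'p', 'q', 'q'],
                   'D': ['x', 'y', 'y', 'x']})

result = merge_groups(df, [['C', 'F', 'B']])
print(result)
```

Because all groups are computed before any column is dropped, the "column B no longer exists" problem does not come up.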

Answer 1 (0, accepted, 2020-09-02 17:09:36)
