
Merge two columns if their values are the same in a third column pandas

I have a pandas DataFrame:

import pandas as pd
df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'], 
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'D': ['x', 'y', 'y', 'x'],})

I want to merge all columns that carry the same ("redundant") information, while the unique values in A stay as they are:

output = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                       'BC': ['bd', 'bd', 'ce', 'ce'],
                       'D': ['x', 'y', 'y', 'x'],})

It would be best to have a solution that works independently of the column names B and C (there may be more columns with this "redundant information"). The column name of A is known.

If my initial dataframe is instead:

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'], 
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'd', 'e'],
                   'D': ['x', 'y', 'y', 'x'],})

the desired output is the initial df (no change):

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'], 
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'd', 'e'],
                   'D': ['x', 'y', 'y', 'x'],})

Many thanks!

Full solution (thanks to Robby the Belgian):

import pandas as pd
df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'D': ['x', 'y', 'y', 'x']})

print(df)

def is_redundant(df, A, B):
    # A and B count as redundant when grouping by both columns
    # yields no more groups than grouping by A alone.
    return len(df.groupby(A)) == len(df.groupby([A, B]))

def drop_redundant(df, redundant_groups):
    # Replace every group of redundant columns with one concatenated column.
    for group in redundant_groups:
        df[''.join(group)] = df[group].sum(axis=1)
        df.drop(group, axis=1, inplace=True)
    return df

cols = [c for c in df.columns if c != 'A']
redundant_groups = []
idx_left = 0
while idx_left < len(cols)-1:
    new_group = []
    idx_right = idx_left+1
    while idx_right < len(cols):
        if is_redundant(df, cols[idx_left], cols[idx_right]):
            new_group.append(cols.pop(idx_right))
        else:
            idx_right += 1
    if new_group:
        redundant_groups.append(new_group + [cols[idx_left]])
    idx_left += 1

print(redundant_groups)

drop_redundant(df, redundant_groups)

print(df)

Output:

    A  B  C  D
0  x1  b  d  x
1  x2  b  d  y
2  x3  c  e  y
3  x4  c  e  x
[['C', 'B']]
    A  D  CB
0  x1  x  db
1  x2  y  db
2  x3  y  ec
3  x4  x  ec

To check whether columns 'B' and 'C' are "redundant":

len(df.groupby('B')) == len(df.groupby(['B', 'C']))

This checks whether adding 'C' to the grouping labels creates more groups than grouping by 'B' alone. If the two counts are equal, 'C' adds no information beyond what 'B' already provides.
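As a small illustration, the check can be run on the two frames from the question. The helper name group_counts and the variable names df_same and df_diff are made up for this sketch; only the groupby comparison itself comes from the answer.

```python
import pandas as pd

# First frame from the question: B and C carry the same information.
df_same = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                        'B': ['b', 'b', 'c', 'c'],
                        'C': ['d', 'd', 'e', 'e'],
                        'D': ['x', 'y', 'y', 'x']})

# Second frame from the question: C splits the 'c' group of B in two.
df_diff = df_same.assign(C=['d', 'd', 'd', 'e'])

def group_counts(df, first, second):
    """Hypothetical helper: group counts for `first` alone and for both columns."""
    return len(df.groupby(first)), len(df.groupby([first, second]))

print(group_counts(df_same, 'B', 'C'))  # (2, 2): equal, so redundant
print(group_counts(df_diff, 'B', 'C'))  # (2, 3): more groups, so not redundant
```

Equal counts mean the second column never splits a group formed by the first one.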

You can then run this on all pairs of labels in df.columns (making sure not to include 'A').
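One way to run the check over all pairs is itertools.combinations. This is only a sketch: the names candidates and redundant_pairs are illustrative, and each pair is tested in one order only.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'D': ['x', 'y', 'y', 'x']})

def is_redundant(df, c1, c2):
    # Same comparison as above: does c2 split any group formed by c1?
    return len(df.groupby(c1)) == len(df.groupby([c1, c2]))

# Every column except the known key column 'A' is a candidate.
candidates = [c for c in df.columns if c != 'A']

redundant_pairs = [(c1, c2)
                   for c1, c2 in combinations(candidates, 2)
                   if is_redundant(df, c1, c2)]

print(redundant_pairs)  # [('B', 'C')]
```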

If you find that two columns hold redundant information, you can use:

df['B' + 'C'] = df[['B', 'C']].sum(axis=1)
df.drop(['B', 'C'], axis=1, inplace=True)

to replace them with a single column holding the combined information.

If you want to use this in a double loop (checking all pairs of columns), you'll have to be careful: you might have three columns that all contain the same information (say B, C, and F), and after merging B and C you would try to compare B and F, but column B no longer exists.

To deal with this, I would first construct a list of all groups of redundant columns, and only merge afterwards. Let's assume we have an is_redundant(df, c1, c2) function (which uses the comparison above).

cols = [c for c in df.columns if c != 'A']
redundant_groups = []
idx_left = 0
while idx_left < len(cols)-1:
    new_group = []
    idx_right = idx_left+1
    while idx_right < len(cols):    
        if is_redundant(df, cols[idx_left], cols[idx_right]):
            new_group.append(cols.pop(idx_right))
        else:
            idx_right += 1
    if new_group:
        redundant_groups.append(new_group + [cols[idx_left]])
    idx_left += 1

This creates groups of columns that are all redundant with one another. Each column that joins a group is popped from cols, so it is never compared again.
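One caveat worth checking: the groupby comparison is one-directional. It asks whether the second column splits the groups of the first, not the reverse, so a constant column looks redundant next to any other column. If strictly mutual redundancy is wanted, a symmetric variant could look like the sketch below. The function name is_mutually_redundant and the extra column F are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'F': ['k', 'k', 'k', 'k']})  # hypothetical constant column

def is_redundant(df, c1, c2):
    # One-directional check from the answer.
    return len(df.groupby(c1)) == len(df.groupby([c1, c2]))

def is_mutually_redundant(df, c1, c2):
    # Hypothetical symmetric variant: neither column may split the other's groups.
    n_both = len(df.groupby([c1, c2]))
    return len(df.groupby(c1)) == n_both and len(df.groupby(c2)) == n_both

print(is_redundant(df, 'B', 'F'))           # True: F never splits a group of B
print(is_redundant(df, 'F', 'B'))           # False: B splits the single group of F
print(is_mutually_redundant(df, 'B', 'F'))  # False
print(is_mutually_redundant(df, 'B', 'C'))  # True
```

Whether the one-directional or the symmetric test is the right one depends on whether a column like F should be folded into B or kept separate.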

After that, you can modify the combination code above to handle several columns at once.
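A minimal sketch of such a multi-column version follows. The function name combine_groups, the extra column F, and the hard-coded redundant_groups list are assumptions for illustration; in practice the list would come from the grouping loop. The sketch returns a new frame rather than modifying df in place, and concatenates with + instead of sum(axis=1), which is a design choice of this example rather than the answer's method.

```python
import pandas as pd

def combine_groups(df, redundant_groups):
    """Return a copy of df in which each group of columns becomes one column."""
    result = df.copy()
    for group in redundant_groups:
        # Concatenate the string values of all columns in the group, row by row.
        merged = result[group[0]].astype(str)
        for col in group[1:]:
            merged = merged + result[col].astype(str)
        result = result.drop(columns=group)
        result[''.join(group)] = merged
    return result

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'F': ['p', 'p', 'q', 'q'],  # hypothetical third redundant column
                   'D': ['x', 'y', 'y', 'x']})

# Assumed result of the grouping step for this frame (order chosen for readability).
redundant_groups = [['B', 'C', 'F']]

out = combine_groups(df, redundant_groups)
print(out)
```

The merged column is named by joining the group's labels (here BCF) and is appended after the remaining columns; the original df is left unchanged.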
