Highlight column-by-column differences for each row in pandas

Question

Let's say I have a pd.DataFrame that looks like this:

id  col1_a  col1_b  col2_a  col2_b
1   x       x       2       3  
2   z       d       4       5
3   y       y       9       9
4   p       p       8       1

This DataFrame represents a column-by-column comparison of two DataFrames (df_a, df_b).

I am trying to get a DataFrame that finds and reports the columns containing differences, like this:

id  col1_a  col1_b  col2_a  col2_b   diff
1   x       x       2       3        col2
2   z       d       4       5        col1,col2
3   y       y       9       9        None
4   p       p       8       1        col2

How can I achieve something like this without doubly traversing the columns and rows?

I know I can achieve this by doing something like the following (not tested):

for col_ptr1 in df.columns:
   for col_ptr2 in df.columns:
      for idx, row in df.iterrows():
         if col_ptr1.strip('_a') == col_ptr2.strip('_b'):
            blah blah blah...

This is super ugly. I wonder if there is a more pandas-style approach to this.

Answer 1

Select the subset of paired columns (those whose names end in _a or _b), then split these column names on the delimiter _ and take the first component of each split using the str accessor.

Now group the DataFrame by the prefix extracted in the previous step and aggregate with nunique along axis=1 to count the unique values in each group. A count other than one means the pair differs; for those, use DataFrame.dot to append the corresponding prefix names to the diff column.

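# keep only the paired columns, i.e. those ending in _a or _b
c = df.filter(regex=r'_(a|b)$')
# True where a prefix group does not hold exactly one distinct value in the row
m = c.groupby(c.columns.str.split('_').str[0], axis=1).nunique().ne(1)
# concatenate the flagged prefixes per row and drop the trailing comma
df['diff'] = m.dot(m.columns + ',').str[:-1]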

   id col1_a col1_b  col2_a  col2_b       diff
0   1      x      x       2       3       col2
1   2      z      d       4       5  col1,col2
2   3      y      y       9       9           
3   4      p      p       8       1       col2
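
Note that pandas 2.1 and later deprecate the axis=1 argument of groupby. A minimal sketch of the same nunique idea that avoids it, assuming the example data from the question: transpose the frame, group the rows by prefix, count distinct values, and transpose back. The names prefixes and names are illustrative only, and the final join uses a plain list comprehension instead of dot.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "col1_a": ["x", "z", "y", "p"],
    "col1_b": ["x", "d", "y", "p"],
    "col2_a": [2, 4, 9, 8],
    "col2_b": [3, 5, 9, 1],
})

# Keep only the paired columns (those ending in _a or _b).
c = df.filter(regex=r"_(a|b)$")

# Prefix of each column, e.g. "col1" for "col1_a" and "col1_b".
prefixes = c.columns.str.split("_").str[0]

# Transpose so the paired columns become rows, group those rows by prefix,
# count distinct values per original row, then transpose back.
m = c.T.groupby(prefixes).nunique().T.ne(1)

# Join the names of the flagged prefixes for each row.
names = m.columns.to_numpy()
df["diff"] = [",".join(names[row]) for row in m.to_numpy()]

print(df)
```

As in the answer above, rows without any difference end up with an empty string rather than None.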

Answer 2

Here is another way, using groupby on axis=1 to create common groups, then comparing each column in a group with the group's second (last) column and collecting the column name where they don't match:

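import pandas as pd

u = df.set_index("id")

# prefix of each column, e.g. "col1" for both "col1_a" and "col1_b"
cols = u.columns.str.split("_").str[0]
# within each prefix group, flag the cells that differ from the group's last column
l = (g.ne(g.iloc[:,-1],axis=0) for i,g in u.groupby(cols,axis=1))

# join the flagged prefixes per id, strip the trailing comma, and map back onto df
df['diff_'] = df['id'].map(pd.concat(l,axis=1).dot(cols+',').str[:-1])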

print(df)

   id col1_a col1_b  col2_a  col2_b      diff_
0   1      x      x       2       3       col2
1   2      z      d       4       5  col1,col2
2   3      y      y       9       9           
3   4      p      p       8       1       col2
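
When every column has exactly one _a and one _b partner, the same comparison can also be written without groupby at all. The following is a sketch under that assumption, using the example data from the question; left, right, mask, and names are illustrative names. Be aware that missing values compare as unequal under ne, so a pair of NaNs would be reported as a difference.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "col1_a": ["x", "z", "y", "p"],
    "col1_b": ["x", "d", "y", "p"],
    "col2_a": [2, 4, 9, 8],
    "col2_b": [3, 5, 9, 1],
})

# Split into the two sides and rename both to the shared prefix
# by dropping the two-character suffix ("_a" or "_b").
left = df.filter(regex=r"_a$").rename(columns=lambda name: name[:-2])
right = df.filter(regex=r"_b$").rename(columns=lambda name: name[:-2])

# Element-wise "not equal" between the two sides; True marks a difference.
mask = left.ne(right[left.columns])

# Join the names of the differing prefixes; an empty join is replaced by None.
names = mask.columns.to_numpy()
df["diff"] = [",".join(names[row]) or None for row in mask.to_numpy()]

print(df)
```

Replacing the empty string with None brings the result closer to the output requested in the question, where rows without differences carry a missing value.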

Answer 3

I suppose one answer could be this:

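# Pair each "_a" column with its "_b" counterpart. str.strip removes characters,
# not a suffix, so match on the suffix explicitly.
cols = [(x, x[:-2] + '_b') for x in df.columns
        if x.endswith('_a') and x[:-2] + '_b' in df.columns]

diffs = []
for idx, row in df.iterrows():
    # collect the prefixes of the pairs whose values differ in this row
    found = [a[:-2] for a, b in cols if row[a] != row[b]]
    diffs.append(','.join(found) if found else None)

df['diff'] = diffs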

Answer 4

You can use the apply function, which is generally faster than iterating with iterrows, although it still processes one row at a time. This code can easily be generalized to many column pairs.

def find_diff(x):
    res = ''
    if (x['col1_a'] != x['col1_b']):
        res += 'col1'
    if (x['col2_a'] != x['col2_b']):
        res += 'col2' if len(res) == 0 else ',col2'
    return None if res == '' else res

df['diff'] = df.apply(find_diff, axis=1)

More information about the apply function can be found in the pandas documentation for DataFrame.apply.
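
One possible way to generalize this to many column pairs is to derive the prefixes from the column names and loop over them inside the function, instead of hard-coding col1 and col2. This is only a sketch: make_find_diff and its parameters are hypothetical names introduced for this example, and it assumes every _a column has a matching _b column.

```python
import pandas as pd

def make_find_diff(prefixes, suffixes=("_a", "_b")):
    """Build a row function that reports which prefixes differ (hypothetical helper)."""
    left, right = suffixes

    def find_diff(row):
        # Collect every prefix whose two columns hold different values in this row.
        changed = [p for p in prefixes if row[p + left] != row[p + right]]
        return ",".join(changed) if changed else None

    return find_diff

# Example data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "col1_a": ["x", "z", "y", "p"],
    "col1_b": ["x", "d", "y", "p"],
    "col2_a": [2, 4, 9, 8],
    "col2_b": [3, 5, 9, 1],
})

# Derive the prefixes from the columns ending in "_a", keeping column order.
prefixes = [name[:-2] for name in df.columns if name.endswith("_a")]

df["diff"] = df.apply(make_find_diff(prefixes), axis=1)

print(df)
```

Adding a col3_a/col3_b pair to the frame would then require no change to the function, since the prefixes are recomputed from the columns.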

Answer 5

You can use pandas.DataFrame.apply:

import pandas as pd

def get_diff(x):
    if x.col1_a == x.col1_b and x.col2_a == x.col2_b:
        return None
    if x.col1_a != x.col1_b and x.col2_a != x.col2_b:
        return 'col1,col2'
    if x.col1_a != x.col1_b:
        return 'col1'
    if x.col2_a != x.col2_b:
        return 'col2'

df['diff'] = df.apply(get_diff, axis=1)

5 answers

Answer 1
6 accepted 2021-05-13 15:44:26

Answer 2
4 2021-05-13 15:46:49

Answer 3
0 2021-05-13 15:37:51

Answer 4
0 2021-05-13 15:42:19

Answer 5
0 2021-05-13 15:43:44