Highlight column-by-column differences for each row in pandas

Let's say I have a pd.DataFrame that looks like this:

id  col1_a  col1_b  col2_a  col2_b
1   x       x       2       3  
2   z       d       4       5
3   y       y       9       9
4   p       p       8       1
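
To reproduce the example, the frame above can be built roughly like this. This is a minimal sketch: the dtypes are an assumption, since the question only shows the printed table.

```python
import pandas as pd

# Example data copied from the table in the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "col1_a": ["x", "z", "y", "p"],
    "col1_b": ["x", "d", "y", "p"],
    "col2_a": [2, 4, 9, 8],
    "col2_b": [3, 5, 9, 1],
})
print(df)
```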

This DataFrame represents a column-by-column comparison of two DataFrames (df_a, df_b).

I am trying to get a DataFrame that identifies, for each row, the columns that contain differences, like this:

id  col1_a  col1_b  col2_a  col2_b   diff
1   x       x       2       3        col2
2   z       d       4       5        col1,col2
3   y       y       9       9        None
4   p       p       8       1        col2

How can I achieve something like this without a nested traversal of the columns and rows?

I know I can achieve this by doing something like the following (not tested):

for col_ptr1 in df.columns:
   for col_ptr2 in df.columns:
      for idx, row in df.iterrows():
         if col_ptr1.strip('_a') == col_ptr2.strip('_b'):
            blah blah blah...

This is super ugly. I wonder if there is a more pandas-style approach to this.

Select the subset of columns whose names end in _a or _b, then split these column names on the delimiter _ and extract the first component of each split using the str accessor.

Now group the DataFrame on the column prefix extracted in the previous step and aggregate with nunique along axis=1 to count the unique values in each group. Where that count is not equal to one, the pair differs, so add the corresponding prefix to the diff column using DataFrame.dot.

c = df.filter(regex=r'_(a|b)$')
m = c.groupby(c.columns.str.split('_').str[0], axis=1).nunique().ne(1)
df['diff'] = m.dot(m.columns + ',').str[:-1]

   id col1_a col1_b  col2_a  col2_b       diff
0   1      x      x       2       3       col2
1   2      z      d       4       5  col1,col2
2   3      y      y       9       9           
3   4      p      p       8       1       col2
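
The dot step works because multiplying a string by a boolean repeats it one or zero times, and adding strings concatenates them. Roughly, this is what happens for a single row of the mask, shown here in plain Python (the names row_mask, labels and result are illustrative, not part of the answer):

```python
# One row of the boolean mask m, and the labels it is multiplied with.
row_mask = [True, True]        # the row with id 2: col1 and col2 both differ
labels = ["col1,", "col2,"]    # corresponds to m.columns + ','

# True * 'col1,' == 'col1,' and False * 'col1,' == '', so the "dot product"
# is the concatenation of the labels whose flag is True.
joined = "".join(flag * label for flag, label in zip(row_mask, labels))

result = joined[:-1]           # drop the trailing comma, like .str[:-1]
print(result)                  # col1,col2
```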

Here is another way, using groupby on axis=1 to create the common groups, then comparing each group with its second (last) column and taking the column name where they don't match:

u = df.set_index("id")

cols = u.columns.str.split("_").str[0]
l = (g.ne(g.iloc[:,-1],axis=0) for i,g in u.groupby(cols,axis=1))

df['diff_'] = df['id'].map(pd.concat(l,axis=1).dot(cols+',').str[:-1])

print(df)

   id col1_a col1_b  col2_a  col2_b      diff_
0   1      x      x       2       3       col2
1   2      z      d       4       5  col1,col2
2   3      y      y       9       9           
3   4      p      p       8       1       col2
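
Note that groupby(..., axis=1) is deprecated since pandas 2.1, so the groupby-based approaches may warn or fail on newer versions. A sketch of the same idea without it, assuming every prefix has exactly one _a and one _b column (the sample frame and the names prefixes, mask and labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "col1_a": ["x", "z", "y", "p"],
    "col1_b": ["x", "d", "y", "p"],
    "col2_a": [2, 4, 9, 8],
    "col2_b": [3, 5, 9, 1],
})

# Collect the shared prefixes from the *_a columns, in column order.
prefixes = [c[:-2] for c in df.columns if c.endswith("_a")]

# One boolean column per prefix: True where the _a and _b values differ.
mask = pd.DataFrame({p: df[p + "_a"].ne(df[p + "_b"]) for p in prefixes})

# For each row, join the names of the prefixes that differ.
labels = [",".join(p for p, differs in zip(prefixes, row) if differs)
          for row in mask.to_numpy()]
df["diff"] = labels
print(df)
```

As with the answers above, rows without differences get an empty string. Keep in mind that ne treats two NaN values as different.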

I suppose an answer could be this:

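# Pair each *_a column with its *_b counterpart. Prefixes are compared by
# slicing, because str.strip('_a') removes characters, not a suffix.
cols = [(x, y) for x in df.columns for y in df.columns
        if x.endswith('_a') and y.endswith('_b') and x[:-2] == y[:-2]]

diffs = []
for idx, row in df.iterrows():
    for a, b in cols:
        if row[a] != row[b]:
            # difference found: record the row index and the column prefix
            diffs.append((idx, a[:-2]))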

You can use the apply function, which is usually faster than iterating over the rows directly. This code can easily be generalized to many column pairs.

def find_diff(x):
    res = ''
    if (x['col1_a'] != x['col1_b']):
        res += 'col1'
    if (x['col2_a'] != x['col2_b']):
        res += 'col2' if len(res) == 0 else ',col2'
    return None if res == '' else res

df['diff'] = df.apply(find_diff, axis=1)
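
One way to generalize this to many column pairs is to derive the prefixes from the column names instead of hard-coding them. A sketch, assuming the _a/_b suffix convention from the question (the helper make_find_diff and the sample frame are hypothetical):

```python
import pandas as pd

def make_find_diff(prefixes):
    """Build a row-wise function that reports which prefixes differ."""
    def find_diff(row):
        changed = [p for p in prefixes if row[p + "_a"] != row[p + "_b"]]
        # Same convention as above: None when nothing differs.
        return ",".join(changed) if changed else None
    return find_diff

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "col1_a": ["x", "z", "y", "p"],
    "col1_b": ["x", "d", "y", "p"],
    "col2_a": [2, 4, 9, 8],
    "col2_b": [3, 5, 9, 1],
})

# Every column ending in _a is assumed to have a matching _b column.
prefixes = [c[:-2] for c in df.columns if c.endswith("_a")]
result = df.apply(make_find_diff(prefixes), axis=1)
df["diff"] = result
print(df)
```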

More information about the apply function can be found here.

You can use pandas.DataFrame.apply:

import pandas as pd

def get_diff(x):
    if x.col1_a == x.col1_b and x.col2_a == x.col2_b:
        return None
    if x.col1_a != x.col1_b and x.col2_a != x.col2_b:
        return 'col1,col2'
    if x.col1_a != x.col1_b:
        return 'col1'
    if x.col2_a != x.col2_b:
        return 'col2'

df['diff'] = df.apply(get_diff, axis=1)
