Highlight column-by-column differences for each row in pandas
Let's say I have a pd.DataFrame that looks like this:
id col1_a col1_b col2_a col2_b
1 x x 2 3
2 z d 4 5
3 y y 9 9
4 p p 8 1
This dataframe represents a column-by-column comparison of two dataframes (df_a, df_b).
I am trying to get a dataframe that finds and highlights the columns that contain differences, like this:
id col1_a col1_b col2_a col2_b diff
1 x x 2 3 col2
2 z d 4 5 col1,col2
3 y y 9 9 None
4 p p 8 1 col2
How can I achieve something like this without having to doubly traverse the columns and rows?
I know I can achieve this by doing something like the following (not tested):
for col_ptr1 in df.columns:
    for col_ptr2 in df.columns:
        for idx, row in df.iterrows():
            if col_ptr1.strip('_a') == col_ptr2.strip('_b'):
                pass  # blah blah blah...
This is super ugly. I wonder if there is a more pandas-style approach to this.
Select the subset of columns containing col, then split these column names around the delimiter _ and extract the first component of the split using the str accessor.
Now, group the dataframe on the col prefix extracted in the previous step, and aggregate using nunique along axis=1 to count the unique values. Wherever the count is not equal to one, add the corresponding column name to the diff column using DataFrame.dot.
c = df.filter(regex=r'_(a|b)$')
m = c.groupby(c.columns.str.split('_').str[0], axis=1).nunique().ne(1)
df['diff'] = m.dot(m.columns + ',').str[:-1]
id col1_a col1_b col2_a col2_b diff
0 1 x x 2 3 col2
1 2 z d 4 5 col1,col2
2 3 y y 9 9
3 4 p p 8 1 col2
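The axis=1 argument of DataFrame.groupby used above is deprecated in recent pandas releases. The following is a minimal sketch of the same idea that groups the transposed frame instead; the sample frame is rebuilt here so the snippet runs on its own, and the transposition is an assumption on my part rather than part of the original answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "col1_a": ["x", "z", "y", "p"],
    "col1_b": ["x", "d", "y", "p"],
    "col2_a": [2, 4, 9, 8],
    "col2_b": [3, 5, 9, 1],
})

# Keep only the columns that end in _a or _b
c = df.filter(regex=r"_(a|b)$")

# Transpose so the column names become the index, group the rows by prefix
# ('col1_a' -> 'col1'), count distinct values per group, then transpose back
m = c.T.groupby(lambda name: name.split("_")[0]).nunique().T.ne(1)

# Concatenate the names of the prefixes whose values differ
df["diff"] = m.dot(m.columns + ",").str[:-1]
print(df)
```

As in the answer above, rows without any difference end up with an empty string rather than None.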
Here is another way: use groupby on axis=1 to create common groups, then compare each group with its second column and get the column name where they don't match:
u = df.set_index("id")
cols = u.columns.str.split("_").str[0]
l = (g.ne(g.iloc[:,-1],axis=0) for i,g in u.groupby(cols,axis=1))
df['diff_'] = df['id'].map(pd.concat(l,axis=1).dot(cols+',').str[:-1])
print(df)
id col1_a col1_b col2_a col2_b diff_
0 1 x x 2 3 col2
1 2 z d 4 5 col1,col2
2 3 y y 9 9
3 4 p p 8 1 col2
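When every prefix has exactly one _a and one _b column, the comparison could also be sketched without groupby by comparing the two halves directly. This assumes each *_a column has a matching *_b column; the names a, b, prefixes and mask are illustrative and not from the answer above.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "col1_a": ["x", "z", "y", "p"],
    "col1_b": ["x", "d", "y", "p"],
    "col2_a": [2, 4, 9, 8],
    "col2_b": [3, 5, 9, 1],
})

# Assumption: every *_a column has a *_b counterpart with the same prefix
a = df.filter(regex=r"_a$")
b = df[[name[:-2] + "_b" for name in a.columns]]
prefixes = a.columns.str[:-2]  # 'col1_a' -> 'col1'

# Compare the underlying arrays element-wise so the differing labels do not matter
mask = pd.DataFrame(a.to_numpy() != b.to_numpy(), index=df.index, columns=prefixes)

# Join the prefixes that differ in each row
df["diff_"] = mask.dot(mask.columns + ",").str[:-1]
print(df)
```

Note that a plain != treats two missing values (NaN) as different, so frames with missing data would likely need extra handling.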
I suppose an answer can be this:
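# Pair each *_a column with its *_b counterpart
# (str.strip removes characters, not a suffix, so slice the name instead)
cols = [(x, x[:-2] + '_b') for x in df.columns if x.endswith('_a')]
diffs = []
for idx, row in df.iterrows():
    for a, b in cols:
        if row[a] != row[b]:
            # difference found! record the row index and the column prefix
            diffs.append((idx, a[:-2]))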
You can use the apply function, which is also typically faster than iterating over the rows directly. This code can easily be generalized to many column pairs.
def find_diff(x):
    res = ''
    if x['col1_a'] != x['col1_b']:
        res += 'col1'
    if x['col2_a'] != x['col2_b']:
        res += 'col2' if len(res) == 0 else ',col2'
    return None if res == '' else res

df['diff'] = df.apply(find_diff, axis=1)
More information about the apply function can be found in the pandas documentation.
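The generalization to many column pairs mentioned in this answer could look roughly like the sketch below. It assumes every pair follows the <prefix>_a / <prefix>_b naming used in the question; the prefixes list is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "col1_a": ["x", "z", "y", "p"],
    "col1_b": ["x", "d", "y", "p"],
    "col2_a": [2, 4, 9, 8],
    "col2_b": [3, 5, 9, 1],
})

# Hypothetical helper: derive the prefixes from the *_a column names
prefixes = [name[:-2] for name in df.columns if name.endswith("_a")]

def find_diff(row):
    # Collect every prefix whose _a and _b values differ in this row
    changed = [p for p in prefixes if row[f"{p}_a"] != row[f"{p}_b"]]
    return ",".join(changed) if changed else None

df["diff"] = df.apply(find_diff, axis=1)
print(df)
```

Adding a col3_a / col3_b pair to the frame would then need no change to find_diff.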
You can use pandas.DataFrame.apply:
import pandas as pd

def get_diff(x):
    if x.col1_a == x.col1_b and x.col2_a == x.col2_b:
        return None
    if x.col1_a != x.col1_b and x.col2_a != x.col2_b:
        return 'col1,col2'
    if x.col1_a != x.col1_b:
        return 'col1'
    if x.col2_a != x.col2_b:
        return 'col2'

df['diff'] = df.apply(get_diff, axis=1)