Group and Compare multiple dataframe columns with conditions in Python
I have two dataframes, as shown below:
df1:
ID col1 col2
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
5 A5 B5
6 A6 B6
df2:
col1 col2
A1 B1
A2 O5
H3 B3
A4 B4
A5 66
A6 C6
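Expected result: I want to generate a result df based on one condition: every value in col1 and col2 of df1 should exist among the col1 and col2 values of df2.
Expected result df: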
ID col1 col2 Error
1 A1 B1 No mismatch with df2
2 A2 B2 col2 mismatch with df2
3 A3 B3 col1 mismatch with df2
4 A4 B4 No mismatch with df2
5 A5 B5 col2 mismatch with df2
6 A6 B6 col2 mismatch with df2
Something like this should do the trick, though there may be a simpler way.
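import pandas as pd

# Compare only the columns both frames share; df2 has no ID column.
cols = ['col1', 'col2']
diff = pd.concat([df1[col] == df2[col] for col in cols], axis=1)

def m(row):
    # Collect the names of the columns whose values differ in this row.
    mismatches = []
    for col in diff.columns:
        if not row[col]:
            mismatches.append(col)
    if mismatches == []:
        return 'No mismatch'
    return 'Mismatches: ' + ', '.join(mismatches)

df1['Error'] = diff.apply(m, axis=1)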
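For reference, here is a self-contained sketch of the row-wise approach above, with the sample frames rebuilt from the question. The helper name `describe_row` and the `compare_cols` list are illustrative assumptions, and the message wording is adjusted to match the expected output in the question. The comparison is restricted to col1 and col2 because df2 has no ID column.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "col1": ["A1", "A2", "A3", "A4", "A5", "A6"],
    "col2": ["B1", "B2", "B3", "B4", "B5", "B6"],
})
df2 = pd.DataFrame({
    "col1": ["A1", "A2", "H3", "A4", "A5", "A6"],
    "col2": ["B1", "O5", "B3", "B4", "66", "C6"],
})

# Assumption: only these shared columns are compared.
compare_cols = ["col1", "col2"]

# Row-by-row equality; this relies on both frames having the same index.
diff = pd.concat([df1[c] == df2[c] for c in compare_cols], axis=1)

def describe_row(row):
    """Return a message naming the columns that differ in this row."""
    mismatches = [c for c in compare_cols if not row[c]]
    if not mismatches:
        return "No mismatch with df2"
    return ", ".join(mismatches) + " mismatch with df2"

# assign() returns a new frame and leaves df1 unchanged.
result = df1.assign(Error=diff.apply(describe_row, axis=1))
print(result)
```

This version compares values at the same row position, so it only gives meaningful results when df1 and df2 are in the same row order.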
Create a helper DataFrame of mismatch flags with a dictionary comprehension and isin:
import pandas as pd

# True where the df1 value does not appear anywhere in the same df2 column.
m = pd.DataFrame({c: ~df1[c].isin(df2[c]) for c in ['col1','col2']})
print(m)
col1 col2
0 False False
1 False True
2 True False
3 False False
4 False True
5 False True
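Then use numpy.where, testing the mask with any for at least one True per row, and use dot for matrix multiplication with the column names to get the names of the mismatched columns: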
import numpy as np

df1['Error'] = np.where(m.any(axis=1),
                        m.dot(m.columns + ', ').str.rstrip(', ') + ' mismatch with df2',
                        'No mismatch with df2')
print(df1)
ID col1 col2 Error
0 1 A1 B1 No mismatch with df2
1 2 A2 B2 col2 mismatch with df2
2 3 A3 B3 col1 mismatch with df2
3 4 A4 B4 No mismatch with df2
4 5 A5 B5 col2 mismatch with df2
5 6 A6 B6 col2 mismatch with df2
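Note that isin tests membership anywhere in the df2 column, while == compares values at the same row position. The two agree on the sample data but can differ when the rows of df2 are in another order. The sketch below illustrates this under stated assumptions: the function name `flag_mismatches` and the reversed copy `df2_reversed` are hypothetical, introduced only for the demonstration.

```python
import numpy as np
import pandas as pd

def flag_mismatches(left, right, cols):
    """Label each row of `left` with the columns whose value is absent from `right`."""
    # True where the value in `left` is not found anywhere in the same column of `right`.
    m = pd.DataFrame({c: ~left[c].isin(right[c]) for c in cols})
    error = np.where(
        m.any(axis=1),
        m.dot(m.columns + ", ").str.rstrip(", ") + " mismatch with df2",
        "No mismatch with df2",
    )
    return left.assign(Error=error)

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "col1": ["A1", "A2", "A3", "A4", "A5", "A6"],
    "col2": ["B1", "B2", "B3", "B4", "B5", "B6"],
})
df2 = pd.DataFrame({
    "col1": ["A1", "A2", "H3", "A4", "A5", "A6"],
    "col2": ["B1", "O5", "B3", "B4", "66", "C6"],
})
cols = ["col1", "col2"]

# Same rows as df2, in reverse order (hypothetical variation for illustration).
df2_reversed = df2.iloc[::-1].reset_index(drop=True)

# Membership test: unaffected by the row order of df2.
by_membership = flag_mismatches(df1, df2_reversed, cols)
print(by_membership)

# Positional test: compares row i of df1 with row i of the reversed frame.
positional_equal = pd.concat([df1[c] == df2_reversed[c] for c in cols], axis=1)
print(positional_equal)
```

Which behaviour is wanted depends on the question being asked: the wording "should exist in df2's values" suggests membership, whereas a row-by-row audit would call for the positional comparison.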