Pandas: Compare two dataframes and determine the matched values
I have the following dataframes:
print(dfa)
ID                            Value
AA12 101 BB101 CC01 DE06      1
AA11 102 BB101 CC01 234 EE07  2
AA10 202 BB101 CC01 345 EE09  3
AA13 103 BB101 CC02 123       4
AA14 203 BB101 CC02 456       5
AA15 204 BB102 CC03 567       6
print(dfb)
ID                            Value
AA10 202 BB101 CC01 EE09 345  3
AA11 102 BB101 CC01 EE07 234  2
AA12 101 BB101 CC01 DE06      1
AA13 103 BB101 CC02 123       4
AA18 203 BB103 CC01 456       5
AA15 204 BB201 CC11 678       7
I would like to compare the strings in (dfa.ID, dfa.Value) to the ones in (dfb.ID, dfb.Value). If they match exactly (even when the order of the substrings is not identical), I would like to print 'Yes' in new 'ID Matched?' and 'Value Matched?' columns in dataframe 'dfa'.
Desired output would be:
ID                            Value  ID Matched?  Value Matched?
AA12 101 BB101 CC01 DE06      1      Yes          Yes
AA11 102 BB101 CC01 234 EE07  2      Yes          Yes
AA10 202 BB101 CC01 345 EE09  3      Yes          Yes
AA13 103 BB101 CC02 123       4      Yes          Yes
AA14 203 BB101 CC02 456       5      No           Yes
AA15 204 BB102 CC03 567       6      No           No
You can do something similar to this (here `a` and `b` stand for `dfa` and `dfb`):
In [40]: pd.merge(a.assign(x=a.ID.str.split().apply(sorted).str.join(' ')),
...: b.assign(x=b.ID.str.split().apply(sorted).str.join(' ')),
...: on=['x','Value'],
...: how='outer',
...: indicator=True)
...:
Out[40]:
   ID_x                          Value  x \
0  AA12 101 BB101 CC01 DE06      1      101 AA12 BB101 CC01 DE06
1  AA11 102 BB101 CC01 234 EE07  2      102 234 AA11 BB101 CC01 EE07
2  AA10 202 BB101 CC01 345 EE09  3      202 345 AA10 BB101 CC01 EE09
3  AA13 103 BB101 CC02 123       4      103 123 AA13 BB101 CC02
4  AA14 203 BB101 CC02 456       5      203 456 AA14 BB101 CC02
5  AA15 204 BB102 CC03 567       6      204 567 AA15 BB102 CC03
6  NaN                           5      203 456 AA18 BB103 CC01
7  NaN                           7      204 678 AA15 BB201 CC11

   ID_y                          _merge
0  AA12 101 BB101 CC01 DE06      both
1  AA11 102 BB101 CC01 EE07 234  both
2  AA10 202 BB101 CC01 EE09 345  both
3  AA13 103 BB101 CC02 123       both
4  NaN                           left_only
5  NaN                           left_only
6  AA18 203 BB103 CC01 456       right_only
7  AA15 204 BB201 CC11 678       right_only
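The `_merge` indicator can be mapped back onto `dfa` as a single Yes/No flag. The following is only a sketch under a few assumptions: the frames are rebuilt by hand from the question's data, a left merge is used so that only rows of `dfa` survive, and the names `normalize_id` and `Pair Matched?` are made up for illustration. Note that it flags rows where the (ID, Value) pair matches jointly, which is what the merge above tests.

```python
import pandas as pd

# Frames rebuilt by hand from the question's printed data.
dfa = pd.DataFrame({
    "ID": [
        "AA12 101 BB101 CC01 DE06",
        "AA11 102 BB101 CC01 234 EE07",
        "AA10 202 BB101 CC01 345 EE09",
        "AA13 103 BB101 CC02 123",
        "AA14 203 BB101 CC02 456",
        "AA15 204 BB102 CC03 567",
    ],
    "Value": [1, 2, 3, 4, 5, 6],
})
dfb = pd.DataFrame({
    "ID": [
        "AA10 202 BB101 CC01 EE09 345",
        "AA11 102 BB101 CC01 EE07 234",
        "AA12 101 BB101 CC01 DE06",
        "AA13 103 BB101 CC02 123",
        "AA18 203 BB103 CC01 456",
        "AA15 204 BB201 CC11 678",
    ],
    "Value": [3, 2, 1, 4, 5, 7],
})


def normalize_id(ids):
    """Hypothetical helper: sort the whitespace-separated tokens of each ID."""
    return ids.str.split().apply(sorted).str.join(" ")


left = dfa.assign(x=normalize_id(dfa["ID"]))
# Keep only the join keys on the right and drop duplicates,
# so the left merge cannot multiply rows of dfa.
right = dfb.assign(x=normalize_id(dfb["ID"]))[["x", "Value"]].drop_duplicates()

merged = left.merge(right, on=["x", "Value"], how="left", indicator=True)

# "both" means the (normalized ID, Value) pair also exists in dfb.
pair_matched = (merged["_merge"] == "both").map({True: "Yes", False: "No"})
result = dfa.assign(**{"Pair Matched?": pair_matched.to_numpy()})
print(result)
```

Because the right-hand keys are de-duplicated, `merged` has one row per row of `dfa`, in the same order, so the flag can be attached positionally.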
Explanation:
In [43]: a.ID.str.split()
Out[43]:
0 [AA12, 101, BB101, CC01, DE06]
1 [AA11, 102, BB101, CC01, 234, EE07]
2 [AA10, 202, BB101, CC01, 345, EE09]
3 [AA13, 103, BB101, CC02, 123]
4 [AA14, 203, BB101, CC02, 456]
5 [AA15, 204, BB102, CC03, 567]
Name: ID, dtype: object
In [44]: a.ID.str.split().apply(sorted)
Out[44]:
0 [101, AA12, BB101, CC01, DE06]
1 [102, 234, AA11, BB101, CC01, EE07]
2 [202, 345, AA10, BB101, CC01, EE09]
3 [103, 123, AA13, BB101, CC02]
4 [203, 456, AA14, BB101, CC02]
5 [204, 567, AA15, BB102, CC03]
Name: ID, dtype: object
In [45]: a.assign(x=a.ID.str.split().apply(sorted).str.join(' '))
Out[45]:
   ID                            Value  x
0  AA12 101 BB101 CC01 DE06      1      101 AA12 BB101 CC01 DE06
1  AA11 102 BB101 CC01 234 EE07  2      102 234 AA11 BB101 CC01 EE07
2  AA10 202 BB101 CC01 345 EE09  3      202 345 AA10 BB101 CC01 EE09
3  AA13 103 BB101 CC02 123       4      103 123 AA13 BB101 CC02
4  AA14 203 BB101 CC02 456       5      203 456 AA14 BB101 CC02
5  AA15 204 BB102 CC03 567       6      204 567 AA15 BB102 CC03
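If you want the two separate 'ID Matched?' and 'Value Matched?' columns from the question, the same sorted-token key can be combined with `Series.isin`. This is a sketch, not necessarily what the asker intended: it assumes each column is checked independently against `dfb` (which is how I read the row with Value 5 in the desired output), and `sorted_tokens` is a hypothetical helper name. The frames are again rebuilt by hand from the question.

```python
import pandas as pd

# Frames rebuilt by hand from the question's printed data.
dfa = pd.DataFrame({
    "ID": [
        "AA12 101 BB101 CC01 DE06",
        "AA11 102 BB101 CC01 234 EE07",
        "AA10 202 BB101 CC01 345 EE09",
        "AA13 103 BB101 CC02 123",
        "AA14 203 BB101 CC02 456",
        "AA15 204 BB102 CC03 567",
    ],
    "Value": [1, 2, 3, 4, 5, 6],
})
dfb = pd.DataFrame({
    "ID": [
        "AA10 202 BB101 CC01 EE09 345",
        "AA11 102 BB101 CC01 EE07 234",
        "AA12 101 BB101 CC01 DE06",
        "AA13 103 BB101 CC02 123",
        "AA18 203 BB103 CC01 456",
        "AA15 204 BB201 CC11 678",
    ],
    "Value": [3, 2, 1, 4, 5, 7],
})


def sorted_tokens(ids):
    """Hypothetical helper: order-independent key built from the ID tokens."""
    return ids.str.split().apply(sorted).str.join(" ")


# Each column is tested on its own (assumption, see the note above).
id_hit = sorted_tokens(dfa["ID"]).isin(sorted_tokens(dfb["ID"]))
value_hit = dfa["Value"].isin(dfb["Value"])

yes_no = {True: "Yes", False: "No"}
out = dfa.assign(**{
    "ID Matched?": id_hit.map(yes_no),
    "Value Matched?": value_hit.map(yes_no),
})
print(out)
```

Unlike the merge, this does not require the ID and the Value to match in the same row of `dfb`; use the merge-based approach if you need that stricter check.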