How can I match values from different dataframes based on some conditions or function using pandas?
Suppose I have two dataframes as below.
import pandas as pd

raw_data = {
'name': ['Jason love you', 'Molly hope wish care', 'happy birthday', 'dog cat', 'tiger legend bird'],
'nationality': ['USA', 'USA', 'France', 'UK', 'UK']
}
raw_data_2 = {
'name_2': ['Jason you', 'Molly care wist', 'hapy birthday', 'dog', 'tiger bird'],
'nationality': ['USA', 'USA', 'France', 'UK', 'JK'],
'code': ['a', 'b','c','d','e']
}
df1 = pd.DataFrame(raw_data, columns = ['name', 'nationality'])
df2 = pd.DataFrame(raw_data_2, columns = ['name_2', 'nationality', 'code'])
What I want to do is match the two dataframes based on some conditions. The condition here is that name_2 from raw_data_2 is a subset of a value (name) from raw_data when the two names are split by spaces, and that the nationality values from raw_data_2 and raw_data are the same.
For easier understanding, here's an example: from raw_data_2, 'Jason you'.split(' ') = ['Jason', 'you'], so this is a subset of 'Jason love you'.split(' ') = ['Jason', 'love', 'you'].
But 'Molly care wist'.split(' ') is NOT a subset of 'Molly hope wish care'.split(' '), because the latter does not cover the former entirely.
'tiger bird'.split(' ') from raw_data_2 is a subset of 'tiger legend bird'.split(' '), but their nationalities are different.
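As a minimal sketch of the condition described above, the word-subset test can be written with plain Python sets. The helper name is_word_subset is hypothetical and is used here only for illustration; it checks the name part of the condition, not the nationality.

```python
def is_word_subset(short: str, full: str) -> bool:
    """Return True if every space-separated word of `short` also occurs in `full`."""
    # On sets, <= is the subset test (same as set.issubset)
    return set(short.split()) <= set(full.split())


print(is_word_subset('Jason you', 'Jason love you'))             # True
print(is_word_subset('Molly care wist', 'Molly hope wish care'))  # False ('wist' is missing)
print(is_word_subset('tiger bird', 'tiger legend bird'))          # True (nationality is checked separately)
```

Note that the comparison is case-sensitive, so 'Jason You' would not match 'Jason love you' unless both strings are lower-cased first.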
If the above conditions are met, then finally I want to assign the code value from raw_data_2. So the desired output (let's just take the codes) would be like:
'a' (matched), NaN (unmatched), NaN (unmatched), 'd' (matched), NaN (unmatched)
How can I do this using pandas? I guess this is not as simple as the 'isin' function or the 'map' function.
Use the <= operator to test for a subset. On Python sets, <= is the subset test, and pandas applies it element-wise to the two Series of sets:
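# Split each name into a set of words
name = df1.name.str.split().apply(set)
name2 = df2.name_2.str.split().apply(set)
# True where the words of name_2 are a subset of the words of name in the same row
cond1 = name2 <= name
# True where the nationality matches in the same row
cond2 = df1.nationality == df2.nationality
# Put the two frames side by side and keep only the rows that meet both conditions
pd.concat([df1, df2], axis=1, keys=['df1', 'df2']).loc[cond1 & cond2]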
              df1                    df2
             name nationality     name_2 nationality code
0  Jason love you         USA  Jason you         USA    a
3         dog cat          UK        dog          UK    d
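The filtered frame above keeps only the matching rows. To get the output asked for in the question (a code for every row of df1, with NaN where there is no match), one possible sketch is to mask df2's code column with the combined condition. This assumes, as the answer does, that the two frames are compared row by row and share the same index. The names is_subset and same_nationality are illustrative, and the subset test is spelled out with zip only to make the row-wise pairing explicit.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['Jason love you', 'Molly hope wish care', 'happy birthday', 'dog cat', 'tiger legend bird'],
    'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
})
df2 = pd.DataFrame({
    'name_2': ['Jason you', 'Molly care wist', 'hapy birthday', 'dog', 'tiger bird'],
    'nationality': ['USA', 'USA', 'France', 'UK', 'JK'],
    'code': ['a', 'b', 'c', 'd', 'e'],
})

# Turn each name into a set of words
name = df1['name'].str.split().apply(set)
name2 = df2['name_2'].str.split().apply(set)

# Row-wise subset test: words of name_2 must all appear in name
is_subset = pd.Series([b <= a for a, b in zip(name, name2)], index=df1.index)
# Row-wise nationality match
same_nationality = df1['nationality'] == df2['nationality']

# Keep the code where both conditions hold; everything else becomes missing
df1['code'] = df2['code'].where(is_subset & same_nationality)
print(df1)
```

With the sample data, rows 0 and 3 receive 'a' and 'd', and the other three rows are left as missing values.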