Pandas Data Frame - Merge Two Data Frames based on “InStr” > 0
I have two DataFrames in Python Pandas.
Their columns are as follows:
DF1
- DatabaseId Integer
- DatabaseName String
DF2
- CreateString String
I want to attach the DatabaseId column to every record in DF2 whose CreateString contains the DF1.DatabaseName value.
Example:
DatabaseName = "UserDB" CreateString = "This create string would fail"
DatabaseName = "UserDB" CreateString = "This create string has UserDB in it"
The first record would fail and would not be included in the result set. The second record would succeed and would be in the result set.
I've researched a variety of options, including .isin and .contains, but these have not worked. This seems to be a 'controlled' Cartesian join with an 'include only if a match is found' condition, but I haven't been able to find a way to do this, let alone efficiently.
Each of the two lists that needs to be evaluated contains between 100K and 500K rows.
UPDATE: Added more example data:
>>> DF1.head(10)
DatabaseID DatabaseName
0 DB1
1 DB2
2 DB3
3 DB4
...
>>> DF2.head(10)
CreateString
None
None
None
CREATE VIEW DB1.Table1 AS LOC…
None
REPLACE VIEW DB3.Table3...
CREATE VIEW DB3.Table10 AS SELE...
CREATE VIEW DB55.Table999 AS SELEC...
...
Desired Result
DatabaseID DatabaseName CreateText
0 DB1 CREATE VIEW DB1.Table1 AS LOC…
2 DB3 REPLACE VIEW DB3.Table3...
2 DB3 CREATE VIEW DB3.Table10 AS SELE...
...
etc...
...
UPDATE: how to parse the table name:
In [100]: df2['TableName'] = df2.CreateString.str.extract(r'\s+(\w+\.\w+)\s+', expand=True)
In [101]: df2
Out[101]:
CreateString DatabaseName TableName
0 None NaN NaN
1 None NaN NaN
2 None NaN NaN
3 CREATE VIEW DB1.Table1 AS LOC… DB1 DB1.Table1
4 None NaN NaN
5 REPLACE VIEW DB3.Table3 ... DB3 DB3.Table3
6 CREATE VIEW DB3.Table10 AS SELE... DB3 DB3.Table10
7 CREATE VIEW DB55.Table999 AS SELEC... DB55 DB55.Table999
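As a possible variation on the parsing step above, both pieces can be pulled out in one pass with named capture groups. This is only a sketch: the sample strings are invented for illustration (the real CreateString values are truncated in the post), and the group name `Table` is an assumption, not something from the original session.

```python
import pandas as pd

# Invented sample data standing in for the truncated CreateString values.
df2 = pd.DataFrame({
    "CreateString": [
        None,
        "CREATE VIEW DB1.Table1 AS SELECT 1",
        "REPLACE VIEW DB3.Table3 AS SELECT 1",
    ]
})

# Named groups become column names in the DataFrame returned by str.extract.
pattern = r"\s+(?P<DatabaseName>\w+)\.(?P<Table>\w+)\s+"
parts = df2["CreateString"].str.extract(pattern)

# Attach the extracted columns; rows with no match (or None) get NaN.
df2 = df2.join(parts)
print(df2)
```

Here `Table` holds only the part after the dot; the qualified name shown in the output above could be rebuilt as `df2.DatabaseName + '.' + df2.Table` if it is needed.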
Original answer:
You can do it this way:
In [83]: df2['DatabaseName'] = df2.CreateString.str.extract(r'\s+(\w+)\.\w+\s+', expand=True)
In [84]: pd.merge(df2, df1, on='DatabaseName', how='left')
Out[84]:
CreateString DatabaseName DatabaseID
0 None NaN NaN
1 None NaN NaN
2 None NaN NaN
3 CREATE VIEW DB1.Table1 AS LOC… DB1 0.0
4 None NaN NaN
5 REPLACE VIEW DB3.Table3 ... DB3 2.0
6 CREATE VIEW DB3.Table10 AS SELE... DB3 2.0
7 CREATE VIEW DB55.Table999 AS SELEC... DB55 NaN
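The left join above keeps the unmatched rows and turns DatabaseID into floats because of the NaN values. If only matching rows are wanted, as in the desired result from the question, an inner join may be closer. The following is a minimal sketch under assumptions: the CreateString values are invented stand-ins for the truncated ones, and it relies on the same "Database.Table" naming pattern as the extract step above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "DatabaseID": [0, 1, 2, 3],
    "DatabaseName": ["DB1", "DB2", "DB3", "DB4"],
})

# Invented sample strings; the originals are truncated in the post.
df2 = pd.DataFrame({
    "CreateString": [
        None,
        "CREATE VIEW DB1.Table1 AS SELECT 1",
        None,
        "REPLACE VIEW DB3.Table3 AS SELECT 1",
        "CREATE VIEW DB3.Table10 AS SELECT 1",
        "CREATE VIEW DB55.Table999 AS SELECT 1",
    ]
})

# Same extraction as in the answer; expand=False returns a Series.
df2["DatabaseName"] = df2["CreateString"].str.extract(
    r"\s+(\w+)\.\w+\s+", expand=False
)

# Drop rows with no parsed name, then keep only names that exist in df1.
matched = df2.dropna(subset=["DatabaseName"])
result = pd.merge(df1, matched, on="DatabaseName", how="inner")
print(result)
```

Because the match goes through an ordinary equality join on the extracted name, this avoids comparing every DatabaseName against every CreateString, which matters at the 100K to 500K row sizes mentioned in the question. It will not find a name that appears somewhere other than in a `name.table` token.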