
Pandas Data Frame - Merge Two Data Frames based on “InStr” > 0

I have two DataFrames in Python Pandas.

The data stored in the cells is as follows:

DF1
- DatabaseId    Integer
- DatabaseName  String

DF2
- CreateString  String

I want to apply the column DatabaseId to any record in DF2 where DF1.DatabaseName occurs within CreateString.

Example:
DatabaseName = "UserDB"        CreateString = "This create string would fail"
DatabaseName = "UserDB"        CreateString = "This create string has UserDB in it"

The first record would fail and not be included in the resulting set. The second record would succeed and would be in the resulting set.

I've researched a variety of options, including .isin and .contains, but these have not worked. This seems to be a 'controlled' Cartesian join with an 'if match found, success' condition, but I haven't been able to find a way to do this efficiently.

Each list that needs to be evaluated contains between 100K and 500K entries.
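
For reference, the literal "InStr > 0" reading described above can be sketched on toy data as a cross join followed by a substring filter. This is only an illustration of the semantics, not a recommendation: the two frames below are hypothetical miniatures of DF1 and DF2 (the "SalesDB" row is invented for the example), `how="cross"` assumes pandas 1.2 or later, and the number of pairs grows as len(DF1) * len(DF2), which is impractical at 100K to 500K entries per side.

```python
import pandas as pd

# Hypothetical miniature versions of DF1 and DF2 (not the real data).
df1 = pd.DataFrame({"DatabaseId": [0, 1],
                    "DatabaseName": ["UserDB", "SalesDB"]})
df2 = pd.DataFrame({"CreateString": [
    "This create string would fail",
    "This create string has UserDB in it",
    None,
]})

# Cross join: pair every DatabaseName with every non-null CreateString.
pairs = df1.merge(df2.dropna(subset=["CreateString"]), how="cross")

# Keep only the pairs where the name occurs as a plain substring,
# which is the equivalent of InStr(CreateString, DatabaseName) > 0.
mask = pd.Series(
    [name in text
     for name, text in zip(pairs["DatabaseName"], pairs["CreateString"])],
    index=pairs.index,
)
matches = pairs[mask].reset_index(drop=True)
print(matches)
```

Note that a plain substring test also matches partial names: "DB1" is a substring of "DB10.Table5", so this reading can attach the wrong DatabaseId.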

UPDATE: Added more example data:

>>> DF1.head(10)
DatabaseID     DatabaseName
0              DB1
1              DB2
2              DB3
3              DB4
...

>>> DF2.head(10)
CreateString
None
None
None
CREATE VIEW DB1.Table1 AS LOC…
None
REPLACE VIEW DB3.Table3...
CREATE VIEW DB3.Table10 AS SELE...
CREATE VIEW DB55.Table999 AS SELEC...
...

Desired Result
DatabaseID      DatabaseName        CreateText
0               DB1                 CREATE VIEW DB1.Table1 AS LOC…
2               DB3                 REPLACE VIEW DB3.Table3...
2               DB3                 CREATE VIEW DB3.Table10 AS SELE...
...
etc...
...

UPDATE: how to parse the table name:

In [100]: df2['TableName'] = df2.CreateString.str.extract(r'\s+(\w+\.\w+)\s+', expand=True)

In [101]: df2
Out[101]:
                            CreateString DatabaseName      TableName
0                                   None          NaN            NaN
1                                   None          NaN            NaN
2                                   None          NaN            NaN
3         CREATE VIEW DB1.Table1 AS LOC…          DB1     DB1.Table1
4                                   None          NaN            NaN
5            REPLACE VIEW DB3.Table3 ...          DB3     DB3.Table3
6     CREATE VIEW DB3.Table10 AS SELE...          DB3    DB3.Table10
7  CREATE VIEW DB55.Table999 AS SELEC...         DB55  DB55.Table999
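
As a possible variation, the database name and the table name could be pulled out in a single pass with named groups, since `str.extract` uses group names as column names. This is a minimal sketch on a few of the sample strings; the column name `Table` and the variable `parts` are introduced here for illustration only.

```python
import pandas as pd

# A few rows copied from the sample output above.
df2 = pd.DataFrame({"CreateString": [
    None,
    "CREATE VIEW DB1.Table1 AS LOC…",
    "REPLACE VIEW DB3.Table3 ...",
]})

# Named groups become the column names of the extracted DataFrame.
pattern = r"\s+(?P<DatabaseName>\w+)\.(?P<Table>\w+)\s+"
parts = df2["CreateString"].str.extract(pattern, expand=True)

# Attach both columns, then rebuild the qualified name; missing values propagate.
df2 = df2.join(parts)
df2["TableName"] = df2["DatabaseName"].str.cat(df2["Table"], sep=".")
print(df2)
```

The pattern requires whitespace on both sides of the qualified name, so a string such as "REPLACE VIEW DB3.Table3..." (no space before the dots) would not match and would yield NaN.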

Original answer:

You can do it this way:

In [83]: df2['DatabaseName'] = df2.CreateString.str.extract(r'\s+(\w+)\.\w+\s+', expand=True)

In [84]: pd.merge(df2, df1, on='DatabaseName', how='left')
Out[84]:
                            CreateString DatabaseName  DatabaseID
0                                   None          NaN         NaN
1                                   None          NaN         NaN
2                                   None          NaN         NaN
3         CREATE VIEW DB1.Table1 AS LOC…          DB1         0.0
4                                   None          NaN         NaN
5            REPLACE VIEW DB3.Table3 ...          DB3         2.0
6     CREATE VIEW DB3.Table10 AS SELE...          DB3         2.0
7  CREATE VIEW DB55.Table999 AS SELEC...         DB55         NaN
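
The left merge above keeps every row of df2, including the None rows and DB55, which has no entry in df1. If only matching rows are wanted, as in the desired result from the question, one option is an inner merge. The following is a self-contained sketch that rebuilds the sample frames from the question; the variable name `result` is an assumption made for the example.

```python
import pandas as pd

# Sample data rebuilt from the question.
df1 = pd.DataFrame({"DatabaseID": [0, 1, 2, 3],
                    "DatabaseName": ["DB1", "DB2", "DB3", "DB4"]})
df2 = pd.DataFrame({"CreateString": [
    None,
    "CREATE VIEW DB1.Table1 AS LOC…",
    "REPLACE VIEW DB3.Table3 ...",
    "CREATE VIEW DB3.Table10 AS SELE...",
    "CREATE VIEW DB55.Table999 AS SELEC...",
]})

# Extract the database name that precedes the dot in "<db>.<table>".
df2["DatabaseName"] = df2["CreateString"].str.extract(
    r"\s+(\w+)\.\w+\s+", expand=False
)

# Drop rows without an extracted name, then keep only names present in df1.
result = pd.merge(
    df1,
    df2.dropna(subset=["DatabaseName"]),
    on="DatabaseName",
    how="inner",
)
print(result)
```

Because the join is on an extracted key rather than a substring test, it avoids the Cartesian product and matches whole names only, so "DB3" does not match a string that mentions "DB33". The inner merge also introduces no missing values, so DatabaseID keeps its integer dtype instead of becoming float as in the left merge.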
