Check if a string contains another string from a different DataFrame - Python
I have two DataFrames with different columns and sizes.
The first one has several columns, one of which is a string field (column 1). The second DataFrame has 2 columns: a string field (column 4) containing 2 words separated by a comma, and an integer field (column 5).
I need to check whether column 1 in DataFrame 1 contains the words from column 4 in DataFrame 2, and fill in DataFrame 1 with the corresponding information from DataFrame 2.
Example:
df1
                          column 1  column 2  column 3
0  bla bla sample1 bla bla sample2         a         f
1  bla bla sample1 bla bla sample5         b         g
2  bla bla sample3 bla bla sample4         c         h
3  bla bla sample8 bla bla sample7         d         i
4  bla bla sample1 bla bla sample2         e         j
df2
                 column 4  column 5
0  ('sample1', 'sample2')        50
1  ('sample3', 'sample4')        35
2  ('sample1', 'sample5')        18
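For anyone who wants to reproduce the example, here is a minimal sketch of how the two frames might be built. It assumes that column 4 holds Python tuples of two words, as the printed frame suggests, rather than a single comma-separated string; if the real data is a string, it would need to be split first.

```python
import pandas as pd

# Assumption: 'column 4' stores tuples, matching the printed df2 above.
df1 = pd.DataFrame({
    "column 1": [
        "bla bla sample1 bla bla sample2",
        "bla bla sample1 bla bla sample5",
        "bla bla sample3 bla bla sample4",
        "bla bla sample8 bla bla sample7",
        "bla bla sample1 bla bla sample2",
    ],
    "column 2": ["a", "b", "c", "d", "e"],
    "column 3": ["f", "g", "h", "i", "j"],
})

df2 = pd.DataFrame({
    "column 4": [
        ("sample1", "sample2"),
        ("sample3", "sample4"),
        ("sample1", "sample5"),
    ],
    "column 5": [50, 35, 18],
})
```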
I need the output:
df1
                          column 1  column 2  column 3                column 4  column 5
0  bla bla sample1 bla bla sample2         a         f  ('sample1', 'sample2')        50
1  bla bla sample1 bla bla sample5         b         g  ('sample1', 'sample5')        18
2  bla bla sample3 bla bla sample4         c         h  ('sample3', 'sample4')        35
3  bla bla sample8 bla bla sample7         d         i                     NaN       NaN
4  bla bla sample1 bla bla sample2         e         j  ('sample1', 'sample2')        50
Any ideas?
Thanks!
I don't guarantee this will be particularly fast, but it gets the job done. We'll use set logic to check for matches. We have to jump through some hoops so that we can store a list of tuples of the matches, since one row can match more than one pair. I don't think storing lists in a column is a particularly good idea.
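As a small illustration of the set logic on a single string, before it is applied to the whole frame: a pair matches when its words form a subset of the words in the text. The helper name `contains_all_words` is hypothetical and only used here for illustration.

```python
def contains_all_words(text, words):
    """Return True if every item in `words` is a whitespace-separated token of `text`."""
    return set(words) <= set(text.split())


both = contains_all_words("bla bla sample1 bla bla sample2", ("sample1", "sample2"))
only_one = contains_all_words("bla bla sample1 bla bla sample5", ("sample1", "sample2"))
print(both, only_one)  # True False
```

Note that this compares whole tokens, so "sample1" would not match inside "sample10", which is different from a plain substring test.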
import numpy as np
import pandas as pd

# Split column 1 into a set of words per row
df1['setc'] = df1['column 1'].str.split().apply(set)

# Initialize so addition works
df1['column 4'] = [[] for i in range(len(df1))]
df1['column 5'] = 0.0  # float, so unmatched rows can be set to NaN later

for idx, row in df2.iterrows():
    # True where every word of this pair is in the row's word set
    m = (df1.setc.values & set(row['column 4'])) == set(row['column 4'])
    df1.loc[m, 'column 4'] += pd.Series([[row['column 4']] for x in range(len(m))])[m]
    df1.loc[m, 'column 5'] += row['column 5']

df1 = df1.drop(columns='setc')

# NaN where nothing matched (np.NaN was removed in NumPy 2.0; use np.nan)
df1.loc[df1['column 4'].str.len().eq(0), ['column 4', 'column 5']] = np.nan
Output:
                          column 1  column 2  column 3                                  column 4  column 5
0  bla bla sample1 sample5 sample2         a         f  [(sample1, sample2), (sample1, sample5)]      68.0
1  bla bla sample1 bla bla sample5         b         g                      [(sample1, sample5)]      18.0
2  bla bla sample3 bla bla sample4         c         h                      [(sample3, sample4)]      35.0
3  bla bla sample8 bla bla sample7         d         i                                       NaN       NaN
4  bla bla sample1 bla bla sample2         e         j                      [(sample1, sample2)]      50.0
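If each row is expected to match at most one pair, as in the question's sample data, a simpler variant is possible. The sketch below is not the approach above: it keeps only the first matching pair instead of accumulating a list and a sum. The helper name `first_match` is hypothetical, and the frames are rebuilt from the question's example under the assumption that column 4 holds tuples.

```python
import numpy as np
import pandas as pd


def first_match(text, lookup):
    """Return (pair, value) for the first row of `lookup` whose words all occur in `text`."""
    tokens = set(text.split())
    for pair, value in zip(lookup["column 4"], lookup["column 5"]):
        if set(pair) <= tokens:
            return pair, value
    return np.nan, np.nan  # nothing matched


df1 = pd.DataFrame({
    "column 1": [
        "bla bla sample1 bla bla sample2",
        "bla bla sample1 bla bla sample5",
        "bla bla sample3 bla bla sample4",
        "bla bla sample8 bla bla sample7",
        "bla bla sample1 bla bla sample2",
    ],
    "column 2": ["a", "b", "c", "d", "e"],
    "column 3": ["f", "g", "h", "i", "j"],
})
df2 = pd.DataFrame({
    "column 4": [("sample1", "sample2"), ("sample3", "sample4"), ("sample1", "sample5")],
    "column 5": [50, 35, 18],
})

matches = [first_match(text, df2) for text in df1["column 1"]]
df1["column 4"] = [pair for pair, _ in matches]
df1["column 5"] = [value for _, value in matches]
print(df1)
```

This still loops in Python, once per row of df1 and per row of df2, so it is unlikely to be faster on large frames; it only simplifies the bookkeeping when a single match per row is enough.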