Check matches between two data frames regardless of row order with pandas
I have two data frames, for example:
>>> df1
query target
A:1 AZ
B:4 AZ
C:5 AZ
D:1 AZ
>>> df2
query target
B:6 AZ
C:5 AZ
D:1 AZ
A:1 AZ
The idea is simply to check whether each value in df1['query'] is also present in df2['query'], regardless of row order, and to add a new column to df1 to get:
>>> df1
query target new_col
A:1 AZ present
B:4 AZ Not_present
C:5 AZ present
D:1 AZ present
I tried: df1["new_col"] = df2.apply(lambda row: "present" if row[0] == df1["query"][row.name] else "Not_present", axis = 1)
But it only checks for matches row by row.
Thanks for your help.
Edit
What if I have to compare 3 data frames against df1?
Here is the new example:
df1
query
A1
A2
B3
B5
B6
B7
C8
C9
df2
query target
C9 type2
Z6 type2
df3
query target
C10 type3
B6 type3
df4
query target
A1 type4
K9 type1
I would write a loop such as:
for df in dataframes:
    df1['new_col'] = np.where(df1['query'].isin(df['query']), 'Present', 'Not_absent')
The problem is that it overwrites the column df1['new_col'] on every iteration.
In the end I should get:
df1
query new_col
A1 present_type4
A2 not_present
B3 not_present
B5 not_present
B6 present_type3
B7 not_present
C8 not_present
C9 present_type2
Edit for jezrael:
To open my data frames, I have a file.txt file, such as:
Species1
Species2
Species3
It is used to build the right path to where each data frame is located:
/admin/user/project/Species1/dataframe.txt etc
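So I just use them to create each df like this:
for i in open("file.txt"):
    df = open("/admin/user/project/" + i.strip() + "/dataframe.txt", "r")
Then, as described above, I look for the matches between all these data frames and one big data frame (df1), by doing: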
# Read the species names, one per line
with open("file.txt") as f:
    keys = [line.strip() for line in f if line.strip()]
# Build the path to each species' data frame file
values = ["/admin/user/project/" + name + "/dataframe.txt" for name in keys]
# Map each species name to its path
d = dict(zip(keys, values))
I managed to get something like what you showed me, but the values are the paths used to open the data frames:
>>> d
{'Species1': '/admin/user/project/Species1/dataframe.txt', 'Species2': '/admin/user/project/Species2/dataframe.txt', 'Species3': '/admin/user/project/Species3/dataframe.txt'}
Use np.where with isin:
df1['new_col'] = np.where(df1['query'].isin(df2['query']), 'present', 'Not_present')
print(df1)
query target new_col
0 A:1 AZ present
1 B:4 AZ Not_present
2 C:5 AZ present
3 D:1 AZ present
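For reference, a self-contained version of this approach could look like the sketch below. The two frames are rebuilt by hand from the printed example in the question, so their exact construction is an assumption.

```python
import numpy as np
import pandas as pd

# Example frames rebuilt by hand from the question's printed output
df1 = pd.DataFrame({'query': ['A:1', 'B:4', 'C:5', 'D:1'], 'target': ['AZ'] * 4})
df2 = pd.DataFrame({'query': ['B:6', 'C:5', 'D:1', 'A:1'], 'target': ['AZ'] * 4})

# isin tests each df1 value against all values of df2['query'],
# so the position of the rows in df2 does not matter
mask = df1['query'].isin(df2['query'])
df1['new_col'] = np.where(mask, 'present', 'Not_present')
print(df1)
```

The row-wise apply from the question compares only values that share the same index, which is why it misses A:1 (row 0 in df1, row 3 in df2).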
Edit:
d = {'type2':df2, 'type3':df3, 'type4':df4}
df1['new_col'] = 'not_present'
for k, v in d.items():
    df1.loc[df1['query'].isin(v['query']), 'new_col'] = 'Present_{}'.format(k)
print(df1)
query new_col
0 A1 Present_type4
1 A2 not_present
2 B3 not_present
3 B5 not_present
4 B6 Present_type3
5 B7 not_present
6 C8 not_present
7 C9 Present_type2
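One thing to keep in mind with this loop: if a query value occurs in more than one frame, the label written by the last matching frame in the dictionary wins. Should you prefer to keep every matching label, a possible variation is sketched below. The helper name label_matches and the comma-joined label format are made up for illustration, and the frames are rebuilt from the example in the question.

```python
import pandas as pd

def label_matches(base, frames, col='query'):
    """Return one label per row of `base`, naming every frame whose `col` contains the value."""
    # One boolean column per frame: True where the base value occurs in that frame
    hits = pd.DataFrame({name: base[col].isin(frame[col]) for name, frame in frames.items()})
    # Join the labels of all matching frames, or fall back to 'not_present'
    return hits.apply(
        lambda row: ','.join('Present_{}'.format(k) for k, v in row.items() if v) or 'not_present',
        axis=1,
    )

# Example frames rebuilt by hand from the question
df1 = pd.DataFrame({'query': ['A1', 'A2', 'B3', 'B5', 'B6', 'B7', 'C8', 'C9']})
df2 = pd.DataFrame({'query': ['C9', 'Z6'], 'target': ['type2', 'type2']})
df3 = pd.DataFrame({'query': ['C10', 'B6'], 'target': ['type3', 'type3']})
df4 = pd.DataFrame({'query': ['A1', 'K9'], 'target': ['type4', 'type1']})

d = {'type2': df2, 'type3': df3, 'type4': df4}
df1['new_col'] = label_matches(df1, d)
print(df1)
```

On the example data no value occurs in two frames, so the result is the same as with the loop.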
Edit: You can create each DataFrame inside the loop and pass it to isin:
d = {'Species1': '/admin/user/project/Species1/dataframe.txt', 'Species2': '/admin/user/project/Species2/dataframe.txt', 'Species3': '/admin/user/project/Species3/dataframe.txt'}
df1['new_col'] = 'not_present'
for k, v in d.items():
    df = pd.read_csv(v)
    df1.loc[df1['query'].isin(df['query']), 'new_col'] = 'Present_{}'.format(k)
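Note that pd.read_csv splits on commas by default. The files in the question look whitespace-separated, so the separator may need to be set explicitly. The sketch below assumes that layout and writes small stand-in files to a temporary directory so that it runs anywhere; the file contents are copied from the example frames and the temporary paths are placeholders for the real ones.

```python
import os
import tempfile
import pandas as pd

# Stand-in files (assumption: whitespace-separated columns with a header line)
root = tempfile.mkdtemp()
contents = {
    'Species1': 'query target\nC9 type2\nZ6 type2\n',
    'Species2': 'query target\nC10 type3\nB6 type3\n',
    'Species3': 'query target\nA1 type4\nK9 type1\n',
}
d = {}
for name, text in contents.items():
    os.makedirs(os.path.join(root, name))
    path = os.path.join(root, name, 'dataframe.txt')
    with open(path, 'w') as fh:
        fh.write(text)
    d[name] = path  # species name -> path, as in the question

df1 = pd.DataFrame({'query': ['A1', 'A2', 'B3', 'B5', 'B6', 'B7', 'C8', 'C9']})
df1['new_col'] = 'not_present'
for k, v in d.items():
    # sep=r'\s+' splits on any run of whitespace instead of commas
    df = pd.read_csv(v, sep=r'\s+')
    df1.loc[df1['query'].isin(df['query']), 'new_col'] = 'Present_{}'.format(k)
print(df1)
```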
A solution using df.loc[]:
df1.loc[df1['query'].isin(df2['query']),'new_col']='present'
df1.new_col=df1.new_col.fillna('Not_present')
print(df1)
query target new_col
0 A:1 AZ present
1 B:4 AZ Not_present
2 C:5 AZ present
3 D:1 AZ present
Another solution, using pd.merge:
df_temp = df2.copy()
df_temp['new_col'] = 'present'
# Keep only the key column and the marker column
df_temp = df_temp[['query', 'new_col']]
df1 = df1.merge(df_temp, how='left', on='query').fillna('Not_present')
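A variation on the merge approach, sketched here as one possible alternative rather than as part of the original answer, uses the indicator=True option of merge. It avoids the temporary marker column and does not touch missing values in other columns the way a blanket fillna does. The frames are again rebuilt by hand from the question.

```python
import numpy as np
import pandas as pd

# Example frames rebuilt by hand from the question
df1 = pd.DataFrame({'query': ['A:1', 'B:4', 'C:5', 'D:1'], 'target': ['AZ'] * 4})
df2 = pd.DataFrame({'query': ['B:6', 'C:5', 'D:1', 'A:1'], 'target': ['AZ'] * 4})

# Keep only the key column and drop duplicates so the left merge cannot multiply rows of df1
keys = df2[['query']].drop_duplicates()
merged = df1.merge(keys, how='left', on='query', indicator=True)

# The _merge column is 'both' when the query was found in df2 and 'left_only' otherwise
merged['new_col'] = np.where(merged['_merge'] == 'both', 'present', 'Not_present')
df1 = merged.drop(columns='_merge')
print(df1)
```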