Check matches between two data frames regardless of row order with pandas

I have two data frames, for example:

>>> df1

query   target     
A:1     AZ     
B:4     AZ  
C:5     AZ    
D:1     AZ  

>>> df2

query   target
B:6     AZ
C:5     AZ
D:1     AZ
A:1     AZ

The idea is simply to check whether each value of df1['query'] is also present in df2['query'], regardless of row order, and to add a new column to df1 so that I get:

>>> df1

query   target    new_col 
A:1     AZ        present
B:4     AZ        Not_present
C:5     AZ        present
D:1     AZ        present

I tried: df1["new_col"] = df2.apply(lambda row: "present" if row[0] == df1["query"][row.name] else "Not_present", axis = 1)

But it only checks for matches row by row.

Thanks for your help.

Edit

What if I have to compare 3 data frames against df1?

Here is a new example:

df1 

query
A1
A2
B3
B5
B6
B7
C8
C9

df2

query target
C9    type2
Z6    type2

df3
query target
C10   type3
B6    type3

df4
query target
A1    type4
K9    type1

I would write a loop such as:

for df in dataframes: 
   df1['new_col'] = np.where(df1['query'].isin(df['query']), 'Present', 'Not_present')

The problem is that it overwrites the df1['new_col'] column on every iteration.

In the end I should get:

df1 

    query   new_col
    A1      present_type4
    A2      not_present
    B3      not_present
    B5      not_present
    B6      present_type3
    B7      not_present
    C8      not_present
    C9      present_type2

Edit for jezrael

To open my data frames, I have a file.txt file, for example:

Species1
Species2
Species3

It is used to build the right path to where each data frame is stored:

/admin/user/project/Species1/dataframe.txt etc

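So I just open them to create the df like this:

for i in open("file.txt"):
    df = open("/admin/user/project/" + i.strip() + "/dataframe.txt", "r")

Then, as described above, I look for matches between all of these data frames and one big data frame (df1).

By doing: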

keys = []
values = []
with open("file.txt") as fh:
    for names in fh:
        names = names.strip()  # drop the trailing newline
        if not names:
            continue  # skip blank lines
        keys.append(names)
        values.append("/admin/user/project/" + names + "/dataframe.txt")

# Map each species name to the path of its data frame file.
d = dict(zip(keys, values))

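Following what you showed me, I managed to get something:

But the values are the paths to the data frames, not the opened data frames: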

>>> d
{'Species1': '/admin/user/project/Species1/dataframe.txt', 'Species2': '/admin/user/project/Species2/dataframe.txt', 'Species3': '/admin/user/project/Species3/dataframe.txt'}

Use numpy.where with Series.isin:

df1['new_col'] = np.where(df1['query'].isin(df2['query']), 'present', 'Not_present')
print (df1)
  query target      new_col
0   A:1     AZ      present
1   B:4     AZ  Not_present
2   C:5     AZ      present
3   D:1     AZ      present
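
To make the difference from the row-wise attempt concrete, here is a minimal sketch that rebuilds the two sample frames from the question and compares a positional check with the membership check. The variable names row_wise and membership are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question.
df1 = pd.DataFrame({"query": ["A:1", "B:4", "C:5", "D:1"], "target": ["AZ"] * 4})
df2 = pd.DataFrame({"query": ["B:6", "C:5", "D:1", "A:1"], "target": ["AZ"] * 4})

# Positional comparison: row i of df1 is compared only with row i of df2.
row_wise = df1["query"].to_numpy() == df2["query"].to_numpy()

# Membership test: each df1 value is looked up in the whole df2 column,
# so the order of the rows in df2 does not matter.
membership = df1["query"].isin(df2["query"])

df1["new_col"] = np.where(membership, "present", "Not_present")
print(row_wise.tolist())
print(df1)
```

With this data the positional comparison finds no match at all, because the shared values sit on different rows, while isin flags A:1, C:5 and D:1.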

Edit:

d = {'type2':df2, 'type3':df3, 'type4':df4}
df1['new_col'] = 'not_present'
for k, v in d.items(): 
   df1.loc[df1['query'].isin(v['query']), 'new_col'] = 'Present_{}'.format(k)

print (df1)
  query        new_col
0    A1  Present_type4
1    A2    not_present
2    B3    not_present
3    B5    not_present
4    B6  Present_type3
5    B7    not_present
6    C8    not_present
7    C9  Present_type2
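
One thing to be aware of with the loop above: if a query appeared in more than one frame, the frame processed last would win. As a possible variant (not part of the original answer), the sketch below stacks all query columns into one lookup and states the tie-breaking rule explicitly. The names frames, pairs, lookup and label_for are hypothetical helpers, and keeping the first match is an assumption about what is wanted.

```python
import pandas as pd

# Sample frames rebuilt from the question's second example.
df1 = pd.DataFrame({"query": ["A1", "A2", "B3", "B5", "B6", "B7", "C8", "C9"]})
df2 = pd.DataFrame({"query": ["C9", "Z6"], "target": ["type2", "type2"]})
df3 = pd.DataFrame({"query": ["C10", "B6"], "target": ["type3", "type3"]})
df4 = pd.DataFrame({"query": ["A1", "K9"], "target": ["type4", "type1"]})

frames = {"type2": df2, "type3": df3, "type4": df4}

# One long table of (query, label) pairs, where label is the dict key.
pairs = pd.concat(
    [frame[["query"]].assign(label=key) for key, frame in frames.items()],
    ignore_index=True,
)

# Assumption: when a query occurs in several frames, keep the first one.
lookup = pairs.drop_duplicates("query", keep="first").set_index("query")["label"].to_dict()


def label_for(query):
    """Return 'Present_<key>' if the query is known, else 'not_present'."""
    return "Present_{}".format(lookup[query]) if query in lookup else "not_present"


df1["new_col"] = df1["query"].map(label_for)
print(df1)
```

For the sample data this gives the same result as the loop, since no query occurs in more than one frame.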

Edit: You can create the data frames inside the loop and pass them to isin:

d = {'Species1': '/admin/user/project/Species1/dataframe.txt', 'Species2': '/admin/user/project/Species2/dataframe.txt', 'Species3': '/admin/user/project/Species3/dataframe.txt'}

df1['new_col'] = 'not_present'
for k, v in d.items(): 
    df = pd.read_csv(v)
    df1.loc[df1['query'].isin(df['query']), 'new_col'] = 'Present_{}'.format(k)
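
Note that pd.read_csv splits on commas by default. The files in the question look whitespace-separated, so the sketch below assumes that layout and passes sep=r"\s+"; if the real files are tab- or comma-separated, the separator needs to change accordingly. The temporary directory and the file contents are made up so that the example runs on its own, and they only stand in for /admin/user/project.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the project directory from the question.
root = Path(tempfile.mkdtemp())
contents = {
    "Species1": "query target\nC9 type2\nZ6 type2\n",
    "Species2": "query target\nC10 type3\nB6 type3\n",
    "Species3": "query target\nA1 type4\nK9 type1\n",
}
for name, text in contents.items():
    (root / name).mkdir()
    (root / name / "dataframe.txt").write_text(text)

# Same shape as the dict in the question: species name -> file path.
d = {name: root / name / "dataframe.txt" for name in contents}

df1 = pd.DataFrame({"query": ["A1", "A2", "B3", "B5", "B6", "B7", "C8", "C9"]})
df1["new_col"] = "not_present"
for k, v in d.items():
    # Assumption: columns are separated by whitespace.
    df = pd.read_csv(v, sep=r"\s+")
    df1.loc[df1["query"].isin(df["query"]), "new_col"] = "Present_{}".format(k)

print(df1)
```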

A solution using df.loc[]:

df1.loc[df1['query'].isin(df2['query']),'new_col']='present'
df1.new_col=df1.new_col.fillna('Not_present')
print(df1)

  query target      new_col
0   A:1     AZ      present
1   B:4     AZ  Not_present
2   C:5     AZ      present
3   D:1     AZ      present

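Another solution, using pd.merge:

# Keep one row per query so the left merge cannot duplicate rows of df1.
df_temp = df2[['query']].drop_duplicates()
df_temp['new_col'] = 'present'
# Rows of df1 without a match get NaN in new_col, which fillna replaces.
df1 = df1.merge(df_temp, how='left', on='query').fillna('Not_present')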
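
A related option is the indicator parameter of merge, which adds a _merge column telling where each row came from. This is a sketch of that variant rather than the answer's own code; it has the side benefit of not touching missing values in other columns, which a blanket fillna would overwrite.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question.
df1 = pd.DataFrame({"query": ["A:1", "B:4", "C:5", "D:1"], "target": ["AZ"] * 4})
df2 = pd.DataFrame({"query": ["B:6", "C:5", "D:1", "A:1"], "target": ["AZ"] * 4})

# Left merge on the de-duplicated query column; indicator=True adds "_merge",
# whose value is "both" for matched rows and "left_only" otherwise.
merged = df1.merge(
    df2[["query"]].drop_duplicates(),
    on="query",
    how="left",
    indicator=True,
)

merged["new_col"] = np.where(merged["_merge"] == "both", "present", "Not_present")
merged = merged.drop(columns="_merge")
print(merged)
```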

