简体   繁体   中英

Check matches between two data frame whatever the row with pandas

I have two data frame such as :

>>> df1

query   target     
A:1     AZ     
B:4     AZ  
C:5     AZ    
D:1     AZ  

>>> df2

query   target
B:6     AZ
C:5     AZ
D:1     AZ
A:1     AZ

And the idea is simply to check if the values present in the df1['query'] are present in the df2['query'] as well, whatever the order of the row and add a new column the df1 and get:

>>> df1

query   target    new_col 
A:1     AZ        present
B:4     AZ        Not_present
C:5     AZ        present
D:1     AZ        present

I tried : df1["new_col"] = df2.apply(lambda row: "present" if row[0] == df1["query"][row.name] else "Not_present", axis = 1)

but it only check match by rows.

Thanks for your help.

EDIT

What if know I have to compare 3 data frames with the df1

Here is the new exemple:

df1 

query
A1
A2
B3
B5
B6
B7
C8
C9

df2

query target
C9    type2
Z6    type2

df3
query target
C10   type3
B6    type3

df4
query target
A1    type4
K9    type1

and I will do a loop such as :

for df in dataframes: 
   df1['new_col'] = np.where(blast['query'].isin(df['query']), 'Present', 'Not_absent')

the issue is that it will override each time the column df1['New_col']

At the end I should get :

df1 

    query   new_col
    A1      present_type4
    A2.     not_present
    B3.     not_present
    B5.     not_present
    B6.     present_type3
    B7.     not_present
    C8.     not_present
    C9.     present_type2

Edit for jezrael :

In order to open my data frames, I have a file.txt file such as:

Species1
Species2
Species3

It helps to call the wright path link where a data frame is for exemple:

/admin/user/project/Species1/dataframe.txt etc

so I juste call them to create the df such as :

for i in file.txt:
 df = open("/admin/user/project/"+i+"/dataframe.txt","r")

Then I have as I said above to find matches between all these data frame and one big data frame (df1) .

By doing:

values=[]
for names in file.txt:
    values.append("/admin/user/project/"+i+"/dataframe.txt") 

for names file.txt:
    keys.append(names)

dicts = {}
for i in keys:
        dicts[i] = values[i]
d = {}
for i in range(len(keys)):
    d[i]=None

for i in range(len(keys)):
    d[keys[i]] = d.pop(i)

for (k,v), i in zip( d.items(),values):
    d[k] = i

I succeed to get something as you shown me:

but the values are the path to open for the data frames:

>>> d
{'Species1': '/admin/user/project/Species1/dataframe.txt', 'Species2': '/admin/user/project/Species2/dataframe.txt', 'Species3': '/admin/user/project/Species3/dataframe.txt'}

Use numpy.where with Series.isin :

df1['new_col'] = np.where(df1['query'].isin(df2['query']), 'present', 'Not_present')
print (df1)
  query target      new_col
0   A:1     AZ      present
1   B:4     AZ  Not_present
2   C:5     AZ      present
3   D:1     AZ      present

EDIT:

d = {'type2':df2, 'type3':df3, 'type4':df4}
df1['new_col'] = 'not_present'
for k, v in d.items(): 
   df1.loc[df1['query'].isin(v['query']), 'new_col'] = 'Present_{}'.format(k)

print (df1)
  query        new_col
0    A1  Present_type4
1    A2    not_present
2    B3    not_present
3    B5    not_present
4    B6  Present_type3
5    B7    not_present
6    C8    not_present
7    C9  Present_type2

EDIT: You can in loop create DataFrame and pass to isin :

d = {'Species1': '/admin/user/project/Species1/dataframe.txt', 'Species2': '/admin/user/project/Species2/dataframe.txt', 'Species3': '/admin/user/project/Species3/dataframe.txt'}

df1['new_col'] = 'not_present'
for k, v in d.items(): 
    df = pd.read_csv(v)
    df1.loc[df1['query'].isin(df['query']), 'new_col'] = 'Present_{}'.format(k)

Solution using df.loc[] :

df1.loc[df1['query'].isin(df2['query']),'new_col']='present'
df1.new_col=df1.new_col.fillna('Not_present')
print(df1)

  query target      new_col
0   A:1     AZ      present
1   B:4     AZ  Not_present
2   C:5     AZ      present
3   D:1     AZ      present

One other solution using pd.merge

df_temp = df_2.copy()
df_temp['new_col'] = 'present'
df_temp = df_temp['query', new_col]
df1 = df1.merge(df_temp, how='left', on='query').fillna('Not_present')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM