简体   繁体   中英

changing a dataframe based on another dataframe in python

Assume that we have a dataframe df1[1000,7] and another dataframe df2[300,6].
I am setting an example to understand my problem.

df1:  
 col1    col2      col3           ...     col8     newcol
 alex    george    whatever       ...     maria    nan
 julia   bill      sea            ...     paul     nan
 ...     ...        ...           ...     ...      nan
 chris   paul      micheal        ...     alex     nan

my df2 is a subset of df for example the following:

df2:
 col1    col2      col3           ...     col8     
 alex    george    whatever       ...     maria    
 chris   paul      micheal        ...     alex     

I want the new column in my df to get specific values depending if the row exists in my df2. For example let s say here I want to write 'ok' if the row exists and 'missing' if it doesn't. Thus my df should become

df1:  
 col1    col2      col3           ...     col8     newcol
 alex    george    whatever       ...     maria    ok
 julia   bill      sea            ...     paul     missing
 ...     ...        ...           ...     ...      ...
 chris   paul      micheal        ...     alex     ok

how can I do that with one command without using for loops in python? I have thought of using fill.na command to write "missing" after i have inserted the values for the rows that exist in my df, however I am a bit confused on the syntax on how to fill the values for the rows that exist in df2

EDIT: I noticed I inserted the newcol by accident in df2. It does not exist there

I would simply merge (although technically there's no extra column you are merging) the 2 dataframes using the left as your "main" one. If you include the indicator though, it will create a column telling you if the rows are in both dataframes. So here, obviously if you see "both" , it's "ok" and "left_only" , it is "missing".

import pandas as pd
import numpy as np

df1 = pd.DataFrame([
    ['alex','george','maria'],
    ['julia','bill','paul'],
    ['chris','paul','alex']
    ], columns = ['col1','col2', 'col3'])


df2 = pd.DataFrame([
    ['alex','george','maria'],
    ['chris','paul','alex']
    ], columns = ['col1','col2', 'col3'])


merged = df1.merge(df2, how='left', on=list(df1.columns), indicator=True)
df1['newcol'] = np.where(merged['_merge'] == 'both', 'ok', 'missing')

Output:

print(merged)
    col1    col2   col3     _merge
0   alex  george  maria       both
1  julia    bill   paul  left_only
2  chris    paul   alex       both

And with the new column to df1 :

print(df1)
    col1    col2   col3   newcol
0   alex  george  maria       ok
1  julia    bill   paul  missing
2  chris    paul   alex       ok

You might want to try this -

n_columns = len(df1.columns)
df1["newcol"] = (df1 == df2).apply(sum, axis=1).apply(lambda x: "ok" if x == n_columns else "missing")
df1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM