简体   繁体   中英

Python pandas: replace values multiple columns matching multiple columns from another dataframe

I searched a lot for an answer, the closest question was Compare 2 columns of 2 different pandas dataframes, if the same insert 1 into the other in Python , but the answer to this person's particular problem was a simple merge, which doesn't answer the question in a general way.

I have two large dataframes, df1 (typically about 10 million rows), and df2 (about 130 million rows). I need to update values in three columns of df1 with values from three columns of df2, based on two df1 columns matching two df2 columns. It is imperative that the order of df1 remains unchanged, and that only rows with matching values get updated.

This is how the dataframes look like:

df1

chr    snp  x    pos a1 a2
1  1-10020  0  10020  G  A    
1  1-10056  0  10056  C  G    
1  1-10108  0  10108  C  G
1  1-10109  0  10109  C  G    
1  1-10139  0  10139  C  T

Note that it's not always the case that the values of "snp" is chr-pos, it can take many other values with no link to any of the columns (like rs1234, indel-6032 etc)

df2

ID           CHR   STOP  OCHR  OSTOP
rs376643643    1  10040     1  10020
rs373328635    1  10066     1  10056    
rs62651026     1  10208     1  10108    
rs376007522    1  10209     1  10109   
rs368469931    3  30247     1  10139

I need to update ['snp', 'chr', 'pos'] in df1 with df2[['ID', 'OCHR', 'OSTOP']] only when df1[['chr', 'pos']] matches df2[['OCHR', 'OSTOP']]

so in this case, after update, df1 would look like:

chr       snp  x     pos a1 a2    
1  rs376643643  0  10040  G  A    
1  rs373328635  0  10066  C  G    
1  rs62651026   0  10208  C  G    
1  rs376007522  0  10209  C  G    
3  rs368469931  0  30247  C  T

I have used merge as a workaround:

df1 = pd.merge(df1, df2, how='left', left_on=["chr", "pos"], right_on=["OCHR", "OSTOP"],
                                     left_index=False, right_index=False, sort=False)

and then

df1.loc[~df1.OCHR.isnull(), ["snp", "chr", "pos"]] = df1.loc[~df1.OCHR.isnull(), ["ID", "CHR", "STOP"]].values

and then remove the extra columns.

Yes, it works, but what would be a way to do that directly by comparing the values from both dataframes, I just don't know how to formulate it, and I couldn't find an answer anywhere; I guess it could be useful to get a general answer on this.

I tried that but it doesn't work:

df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["snp", "chr", "pos"]] = df2.loc[df2[['OCHR', 'OSTOP']] == df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["chr", "pos"]],['ID', ''CHR', 'STOP']].values

Thanks,

Stephane

You can use the update function (requires setting the matching criteria to index). I've modified your sample data to allow some mismatch.

# your data
# =====================
# df1 pos is modified from 10020 to 10010
print(df1)

   chr      snp  x    pos a1 a2
0    1  1-10020  0  10010  G  A
1    1  1-10056  0  10056  C  G
2    1  1-10108  0  10108  C  G
3    1  1-10109  0  10109  C  G
4    1  1-10139  0  10139  C  T

print(df2)

            ID  CHR   STOP  OCHR  OSTOP
0  rs376643643    1  10040     1  10020
1  rs373328635    1  10066     1  10056
2   rs62651026    1  10208     1  10108
3  rs376007522    1  10209     1  10109
4  rs368469931    3  30247     1  10139

# processing
# ==========================
# set matching columns to multi-level index
x1 = df1.set_index(['chr', 'pos'])['snp']
x2 = df2.set_index(['OCHR', 'OSTOP'])['ID']
# call update function, this is inplace
x1.update(x2)
# replace the values in original df1
df1['snp'] = x1.values
print(df1)

   chr          snp  x    pos a1 a2
0    1      1-10020  0  10010  G  A
1    1  rs373328635  0  10056  C  G
2    1   rs62651026  0  10108  C  G
3    1  rs376007522  0  10109  C  G
4    1  rs368469931  0  10139  C  T

Start by renaiming the columns you want to merge in df2

df2.rename(columns={'OCHR':'chr','OSTOP':'pos'},inplace=True)

Now merge on these columns

df_merged = pd.merge(df1, df2, how='inner', on=['chr', 'pos']) # you might have to preserve the df1 index at this stage, not sure

Next, you want to

updater = df_merged[['D','CHR','STOP']] #this will be your update frame
updater.rename( columns={'D':'snp','CHR':'chr','STOP':'pos'},inplace=True) # rename columns to update original

Finally update (see bottom of this link ):

df1.update( df1_updater) #updates in place
#  chr          snp  x    pos a1 a2
#0   1  rs376643643  0  10040  G  A
#1   1  rs373328635  0  10066  C  G
#2   1   rs62651026  0  10208  C  G
#3   1  rs376007522  0  10209  C  G
#4   3  rs368469931  0  30247  C  T

update works by matching index/column so you might have to string along the index of df1 for the entire process, then do df1_updater.re_index(... before df1.update(df1_updater)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM