
Substitute values from one pandas data frame to another based on condition

I've got two data frames with multiple columns.

df_1 = pd.DataFrame({'A': ['x', '-', 'z'], 'B': [1, 6, 9], 'C': [2, 1, '-']})
> df_1

   A  B  C
0  x  1  2
1  -  6  1
2  z  9  -

df_2 = pd.DataFrame({'A': ['w', 'y', 'y'], 'B': [5, 6, 9], 'C': [2, 1, 8]})
> df_2

   A  B  C
0  w  5  2
1  y  6  1
2  y  9  8

How can I replace values in one data frame with values from another, based on a condition (boolean mask)? Here, missing values are marked as '-', and I want to use the corresponding values from df_2 instead, to get this result:

> df
   A  B  C
0  x  1  2
1  y  6  1
2  z  9  8   

If I understand correctly, you can build a boolean mask by converting the values to strings with astype and comparing them with '-'. Then replace the masked values with those from the other DataFrame using mask, or using where with the mask inverted by ~:

mask = df_1.astype(str) == '-'
print (mask)
       A      B      C
0  False  False  False
1   True  False  False
2  False  False   True

print (df_1.mask(mask, df_2))
   A  B  C
0  x  1  2
1  y  6  1
2  z  9  8

print (df_1.where(~mask, df_2))
   A  B  C
0  x  1  2
1  y  6  1
2  z  9  8
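
As a self-contained sketch of the same idea, the two steps can be wrapped in a small helper. The function name fill_from and its marker parameter are illustrative and not part of the original answer. Note that mask and where align the replacement frame on index and column labels, so both frames are assumed to share the same labels. The final infer_objects call is a precaution, since depending on the pandas version a column such as C may otherwise keep its object dtype.

```python
import pandas as pd

df_1 = pd.DataFrame({'A': ['x', '-', 'z'], 'B': [1, 6, 9], 'C': [2, 1, '-']})
df_2 = pd.DataFrame({'A': ['w', 'y', 'y'], 'B': [5, 6, 9], 'C': [2, 1, 8]})


def fill_from(primary, fallback, marker='-'):
    """Hypothetical helper: replace cells equal to `marker` in `primary`
    with the cell at the same index/column label in `fallback`."""
    # Compare as strings so that numeric cells never raise on comparison
    mask = primary.astype(str) == marker
    filled = primary.mask(mask, fallback)
    # Let pandas re-infer dtypes for columns that held the marker
    return filled.infer_objects()


df = fill_from(df_1, df_2)
print(df)
```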

EDIT, following a comment:

One possible solution comes from su79eu7k, thank you:

masks = [('A', r'[a-zA-Z]'), ('B', r'\d'), ('C', r'\d')]
# A cell is flagged when it does NOT match the pattern expected for its column
print(pd.concat([~(df_1[col].astype(str).str.contains(regex)) for col, regex in masks], axis=1))
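
To show how such a per-column mask could then be applied, here is a minimal runnable sketch using the data frames from the question. The dict name expected is an assumption made for illustration. The patterns describe what a valid cell looks like, so the negated result marks the cells to replace.

```python
import pandas as pd

df_1 = pd.DataFrame({'A': ['x', '-', 'z'], 'B': [1, 6, 9], 'C': [2, 1, '-']})
df_2 = pd.DataFrame({'A': ['w', 'y', 'y'], 'B': [5, 6, 9], 'C': [2, 1, 8]})

# Illustrative per-column patterns describing what a valid cell contains
expected = {'A': r'[a-zA-Z]', 'B': r'\d', 'C': r'\d'}

# One boolean Series per column, True where the cell does NOT look valid
mask = pd.concat(
    [~df_1[col].astype(str).str.contains(regex) for col, regex in expected.items()],
    axis=1,
)

# Replace the flagged cells with the values from df_2
result = df_1.mask(mask, df_2)
print(mask)
print(result)
```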

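Another solution builds the mask in three steps: first use fillna to replace any existing NaN values (so that they are not flagged), then use replace with a dict to turn the missing-value markers into NaN, and finally use isnull to find them.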

import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'A': ['-x', '-', np.nan],'B': [1, 6, 'Unknown'],'C': [2, 1, 'Missing']})
print (df_1)

df_2 = pd.DataFrame({'A': ['w', 'y', 'y'], 'B': [5, 6, 9], 'C': [2, 1, 8]})
print (df_2)

mask_li = ['-','Unknown','Missing']  
d = {x:np.nan for x in mask_li}  

mask = df_1.fillna(1).replace(d).isnull()
print (mask)
       A      B      C
0  False  False  False
1   True  False  False
2  False   True   True

print (df_1.mask(mask, df_2))    
     A  B  C
0   -x  1  2
1    y  6  1
2  NaN  9  8
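
For comparison, an exact-match mask like the one above can probably also be built with DataFrame.isin. This is a swapped-in alternative, not the method from the original answer. It flags only cells that are equal to one of the markers and leaves existing NaN values unflagged, which matches the mask printed above.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'A': ['-x', '-', np.nan],
                     'B': [1, 6, 'Unknown'],
                     'C': [2, 1, 'Missing']})
df_2 = pd.DataFrame({'A': ['w', 'y', 'y'], 'B': [5, 6, 9], 'C': [2, 1, 8]})

mask_li = ['-', 'Unknown', 'Missing']

# True only where a cell equals one of the markers exactly;
# '-x' and NaN are not in the list, so they stay False
mask = df_1.isin(mask_li)

result = df_1.mask(mask, df_2)
print(mask)
print(result)
```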

You can also use str.contains, but only if the valid data never contains any of the strings in mask_li as a substring:

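# Judging by the output below, df_1 here has 'x', '-', 'z' in column A,
# with columns B and C as in the previous example
mask_li = ['-','Unknown','Missing']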

mask= df_1.copy()
for col in df_1.columns:
    mask[col] = mask[col].astype(str).str.contains('|'.join(mask_li))

print (mask)
       A      B      C
0  False  False  False
1   True  False  False
2  False   True   True

print (df_1.mask(mask, df_2))    
   A  B  C
0  x  1  2
1  y  6  1
2  z  9  8

But there is a problem if valid data contains a string from mask_li, e.g. -, as a substring. For example:

import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'A': ['-x', '-', '-z'], 'B': [1, 6, 'Unknown'], 'C': [2, 1, 'Missing']})
print (df_1)

df_2 = pd.DataFrame({'A': ['w', 'y', 'y'], 'B': [5, 6, 9], 'C': [2, 1, 8]})
print (df_2)

mask_li = ['-','Unknown','Missing']    

mask= df_1.copy()
for col in df_1.columns:
    mask[col] = mask[col].astype(str).str.contains('|'.join(mask_li))

print (mask)
      A      B      C
0  True  False  False
1  True  False  False
2  True   True   True

print (df_1.mask(mask, df_2))    
   A  B  C
0  w  1  2
1  y  6  1
2  y  9  8

One possible solution is to compare with - exactly and to use str.contains only for the other markers:

import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'A': ['-x', '-', '-z'], 'B': [1, 6, 'Unknown'], 'C': [2, 1, 'Missing']})
print (df_1)

df_2 = pd.DataFrame({'A': ['w', 'y', 'y'], 'B': [5, 6, 9], 'C': [2, 1, 8]})
print (df_2)

mask_li = ['Unknown','Missing']    

mask= df_1.copy()
for col in df_1.columns:
    column = mask[col].astype(str)
    mask[col] = (column.str.contains('|'.join(mask_li))) | (column == '-')

print (mask)
       A      B      C
0  False  False  False
1   True  False  False
2  False   True   True

print (df_1.mask(mask, df_2))    
    A  B  C
0  -x  1  2
1   y  6  1
2  -z  9  8
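
A variation on this idea, sketched here as an assumption rather than as part of the original answer, is to require that the whole cell matches a marker by using str.fullmatch (available in pandas 1.1 and later). The markers are escaped with re.escape and wrapped in a non-capturing group so that the alternation applies to the complete cell value.

```python
import re

import pandas as pd

df_1 = pd.DataFrame({'A': ['-x', '-', '-z'],
                     'B': [1, 6, 'Unknown'],
                     'C': [2, 1, 'Missing']})
df_2 = pd.DataFrame({'A': ['w', 'y', 'y'], 'B': [5, 6, 9], 'C': [2, 1, 8]})

mask_li = ['-', 'Unknown', 'Missing']

# Escape each marker and group the alternatives so the whole cell must match
pattern = '(?:' + '|'.join(re.escape(m) for m in mask_li) + ')'

# fullmatch is True only when the entire string equals one of the markers
mask = df_1.astype(str).apply(lambda col: col.str.fullmatch(pattern)).astype(bool)

result = df_1.mask(mask, df_2)
print(mask)
print(result)
```

With this mask, '-x' and '-z' are kept because they merely contain a marker, while the bare '-' is replaced.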
