Efficiently replace values from a column to another column Pandas DataFrame

Question

I have a Pandas DataFrame like this:

   col1 col2 col3
1   0.2  0.3  0.3
2   0.2  0.3  0.3
3     0  0.4  0.4
4     0    0  0.3
5     0    0    0
6   0.1  0.4  0.4

I want to replace the values in col1 with the values from the second column (col2), but only where col1 equals 0. Then, for any zeros that remain, I want to do the same with the third column (col3). The desired result is:

   col1 col2 col3
1   0.2  0.3  0.3
2   0.2  0.3  0.3
3   0.4  0.4  0.4
4   0.3    0  0.3
5     0    0    0
6   0.1  0.4  0.4

I did it using the pd.replace function, but it seems too slow. I think there must be a faster way to accomplish this.

df.col1.replace(0,df.col2,inplace=True)
df.col1.replace(0,df.col3,inplace=True)

Is there a faster way to do that, using some other function instead of pd.replace?

Answer 1

Using np.where is faster. Following the same pattern you used with replace:

df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])

However, using a nested np.where is slightly faster:

df['col1'] = np.where(df['col1'] == 0, 
                      np.where(df['col2'] == 0, df['col3'], df['col2']),
                      df['col1'])
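
As a self-contained check, here is a minimal sketch that rebuilds the six-row frame from the question (the construction code is an assumption, since the question only shows the printed table) and applies the nested np.where to it:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand from the table in the question (assumed construction).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

# Where col1 is zero, take col2; where col2 is zero as well, take col3.
df["col1"] = np.where(
    df["col1"] == 0,
    np.where(df["col2"] == 0, df["col3"], df["col2"]),
    df["col1"],
)

result = df["col1"].tolist()
print(result)
```

Row 5 stays at 0 because all three columns are zero there, which matches the desired result in the question.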

Timings

Using the following setup to produce a larger sample DataFrame and timing functions:

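import numpy as np
import pandas as pd

# Repeat the 6-row example 10,000 times to get a 60,000-row frame
df = pd.concat([df]*10**4, ignore_index=True)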

def root_nested(df):
    df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
    return df

def root_split(df):
    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
    return df

def pir2(df):
    df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
    return df

def pir2_2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

def andrew(df):
    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    return df

def pablo(df):
    df['col1'] = df['col1'].replace(0,df['col2'])
    df['col1'] = df['col1'].replace(0,df['col3'])
    return df

I get the following timings:

%timeit root_nested(df.copy())
100 loops, best of 3: 2.25 ms per loop

%timeit root_split(df.copy())
100 loops, best of 3: 2.62 ms per loop

%timeit pir2(df.copy())
100 loops, best of 3: 6.25 ms per loop

%timeit pir2_2(df.copy())
1 loop, best of 3: 2.4 ms per loop

%timeit andrew(df.copy())
100 loops, best of 3: 8.55 ms per loop

I tried timing your method, but it's been running for multiple minutes without completing. As a comparison, timing your method on just the 6 row example DataFrame (not the much larger one tested above) took 12.8 ms.
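
The %timeit lines above only work in IPython. A rough sketch of the same comparison with the standard-library timeit module could look like this; the names small and big, the hand-built sample frame, and the repeat counts are assumptions, and the numbers it prints will differ from machine to machine:

```python
import timeit

import numpy as np
import pandas as pd

# Sample frame rebuilt from the question, then enlarged as in the setup above.
small = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)
big = pd.concat([small] * 10**4, ignore_index=True)


def root_nested(df):
    df["col1"] = np.where(
        df["col1"] == 0,
        np.where(df["col2"] == 0, df["col3"], df["col2"]),
        df["col1"],
    )
    return df


def root_split(df):
    df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
    df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])
    return df


# Time each function on a fresh copy so earlier runs do not change the input.
for func in (root_nested, root_split):
    runs = timeit.repeat(lambda: func(big.copy()), number=10, repeat=3)
    print(f"{func.__name__}: {min(runs) / 10 * 1e3:.2f} ms per call")
```

Note that the measured time includes big.copy(), just as it does in the %timeit calls above.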

Answer 2

I'm not sure if it's faster, but you're right that you can slice the dataframe to get your desired result.

df.col1[df.col1 == 0] = df.col2
df.col1[df.col1 == 0] = df.col3
print(df)

Output:

   col1  col2  col3
0   0.2   0.3   0.3
1   0.2   0.3   0.3
2   0.4   0.4   0.4
3   0.3   0.0   0.3
4   0.0   0.0   0.0
5   0.1   0.4   0.4

Alternatively, if you want it to be more terse (though I don't know if it's faster), you can combine what you did with what I did.

df.col1[df.col1 == 0] = df.col2.replace(0, df.col3)
print(df)

Output:

   col1  col2  col3
0   0.2   0.3   0.3
1   0.2   0.3   0.3
2   0.4   0.4   0.4
3   0.3   0.0   0.3
4   0.0   0.0   0.0
5   0.1   0.4   0.4
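
One caveat: df.col1[...] = ... is chained assignment, which can trigger a SettingWithCopyWarning and, as far as I know, does not update the frame at all when pandas Copy-on-Write is enabled. A hedged sketch of the same idea with a single .loc assignment follows; the sample frame and the names zero and fallback are assumptions for illustration, and Series.mask is swapped in for the replace call:

```python
import pandas as pd

# Sample frame rebuilt from the question (assumed construction).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# Rows where col1 needs a replacement value.
zero = df["col1"].eq(0)

# col2, falling back to col3 wherever col2 is itself zero.
fallback = df["col2"].mask(df["col2"].eq(0), df["col3"])

# One indexing operation on the frame itself, so no chained assignment.
df.loc[zero, "col1"] = fallback[zero]
print(df)
```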

Answer 3

An approach using pd.DataFrame.where and pd.DataFrame.bfill

df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
df

Another approach using np.argmax

def pir2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

I know there is a better way to use numpy to slice. I just can't think of it at the moment.
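
One possible alternative for the slicing step is np.take_along_axis (available in NumPy 1.15 and later). This is only a sketch and not necessarily the better way hinted at above; it assumes the frame holds exactly col1, col2, col3 in that order, and the sample frame is rebuilt by hand:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (assumed construction).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

values = df.to_numpy()

# Column position of the first non-zero entry in each row.
# For an all-zero row argmax returns 0, so that row keeps its zero.
first_nonzero = (values != 0).argmax(axis=1)

# Pick one element per row at the positions computed above.
picked = np.take_along_axis(values, first_nonzero[:, None], axis=1).ravel()

df["col1"] = picked
print(df)
```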

Answer 4

Generally speaking, there are three kinds of methods for this sort of conditional replacement. They are:

numpy.where
pandas.Series.mask, or pandas.Series.where, which is the inverse of Series.mask
pandas.DataFrame.loc

You can try pandas.Series.mask

df['col1'] = df['col1'].mask(df['col1'].eq(0), df['col2'])
df['col1'] = df['col1'].mask(df['col1'].eq(0), df['col3'])

   col1  col2  col3
1   0.2   0.3   0.3
2   0.2   0.3   0.3
3   0.4   0.4   0.4
4   0.3   0.0   0.3
5   0.0   0.0   0.0
6   0.1   0.4   0.4

Or pandas.Series.where

df['col1'] = df['col1'].where(df['col1'].ne(0), df['col2'])
df['col1'] = df['col1'].where(df['col1'].ne(0), df['col3'])

Finally, you can use loc

df.loc[df['col1'].eq(0), 'col1'] = df['col2']
df.loc[df['col1'].eq(0), 'col1'] = df['col3']
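
To see that the three variants agree, here is a small sketch that wraps each one in a helper and compares the results on the sample frame. The helper names with_mask, with_where and with_loc, and the hand-built frame, are made up for this illustration:

```python
import pandas as pd

# Sample frame rebuilt from the question (assumed construction).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)


def with_mask(frame):
    out = frame.copy()
    for other in ("col2", "col3"):
        # mask replaces values where the condition is True.
        out["col1"] = out["col1"].mask(out["col1"].eq(0), out[other])
    return out


def with_where(frame):
    out = frame.copy()
    for other in ("col2", "col3"):
        # where keeps values where the condition is True and replaces the rest.
        out["col1"] = out["col1"].where(out["col1"].ne(0), out[other])
    return out


def with_loc(frame):
    out = frame.copy()
    for other in ("col2", "col3"):
        # The right-hand Series is aligned on the index of the selected rows.
        out.loc[out["col1"].eq(0), "col1"] = out[other]
    return out


same = with_mask(df).equals(with_where(df)) and with_mask(df).equals(with_loc(df))
print(same)
```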

Answer 5

Alternatively, you can use combine:

replace_zeros = lambda x, y: y if x == 0 else x
df['col1'].combine(df['col2'], func=replace_zeros).combine(df['col3'], func=replace_zeros)

Output:

1    0.2
2    0.2
3    0.4
4    0.3
5    0.0
6    0.1
dtype: float64
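
Note that the expression above returns a new Series and leaves df unchanged, so the result has to be assigned back to col1. If there were more fallback columns, the chained combine calls could be folded with functools.reduce. The following is only a sketch: the fallbacks list and the hand-built sample frame are assumptions, and since combine calls the function once per element it is unlikely to be the fastest option:

```python
from functools import reduce

import pandas as pd

# Sample frame rebuilt from the question (assumed construction).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)


def replace_zeros(x, y):
    # Called once per pair of elements: keep x unless it is zero.
    return y if x == 0 else x


# Fallback columns in priority order (hypothetical list for this sketch).
fallbacks = ["col2", "col3"]

# Equivalent to col1.combine(col2, ...).combine(col3, ...), assigned back to col1.
df["col1"] = reduce(
    lambda acc, name: acc.combine(df[name], replace_zeros),
    fallbacks,
    df["col1"],
)
print(df["col1"])
```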

5 answers

solution1
53 ACCEPTED 2016-10-06 19:11:46

solution2
10 2016-10-06 19:03:41

solution3
3 2016-10-06 19:25:37

solution4
0 2022-05-09 19:59:37

solution5
0 2022-07-09 18:22:10
