Fastest way to replace multiple values of a pandas dataframe with values from another dataframe

I am trying to replace multiple rows of a pandas dataframe with values from another dataframe.

Suppose I have 10,000 rows of customer_id in my dataframe df1, and I want to replace these customer_id values with 3,000 values from df2.

For the sake of illustration, let's generate the dataframes (below).

Say these 10 rows in df1 represent 10,000 rows, and the 3 rows from df2 represent 3,000 values.

import numpy as np
import pandas as pd
np.random.seed(42)

# Create df1 with unique values
arr1 = np.arange(100,200,10)
np.random.shuffle(arr1)
df1 = pd.DataFrame(data=arr1, 
                   columns=['customer_id'])

# Create df2 for new unique_values
df2 = pd.DataFrame(data = [1800, 1100, 1500],
                   index = [180, 110, 150], # this is customer_id column on df1
                   columns = ['customer_id_new'])

I want to replace 180 with 1800, 110 with 1100, and 150 with 1500.

I know we can do the following...

# Replace multiple values
replace_values = {180 : 1800, 110 : 1100, 150 : 1500 }                                                                                          
df1_replaced = df1.replace({'customer_id': replace_values})

And it works fine if I only have a few lines...

But if I have thousands of lines that I need to replace, how could I do this without typing out what values I want to change one at a time?

EDIT: To clarify, I don't need to use replace. Anything that replaces those values in df1 with values from df2 in the fastest, most efficient way is fine.

You can pass the column from df2 directly to replace; a Series passed as to_replace behaves like a dict, with its index (the old customer_id values) as the keys:

df1['customer_id'] = df1['customer_id'].replace(df2['customer_id_new'])

Alternatively, you can do it in place:

df1['customer_id'].replace(df2['customer_id_new'], inplace=True)
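For reference, here is a minimal end-to-end sketch of this approach, rebuilding the df1 and df2 from the question (the assert at the end is just an illustrative sanity check):

import numpy as np
import pandas as pd

np.random.seed(42)

arr1 = np.arange(100, 200, 10)
np.random.shuffle(arr1)
df1 = pd.DataFrame(data=arr1, columns=['customer_id'])

df2 = pd.DataFrame(data=[1800, 1100, 1500],
                   index=[180, 110, 150],
                   columns=['customer_id_new'])

# The Series' index (180, 110, 150) is matched against df1's values
df1['customer_id'] = df1['customer_id'].replace(df2['customer_id_new'])

# Sanity check: none of the old ids should remain
assert not df1['customer_id'].isin([180, 110, 150]).any()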

You can try this, using map with a pd.Series:

df1['customer_id'] = df1['customer_id'].map(df2.squeeze()).fillna(df1['customer_id'])

or

df1['customer_id'] = df1['customer_id'].map(df2['customer_id_new']).fillna(df1['customer_id'])

Output:

   customer_id
0       1800.0
1       1100.0
2       1500.0
3        100.0
4        170.0
5        120.0
6        190.0
7        140.0
8        130.0
9        160.0
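Note that map returns NaN for any customer_id that has no match in df2's index, which is why the fillna is needed and why the column is upcast to float (as in the output above). If you want to keep integer ids, a cast at the end restores the dtype; a small sketch, assuming fillna has removed every NaN:

df1['customer_id'] = (df1['customer_id']
                      .map(df2['customer_id_new'])
                      .fillna(df1['customer_id'])
                      .astype('int64'))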

Going with your original method using replace, you can simplify it with to_dict to create your mapping dictionary without having to do it manually:

df1["customer_id"] = df1["customer_id"].replace(df2["customer_id_new"].to_dict())

>>> df1
   customer_id
0         1800
1         1100
2         1500
3          100
4          170
5          120
6          190
7          140
8          130
9          160
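Since the question asks for the fastest approach, it is worth timing map against replace on data shaped like yours. Below is a benchmark sketch under assumed, illustrative sizes (1,000,000 rows, 3,000 replacement pairs); in my experience map with fillna tends to win on large inputs:

import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.default_rng(0)
big = pd.DataFrame({'customer_id': rng.integers(0, 10_000, size=1_000_000)})
mapping = pd.Series(rng.integers(100_000, 200_000, size=3_000),
                    index=rng.choice(10_000, size=3_000, replace=False))

t_replace = timeit(lambda: big['customer_id'].replace(mapping.to_dict()), number=3)
t_map = timeit(lambda: big['customer_id'].map(mapping).fillna(big['customer_id']),
               number=3)
print(f'replace: {t_replace:.3f}s   map+fillna: {t_map:.3f}s')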

Apart from the useful answers above, you could also try parallelising the DataFrame if you have a multi-core processor.

For example:

import pandas as pd, numpy as np, seaborn as sns
from multiprocessing import Pool

num_partitions = 10  # number of chunks to split the dataframe into
num_cores = 4        # number of cores on your machine

iris = sns.load_dataset('iris')  # already a DataFrame; example data

def parallelize_dataframe(df, func):
    # Split the dataframe, apply func to each chunk in a worker process,
    # then concatenate the processed chunks back together.
    df_split = np.array_split(df, num_partitions)
    with Pool(num_cores) as pool:
        df = pd.concat(pool.map(func, df_split))
    return df

In place of the func parameter, you can pass a function that applies your replace logic to each chunk, as sketched below.
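A hypothetical chunk-level worker (the name replace_ids is illustrative, reusing df1 and df2 from the question) could look like this; note that it must be defined at module top level so multiprocessing can pickle it, and the call should sit under an if __name__ == '__main__': guard on platforms that spawn rather than fork:

# Hypothetical worker: applies the question's mapping to one chunk
def replace_ids(chunk):
    return chunk.replace({'customer_id': df2['customer_id_new'].to_dict()})

if __name__ == '__main__':
    df1_replaced = parallelize_dataframe(df1, replace_ids)

That said, with only ~3,000 replacement pairs, the single-process map and replace answers above will likely be faster, since process startup and pickling overhead tend to dominate at this scale.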

Thanks!
