
How to find shared entries between two pandas data frames and use them to create an identical column in both data frames?

Using pandas, I have created two data frames similar to the ones below.

input_df1 = pd.DataFrame({'names':['phone,mobile,cell','boat,ship','car'], 'values':[1,3,3]})
input_df2 = pd.DataFrame({'names':['cell,phone','car,automobile', 'boat'], 'values':[3,7,1]})


I would like to change the 'names' column to contain only names that are found in both data frames. The goal is to have the values for these names automatically group together in comparison plots. The final data frames should look something like what is shown below.

[image: enter image description here]

It is important that the names are not reduced to a single name when more than one name is available in both data frames, as in row zero above. Rows that have no names in common should preferably be removed (but I can also do this manually beforehand). Preferably this should also be done without a for loop, since the actual data frame has over 50k rows.

I have tried playing around with input_df.names.str.contains() and input_df.names.isin(), but I can't figure out how to find a name in input_df1 that matches a name in input_df2, compare them for the shortest name, and then replace the longer one with the shorter (which is what my mind thinks should be done).

Here is one strategy to do it.

# your data
# =======================================
import numpy as np
import pandas as pd

input_df1 = pd.DataFrame({'names':['phone,mobile,cell','boat,ship','car'], 'values':[1,3,3]})
input_df1

               names  values
0  phone,mobile,cell       1
1          boat,ship       3
2                car       3

input_df2 = pd.DataFrame({'names':['cell,phone','car,automobile', 'boat'], 'values':[3,7,1]})
input_df2

            names  values
0      cell,phone       3
1  car,automobile       7
2            boat       1

We first convert the flat name records into stacked name records, with one name per entry and the original row label kept as the outer index level.

# groupby-to-stack function: split one row's comma-separated names
# into a Series with one name per entry
# ===============================
def func(group):
    return pd.Series(group['names'].values[0].split(','))

stacked_names1 = input_df1.groupby(level=0).apply(func)
stacked_names1

0  0     phone
   1    mobile
   2      cell
1  0      boat
   1      ship
2  0       car
dtype: object

stacked_names2 = input_df2.groupby(level=0).apply(func)
stacked_names2

0  0          cell
   1         phone
1  0           car
   1    automobile
2  0          boat
dtype: object
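
As an aside, the same stacking step can probably be written without a helper function. The following is a minimal sketch, assuming pandas 0.25 or later (where Series.explode is available); the variable name exploded_names1 is only illustrative and does not appear in the original answer. Unlike the groupby version above, the result has a flat index in which the original row label is repeated (0, 0, 0, 1, 1, 2) rather than a two-level index.

```python
import pandas as pd

input_df1 = pd.DataFrame({'names': ['phone,mobile,cell', 'boat,ship', 'car'],
                          'values': [1, 3, 3]})

# Split each comma-separated string into a list, then give every list
# element its own row. The original row label is repeated in the index.
exploded_names1 = input_df1['names'].str.split(',').explode()
print(exploded_names1)
```

Because the row label survives in the index, a later groupby(level=0) can still reassemble the names per original row.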

Next, get the common names by using np.intersect1d.

common_names = np.intersect1d(stacked_names1, stacked_names2)
common_names

array(['boat', 'car', 'cell', 'phone'], dtype=object)
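
For readers less familiar with np.intersect1d, here is a small standalone sketch using plain lists with the same names as above (the lists names1 and names2 are illustrative only). It shows that np.intersect1d returns the sorted, unique values present in both inputs, and that a plain Python set intersection yields the same members, just without a guaranteed order.

```python
import numpy as np

# Illustrative inputs: the individual names from each data frame.
names1 = ['phone', 'mobile', 'cell', 'boat', 'ship', 'car']
names2 = ['cell', 'phone', 'car', 'automobile', 'boat']

# Sorted, de-duplicated values that occur in both inputs.
common_np = np.intersect1d(names1, names2)

# Same members via set intersection; the order of a set is not defined.
common_set = set(names1) & set(names2)

print(common_np)
print(sorted(common_set))
```

Either result can be passed to .isin in the next step.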

Use .isin to flag the names that appear in both data frames.

stacked_names1.isin(common_names)

0  0     True
   1    False
   2     True
1  0     True
   1    False
2  0     True
dtype: bool

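Finally, convert the stacked records back to flat records, again with a groupby on the outer index level. Note that assigning the result through .values assumes every row keeps at least one common name; if a row loses all of its names, the lengths no longer match and the assignment fails.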

def func2(group):
    return pd.Series(','.join(group.values.tolist()))

input_df1['names'] = stacked_names1[stacked_names1.isin(common_names)].groupby(level=0).apply(func2).values
input_df1

        names  values
0  phone,cell       1
1        boat       3
2         car       3

input_df2['names'] = stacked_names2[stacked_names2.isin(common_names)].groupby(level=0).apply(func2).values
input_df2

        names  values
0  cell,phone       3
1         car       7
2        boat       1
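
The question also asks for rows without any common name to be removed, which the positional .values assignment above does not handle. The following is a rough sketch of one way to combine the steps into a single function, not the original answer's method. The helper name keep_common_names and its parameters are hypothetical, and it assumes pandas 0.25 or later and a unique index on both data frames.

```python
import pandas as pd


def keep_common_names(df1, df2, col='names', sep=','):
    """Return copies of df1 and df2 whose `col` holds only shared names.

    Hypothetical helper for illustration. Rows with no shared name are
    dropped. Assumes each data frame has a unique index.
    """
    # One name per row; the original row label repeats in the index.
    s1 = df1[col].str.split(sep).explode()
    s2 = df2[col].str.split(sep).explode()
    common = set(s1) & set(s2)

    def rebuild(df, exploded):
        kept = exploded[exploded.isin(common)]
        # Re-join the surviving names per original row label.
        joined = kept.groupby(level=0).agg(sep.join)
        # Select only the rows that still have at least one name.
        out = df.loc[joined.index].copy()
        out[col] = joined
        return out

    return rebuild(df1, s1), rebuild(df2, s2)


input_df1 = pd.DataFrame({'names': ['phone,mobile,cell', 'boat,ship', 'car'],
                          'values': [1, 3, 3]})
input_df2 = pd.DataFrame({'names': ['cell,phone', 'car,automobile', 'boat'],
                          'values': [3, 7, 1]})

out1, out2 = keep_common_names(input_df1, input_df2)
print(out1)
print(out2)
```

With the sample data every row keeps at least one name, so the result matches the tables above. The name order within each row is preserved, so row 0 stays 'phone,cell' in one frame and 'cell,phone' in the other; sorting the names before joining would be one way to make the labels identical for plotting.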
