Using pandas, I have created two data frames similar to the ones below.
input_df1 = pd.DataFrame({'names':['phone,mobile,cell','boat,ship','car'], 'values':[1,3,3]})
input_df2 = pd.DataFrame({'names':['cell,phone','car,automobile', 'boat'], 'values':[3,7,1]})
I would like to change the 'names' column to contain only names that are found in both data frames. The goal is to have the values for these names automatically group together in comparison plots. The final data frames should look something like what is shown below.
It is important that the names are not reduced to only one name if more than one name is available in both data frames, as in row zero above. Rows that do not have common names between them should preferably be removed (but I can also do this manually beforehand). Preferably this should also be done without a for loop, since the actual data frame has over 50k rows.
I have tried playing around with input_df.names.str.contains() and input_df.names.isin(), but I can't figure out how to find a name in input_df1 that matches a name in input_df2, compare them for the shortest name, and then replace the longer one with the shorter (which is what my mind thinks should be done).
Here is one strategy to do it.
# your data
# =======================================
import numpy as np
import pandas as pd
input_df1 = pd.DataFrame({'names':['phone,mobile,cell','boat,ship','car'], 'values':[1,3,3]})
input_df1
               names  values
0  phone,mobile,cell       1
1          boat,ship       3
2                car       3
input_df2 = pd.DataFrame({'names':['cell,phone','car,automobile', 'boat'], 'values':[3,7,1]})
input_df2
            names  values
0      cell,phone       3
1  car,automobile       7
2            boat       1
We first convert the flat name records into stacked name records.
# groupby-tostack function
# ===============================
def func(group):
    # each group is a single row; split its comma-separated names into a Series
    return pd.Series(group['names'].values[0].split(','))
stacked_names1 = input_df1.groupby(level=0).apply(func)
stacked_names1
0  0     phone
   1    mobile
   2      cell
1  0      boat
   1      ship
2  0       car
dtype: object
stacked_names2 = input_df2.groupby(level=0).apply(func)
stacked_names2
0  0          cell
   1         phone
1  0           car
   1    automobile
2  0          boat
dtype: object
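As a possible alternative to the groupby-based stacking above, here is a minimal sketch that assumes pandas 0.25 or later, where Series.explode is available. The variable name exploded_names1 is only illustrative. Instead of a two-level index, the result repeats the original row label for every name, which still allows grouping on level 0 afterwards.

```python
import pandas as pd

input_df1 = pd.DataFrame({'names': ['phone,mobile,cell', 'boat,ship', 'car'],
                          'values': [1, 3, 3]})

# Split each comma-separated string into a list, then give every name its own row.
# explode() repeats the original row label, so the index identifies the source row.
exploded_names1 = input_df1['names'].str.split(',').explode()
print(exploded_names1)
```

Because this avoids calling a Python function once per row, it may scale better to a frame with tens of thousands of rows, although that has not been measured here.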
Next, get the common names by using np.intersect1d.
common_names = np.intersect1d(stacked_names1, stacked_names2)
common_names
array(['boat', 'car', 'cell', 'phone'], dtype=object)
Use .isin to keep the valid names.
stacked_names1.isin(common_names)
0  0     True
   1    False
   2     True
1  0     True
   1    False
2  0     True
dtype: bool
Finally, convert stacked records back to flat records, again by a groupby on outer level index.
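def func2(group):
    # join the surviving names of one original row back into a single string
    return pd.Series(','.join(group.values.tolist()))
# assigning .values assumes every row keeps at least one common name; otherwise the lengths will not match
input_df1['names'] = stacked_names1[stacked_names1.isin(common_names)].groupby(level=0).apply(func2).values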
input_df1
        names  values
0  phone,cell       1
1        boat       3
2         car       3
input_df2['names'] = stacked_names2[stacked_names2.isin(common_names)].groupby(level=0).apply(func2).values
input_df2
        names  values
0  cell,phone       3
1         car       7
2        boat       1
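The steps above can also be wrapped into one helper. The following is a sketch under a few assumptions: keep_common_names and rebuild are hypothetical names, Series.explode requires pandas 0.25 or later, and each frame is assumed to have a unique index. Unlike the direct .values assignment, it drops rows that have no name in common with the other frame, which the question lists as a preference.

```python
import numpy as np
import pandas as pd

def keep_common_names(df1, df2, col='names', sep=','):
    """Return copies of df1 and df2 whose `col` keeps only names found in both.

    Hypothetical helper for illustration; assumes each frame has a unique index.
    """
    # One name per row, labelled with the index of the row it came from.
    s1 = df1[col].str.split(sep).explode()
    s2 = df2[col].str.split(sep).explode()
    common = np.intersect1d(s1, s2)

    def rebuild(df, s):
        # Keep only common names, then join them back per original row.
        joined = s[s.isin(common)].groupby(level=0).agg(sep.join)
        # Rows without any common name are missing from `joined`, so they are dropped.
        out = df.loc[joined.index].copy()
        out[col] = joined
        return out

    return rebuild(df1, s1), rebuild(df2, s2)

input_df1 = pd.DataFrame({'names': ['phone,mobile,cell', 'boat,ship', 'car'],
                          'values': [1, 3, 3]})
input_df2 = pd.DataFrame({'names': ['cell,phone', 'car,automobile', 'boat'],
                          'values': [3, 7, 1]})

result1, result2 = keep_common_names(input_df1, input_df2)
print(result1)
print(result2)
```

Note that the names within a row keep their original order, so the first row reads phone,cell in one frame and cell,phone in the other. If the labels need to match exactly for plotting, one option is to sort before joining, for example by aggregating with lambda g: sep.join(sorted(g)) in place of sep.join.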