How to find shared entries between two pandas data frames and use them to create an identical column in both data frames?
Using pandas, I have created two data frames similar to the ones below.
input_df1 = pd.DataFrame({'names':['phone,mobile,cell','boat,ship','car'], 'values':[1,3,3]})
input_df2 = pd.DataFrame({'names':['cell,phone','car,automobile', 'boat'], 'values':[3,7,1]})
I would like to change the 'names' column to contain only the names that are found in both data frames. The goal is to have the values for these names automatically grouped together in comparison plots. The final data frames should look something like what is shown below.
It is important that the names are not reduced to a single name if more than one name is available in both data frames, as in row zero above. Rows that have no names in common should preferably be removed (but I can also do this manually beforehand). Preferably this should also be done without a for loop, since the actual data frame has over 50k rows.
I have tried playing around with input_df.names.str.contains() and input_df.names.isin(), but I can't figure out how to find a name in input_df1 that matches a name in input_df2, compare them for the shortest name, and then replace the longer one with the shorter (which is what my mind thinks should be done).
Here is one strategy to do it.
import numpy as np
import pandas as pd

# your data
# =======================================
input_df1 = pd.DataFrame({'names':['phone,mobile,cell','boat,ship','car'], 'values':[1,3,3]})
input_df1
names values
0 phone,mobile,cell 1
1 boat,ship 3
2 car 3
input_df2 = pd.DataFrame({'names':['cell,phone','car,automobile', 'boat'], 'values':[3,7,1]})
input_df2
names values
0 cell,phone 3
1 car,automobile 7
2 boat 1
We first convert the flat name records to stacked name records.
# groupby-tostack function
# ===============================
def func(group):
    # each group holds a single row; split its comma-separated names into one entry per name
    return pd.Series(group['names'].values[0].split(','))
stacked_names1 = input_df1.groupby(level=0).apply(func)
stacked_names1
0 0 phone
1 mobile
2 cell
1 0 boat
1 ship
2 0 car
dtype: object
stacked_names2 = input_df2.groupby(level=0).apply(func)
stacked_names2
0 0 cell
1 phone
1 0 car
1 automobile
2 0 boat
dtype: object
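As a possible alternative to the groupby-based stacking above, newer pandas versions (0.25 and later) offer Series.explode. The following is only a sketch of the same idea, not part of the original answer; the variable name exploded_names1 is made up for illustration. Note that the result has a flat index that repeats the original row label instead of a two-level index, so a later groupby(level=0) still groups by original row.

```python
import pandas as pd

input_df1 = pd.DataFrame({'names': ['phone,mobile,cell', 'boat,ship', 'car'],
                          'values': [1, 3, 3]})

# Split each comma-separated string into a list, then give every list
# element its own row. The original row label is repeated in the index.
exploded_names1 = input_df1['names'].str.split(',').explode()
print(exploded_names1)
```

Because no Python-level function is applied per row, this may be more convenient on a frame with tens of thousands of rows, although that has not been measured here.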
Next, get the common names by using np.intersect1d.
common_names = np.intersect1d(stacked_names1, stacked_names2)
common_names
array(['boat', 'car', 'cell', 'phone'], dtype=object)
Use .isin to keep the valid names.
stacked_names1.isin(common_names)
0 0 True
1 False
2 True
1 0 True
1 False
2 0 True
dtype: bool
Finally, convert the stacked records back to flat records, again with a groupby on the outer index level.
def func2(group):
    # join the surviving names of one original row back into a comma-separated string
    return pd.Series(','.join(group.values.tolist()))
input_df1['names'] = stacked_names1[stacked_names1.isin(common_names)].groupby(level=0).apply(func2).values
input_df1
names values
0 phone,cell 1
1 boat 3
2 car 3
input_df2['names'] = stacked_names2[stacked_names2.isin(common_names)].groupby(level=0).apply(func2).values
input_df2
names values
0 cell,phone 3
1 car 7
2 boat 1
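The assignments above rely on every row keeping at least one common name; if a row loses all of its names, the lengths no longer match. The sketch below is one way to cover that case and is not part of the original answer: keep_common_names is a hypothetical helper, the extra 'plane' and 'train,rail' rows are invented test data, and sorting the names within each row is an added assumption so that both frames end up with the same label (for example 'cell,phone' in both).

```python
import numpy as np
import pandas as pd

def keep_common_names(df, common, col='names'):
    """Hypothetical helper: keep only names in `common`, drop rows left empty."""
    exploded = df[col].str.split(',').explode()
    kept = exploded[exploded.isin(common)]
    # Sorting within each row is an assumption added here so that the
    # same set of names produces the same label in both frames.
    joined = kept.groupby(level=0).agg(lambda s: ','.join(sorted(s)))
    out = df.loc[joined.index].copy()   # rows with no common name are absent
    out[col] = joined                   # aligned on the original row labels
    return out

df1 = pd.DataFrame({'names': ['phone,mobile,cell', 'boat,ship', 'car', 'plane'],
                    'values': [1, 3, 3, 9]})
df2 = pd.DataFrame({'names': ['cell,phone', 'car,automobile', 'boat', 'train,rail'],
                    'values': [3, 7, 1, 5]})

# Names that occur anywhere in both frames.
common = np.intersect1d(df1['names'].str.split(',').explode().to_numpy(),
                        df2['names'].str.split(',').explode().to_numpy())

out1 = keep_common_names(df1, common)
out2 = keep_common_names(df2, common)
print(out1)
print(out2)
```

With this test data, the 'plane' and 'train,rail' rows are dropped and row zero reads 'cell,phone' in both results, which should let the plots group those rows together.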