How do I find shared entries between two pandas data frames and use them to create identical columns in both data frames?

Question

Using pandas, I have created two data frames similar to the ones below.

input_df1 = pd.DataFrame({'names':['phone,mobile,cell','boat,ship','car'], 'values':[1,3,3]})
input_df2 = pd.DataFrame({'names':['cell,phone','car,automobile', 'boat'], 'values':[3,7,1]})

I would like to change the 'names' column to contain only names that are found in both data frames. The goal is to have the values for these names automatically group together in comparison plots. The final data frames should look something like what is shown below.

It is important that the names are not reduced to a single name when more than one name is available in both data frames, as in row zero above. Rows that have no names in common should preferably be removed (but I can also do this manually beforehand). Preferably this should be done without a for loop, since the actual data frame has over 50k rows.

I have tried playing around with input_df.names.str.contains() and input_df.names.isin(), but I can't figure out how to find a name in input_df1 that matches a name in input_df2, compare the two for the shortest name, and then replace the longer one with the shorter (which is what I think should be done).

Answer 1

Here is one strategy to do it.

# your data
# =======================================
import numpy as np
import pandas as pd

input_df1 = pd.DataFrame({'names':['phone,mobile,cell','boat,ship','car'], 'values':[1,3,3]})
input_df1

               names  values
0  phone,mobile,cell       1
1          boat,ship       3
2                car       3

input_df2 = pd.DataFrame({'names':['cell,phone','car,automobile', 'boat'], 'values':[3,7,1]})
input_df2

            names  values
0      cell,phone       3
1  car,automobile       7
2            boat       1

We first convert the flat name records to stacked name records.

# groupby-to-stack function: split one row's comma-separated names
# ===============================
def func(group):
    return pd.Series(group['names'].values[0].split(','))

stacked_names1 = input_df1.groupby(level=0).apply(func)
stacked_names1

0  0     phone
   1    mobile
   2      cell
1  0      boat
   1      ship
2  0       car
dtype: object

stacked_names2 = input_df2.groupby(level=0).apply(func)
stacked_names2

0  0          cell
   1         phone
1  0           car
   1    automobile
2  0          boat
dtype: object
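As a side note, newer pandas versions (0.25 and later) offer Series.explode, which can produce a similar stacked layout without a per-row groupby. The following is only a sketch of that alternative, not the method used in this answer; the variable name stacked is chosen here for illustration. Unlike the MultiIndex above, the result simply repeats the original row label in its index.

```python
import pandas as pd

input_df1 = pd.DataFrame({'names': ['phone,mobile,cell', 'boat,ship', 'car'],
                          'values': [1, 3, 3]})

# Split each comma-separated string into a list, then give every list
# element its own row; the original row label is repeated in the index.
stacked = input_df1['names'].str.split(',').explode()
print(stacked)
```

Because the row label is kept in the index, a later groupby(level=0) can still reassemble the names per original row.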

Next, get the common names by using np.intersect1d.

common_names = np.intersect1d(stacked_names1, stacked_names2)
common_names

array(['boat', 'car', 'cell', 'phone'], dtype=object)
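The result above is alphabetical because np.intersect1d returns the sorted, unique values present in both inputs. A small standalone illustration with plain lists (names1 and names2 are just the stacked names typed out by hand):

```python
import numpy as np

names1 = ['phone', 'mobile', 'cell', 'boat', 'ship', 'car']
names2 = ['cell', 'phone', 'car', 'automobile', 'boat']

# intersect1d de-duplicates and sorts, so input order is not preserved.
common = np.intersect1d(names1, names2)
print(common.tolist())
```

The order of common_names does not matter for the .isin step that follows, since .isin only tests membership.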

Use .isin to keep the valid names.

stacked_names1.isin(common_names)

0  0     True
   1    False
   2     True
1  0     True
   1    False
2  0     True
dtype: bool

Finally, convert the stacked records back to flat records, again with a groupby on the outer index level.

def func2(group):
    return pd.Series(','.join(group.values.tolist()))

input_df1['names'] = stacked_names1[stacked_names1.isin(common_names)].groupby(level=0).apply(func2).values
input_df1

        names  values
0  phone,cell       1
1        boat       3
2         car       3

input_df2['names'] = stacked_names2[stacked_names2.isin(common_names)].groupby(level=0).apply(func2).values
input_df2

        names  values
0  cell,phone       3
1         car       7
2        boat       1
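Two caveats may be worth keeping in mind. First, assigning with .values assumes every row keeps at least one common name; if a row has none, the lengths no longer match. Second, row 0 ends up as 'phone,cell' in one frame and 'cell,phone' in the other, so the strings are not identical. The sketch below is one possible way to handle both, assuming pandas 0.25 or later and a unique index on each frame. The helper keep_common_names and the extra 'plane' row are hypothetical and added only for illustration; sorting the names before joining is a deviation from the answer above.

```python
import pandas as pd

def keep_common_names(df, common, col='names'):
    """Hypothetical helper: keep only names in `common`; drop rows left empty."""
    stacked = df[col].str.split(',').explode()
    kept = stacked[stacked.isin(common)]
    # Rejoin per original row label, sorting so both frames get the same string.
    # Rows with no common name never appear in `joined`.
    joined = kept.groupby(level=0).agg(lambda s: ','.join(sorted(s)))
    out = df.loc[joined.index].copy()
    out[col] = joined
    return out

# 'plane' is an extra illustrative row with no match in df2.
df1 = pd.DataFrame({'names': ['phone,mobile,cell', 'boat,ship', 'car', 'plane'],
                    'values': [1, 3, 3, 5]})
df2 = pd.DataFrame({'names': ['cell,phone', 'car,automobile', 'boat'],
                    'values': [3, 7, 1]})

# Names that occur in both frames.
common = set(df1['names'].str.split(',').explode()) & \
         set(df2['names'].str.split(',').explode())

result1 = keep_common_names(df1, common)
result2 = keep_common_names(df2, common)
print(result1)
print(result2)
```

With this data, the 'plane' row is dropped from result1, and row 0 reads 'cell,phone' in both results.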
