How to find shared entries between two pandas data frames and use them to create an identical column in both data frames?

Using pandas, I have created two data frames similar to the ones below.

input_df1 = pd.DataFrame({'names':['phone,mobile,cell','boat,ship','car'], 'values':[1,3,3]})
input_df2 = pd.DataFrame({'names':['cell,phone','car,automobile', 'boat'], 'values':[3,7,1]})

I would like to change the 'names' column to contain only names that are found in both data frames. The goal is to have the values for these names automatically grouped together in comparison plots. The final data frames should look something like what is shown below.

[Image: the desired output data frames]

It is important that the names are not reduced to a single name when more than one name is available in both data frames, as in row zero above. Rows that have no names in common should preferably be removed (but I can also do this manually beforehand). Preferably this should also be done without a for loop, since the actual data frame has over 50k rows.

I have tried playing around with input_df.names.str.contains() and input_df.names.isin(), but I can't figure out how to find a name in input_df1 that matches a name in input_df2, compare them to find the shortest name, and then replace the longer one with the shorter (which is what I think should be done).

Here is one way to do it.

import numpy as np
import pandas as pd

# your data
# =======================================
input_df1 = pd.DataFrame({'names':['phone,mobile,cell','boat,ship','car'], 'values':[1,3,3]})
input_df1

               names  values
0  phone,mobile,cell       1
1          boat,ship       3
2                car       3

input_df2 = pd.DataFrame({'names':['cell,phone','car,automobile', 'boat'], 'values':[3,7,1]})
input_df2

            names  values
0      cell,phone       3
1  car,automobile       7
2            boat       1

We first convert the flat name records to stacked name records.

# groupby-to-stack function
# ===============================
def func(group):
    # each group is a single row; split its comma-separated string into one entry per name
    return pd.Series(group['names'].values[0].split(','))

stacked_names1 = input_df1.groupby(level=0).apply(func)
stacked_names1

0  0     phone
   1    mobile
   2      cell
1  0      boat
   1      ship
2  0       car
dtype: object

stacked_names2 = input_df2.groupby(level=0).apply(func)
stacked_names2

0  0          cell
   1         phone
1  0           car
   1    automobile
2  0          boat
dtype: object
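
The same stacked layout can probably be reached without a custom groupby function. The following is a minimal sketch, assuming pandas 0.25 or later (where Series.explode is available). The helper name stack_names is hypothetical and not part of the original answer. Unlike the MultiIndex above, the result simply repeats the original row label for every name.

```python
import pandas as pd

def stack_names(df, column="names"):
    """Split comma-separated names into one row per name (hypothetical helper).

    The original row label is repeated in the index for every name.
    """
    return df[column].str.split(",").explode()

input_df1 = pd.DataFrame({'names': ['phone,mobile,cell', 'boat,ship', 'car'],
                          'values': [1, 3, 3]})

stacked = stack_names(input_df1)
print(stacked.tolist())        # ['phone', 'mobile', 'cell', 'boat', 'ship', 'car']
print(stacked.index.tolist())  # [0, 0, 0, 1, 1, 2]
```

Because the row label survives in the index, a later groupby(level=0) can still put the names of each row back together.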

Next, get the common names by using np.intersect1d.

common_names = np.intersect1d(stacked_names1, stacked_names2)
common_names

array(['boat', 'car', 'cell', 'phone'], dtype=object)
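
As a small illustration of what this step does, the sketch below calls np.intersect1d on two plain lists holding the same names as the stacked series above. It returns the sorted, unique values present in both inputs. A plain Python set intersection gives the same members and is shown only for comparison.

```python
import numpy as np

# Individual names taken from the two example data frames
names1 = ['phone', 'mobile', 'cell', 'boat', 'ship', 'car']
names2 = ['cell', 'phone', 'car', 'automobile', 'boat']

# Sorted, unique values that occur in both inputs
common = np.intersect1d(names1, names2)
print(common.tolist())  # ['boat', 'car', 'cell', 'phone']

# Equivalent membership using built-in sets (unordered)
common_set = set(names1) & set(names2)
print(common_set == set(common.tolist()))  # True
```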

Use .isin to flag the names to keep.

stacked_names1.isin(common_names)

0  0     True
   1    False
   2     True
1  0     True
   1    False
2  0     True
dtype: bool

Finally, convert the stacked records back to flat records, again with a groupby on the outer index level.

def func2(group):
    return pd.Series(','.join(group.values.tolist()))

# .values assigns by position, so every row must keep at least one common name
input_df1['names'] = stacked_names1[stacked_names1.isin(common_names)].groupby(level=0).apply(func2).values
input_df1

        names  values
0  phone,cell       1
1        boat       3
2         car       3

input_df2['names'] = stacked_names2[stacked_names2.isin(common_names)].groupby(level=0).apply(func2).values
input_df2

        names  values
0  cell,phone       3
1         car       7
2        boat       1
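
The steps above can also be collected into one function that drops rows with no shared name, which the positional .values assignment cannot handle. This is only a sketch under stated assumptions: pandas 0.25 or later for Series.explode, and a unique index on both data frames. The function name keep_common_names and the extra 'plane' row are hypothetical and added purely for illustration.

```python
import pandas as pd

def keep_common_names(df1, df2, column="names", sep=","):
    """Return copies of df1 and df2 whose `column` holds only names found in both.

    Hypothetical helper: rows left without any shared name are dropped.
    Assumes both data frames have a unique index.
    """
    # One row per individual name; the original row label is repeated in the index
    s1 = df1[column].str.split(sep).explode()
    s2 = df2[column].str.split(sep).explode()

    common = set(s1) & set(s2)

    def rebuild(df, stacked):
        kept = stacked[stacked.isin(common)]
        # Join the surviving names of each original row back into one string
        joined = kept.groupby(level=0).agg(sep.join)
        out = df.loc[joined.index].copy()
        out[column] = joined
        return out

    return rebuild(df1, s1), rebuild(df2, s2)

# The 'plane' row has no counterpart in the second data frame
input_df1 = pd.DataFrame({'names': ['phone,mobile,cell', 'boat,ship', 'car', 'plane'],
                          'values': [1, 3, 3, 9]})
input_df2 = pd.DataFrame({'names': ['cell,phone', 'car,automobile', 'boat'],
                          'values': [3, 7, 1]})

out1, out2 = keep_common_names(input_df1, input_df2)
print(out1['names'].tolist())  # ['phone,cell', 'boat', 'car']
print(out2['names'].tolist())  # ['cell,phone', 'car', 'boat']
```

Assigning the joined series by index rather than by position is what lets rows disappear without a length mismatch. Note that the names inside a row keep their original order, so 'phone,cell' and 'cell,phone' are still different strings. If they need to compare equal for plotting, sorting the names before joining would be one option.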

