Efficiently extracting information from a pandas data frame based on two column values

Question

I am trying to extract information from a data frame which is indexed by productId and customerId. I have a large number (millions) of (productId, customerId) pairs and am interested in finding the most efficient way to do this.

I have two data frames: df1, containing the (customerId, productId) pairs I'm interested in, and df2, containing the information of interest, which is indexed by (customerId, productId) pairs.

So far I have tried something like:

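def f(x, y):
    # Sum `col` over the rows of df2 matching one (customerId, productId) pair.
    mask = (df2.customerId == x) & (df2.productId == y)
    return df2.loc[mask, "col"].sum()

# Called once per row of df1, so df2 is scanned in full for every pair.
values = df1.apply(lambda row: f(row.customerId, row.productId), axis=1)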

which works fine but is very slow.

Any suggestions on improvements?

Answer 1

You can try a list comprehension:

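# all(axis=1) requires both key columns to match within the same row of df2.
values = [df2.loc[df2[['customerId', 'productId']].eq(i).all(axis=1), 'col'].sum() for i in df1[['customerId', 'productId']].values]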
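
A list comprehension still scans df2 once per pair, so with millions of pairs it may not be much faster than apply. One possible vectorized alternative, sketched here under assumptions, is to aggregate df2 once with groupby and then left-merge the sums onto df1. The column names come from the question; the toy data below is invented purely for illustration, and the fillna(0) step assumes you want 0 for pairs that do not occur in df2, which is what the original apply returns.

```python
import pandas as pd

# Toy data, invented for illustration; column names follow the question.
df1 = pd.DataFrame({
    "customerId": [1, 1, 2, 3],
    "productId": [10, 11, 10, 12],
})
df2 = pd.DataFrame({
    "customerId": [1, 1, 1, 2, 4],
    "productId": [10, 10, 11, 10, 12],
    "col": [1.0, 2.0, 5.0, 7.0, 9.0],
})

keys = ["customerId", "productId"]

# Sum `col` once per (customerId, productId) pair instead of rescanning df2 per row.
sums = df2.groupby(keys, as_index=False)["col"].sum()

# A left merge keeps every row of df1 in order; pairs absent from df2 get NaN.
merged = df1.merge(sums, on=keys, how="left")

# Match the behaviour of the apply() version, where a missing pair sums to 0.
values = merged["col"].fillna(0)
print(values.tolist())
```

Note that merge returns a frame with a fresh integer index, so if df1 has a non-default index you may want to set values.index = df1.index afterwards.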

Answer 1 posted: 2020-01-27 09:58:54
