
Efficiently extracting information from a pandas dataframe based on two column values

I am trying to extract information from a data frame that is indexed by productId and customerId. I have a large number (millions) of (productId, customerId) pairs and want to find the most efficient way to do this.

I have two data frames: df1, which contains the (customerId, productId) pairs I'm interested in, and df2, which contains the information of interest and is indexed by (customerId, productId) pairs.

So far I have tried something like:

def f(x, y):
    # Sum `col` over the rows of df2 that match one (customerId, productId) pair
    return df2.col[(df2.customerId == x) & (df2.productId == y)].sum()

values = df1.apply(lambda x: f(x.customerId, x.productId), axis=1)

This works fine but is very slow.

Any suggestions on improvements?

You can try a list comprehension:

values = [df2.loc[df2[['customerId', 'productId']].eq(i).all(axis=1), 'col'].sum() for i in df1[['customerId', 'productId']].values]
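
Another option, which is not part of the answer above, is to avoid scanning df2 once per pair: aggregate df2 a single time with groupby, then left-merge the totals onto df1. The sketch below assumes that df1 and df2 both hold customerId and productId as ordinary columns and that the value column is named col, as in the question. The sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df1 = pd.DataFrame({
    "customerId": [1, 1, 2, 3],
    "productId": [10, 11, 10, 12],
})
df2 = pd.DataFrame({
    "customerId": [1, 1, 1, 2, 4],
    "productId": [10, 10, 11, 10, 12],
    "col": [1.0, 2.0, 5.0, 7.0, 9.0],
})

keys = ["customerId", "productId"]

# Sum `col` once per (customerId, productId) pair in df2.
totals = df2.groupby(keys, as_index=False)["col"].sum()

# A left merge keeps every row of df1 in its original order.
# Pairs that never occur in df2 come back as NaN, so fill them with 0
# to match the sum over an empty selection in the question's code.
merged = df1.merge(totals, on=keys, how="left")
values = merged["col"].fillna(0)
```

If df2 really has (customerId, productId) as its index rather than as columns, calling df2.reset_index() first should give the layout assumed here.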
