Faster way to select different sections of dataframe than a for loop?

I have a dataframe of Instacart orders:

order_id    product_id  add_to_cart_order   reordered   product_name    
32          49683       7                   1           Cucumber Kirby  
52          49683       4                   1           Cucumber Kirby  
88          49683       20                  0           Cucumber Kirby  
95          49683       12                  1           Cucumber Kirby  
111         49683       5                   1           Cucumber Kirby  

reordered is either 1 or 0, indicating whether the customer had ordered that product in a previous order.

I want to get info on a per-product basis; for example, I would like to know which products have the most reorders (among other things). The only way I can think of to do this is to iterate through the dataframe, selecting the rows for one product name at a time and summing the values of reordered for each product. The only problem is that there are about 92k different products, and this is breaking my computer and taking forever.

Here's my code. I'm saving the results to a dictionary, but I'm open to other approaches. There must be a more efficient way to do this?

reordersums = {}
for product in list(products.product_name):
    # Select the rows whose product name matches the product we are checking,
    # then sum the values in column "reordered"
    reordersum = order_products[order_products.product_name == product].reordered.sum()
    reordersums[product] = reordersum
print(reordersums)

Try using the groupby interface:

# Group the order rows by product
# (order_products is the frame from the question that holds the "reordered" column)
group_products = order_products.groupby('product_name')

# Sum the "reordered" column within each group
reordered_sums = group_products['reordered'].agg('sum')
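
As a rough sketch of how this replaces the loop end to end, the snippet below builds a small stand-in for the order data, sums reordered per product, sorts to find the most-reordered products, and converts the result to a dictionary like the one in the question. The frame name order_products follows the question; the rows and the second product (Organic Banana) are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the question's order data; values are illustrative only.
order_products = pd.DataFrame({
    "order_id": [32, 52, 88, 95, 111, 120],
    "product_name": ["Cucumber Kirby"] * 5 + ["Organic Banana"],
    "reordered": [1, 1, 0, 1, 1, 1],
})

# One pass over the data: sum the 0/1 flags within each product group.
reordered_sums = order_products.groupby("product_name")["reordered"].sum()

# Products with the most reorders first.
top_products = reordered_sums.sort_values(ascending=False)

# Same shape as the dictionary built by the original loop.
reordersums = reordered_sums.to_dict()

print(top_products.head())
```

The grouping is done once for all products, rather than rescanning the whole frame once per product name as the loop does.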

Please try the following; however, I'm not sure if this is what you are looking for:

Your illustrated DataFrame structure:

   order_id  product_id  add_to_cart_order  reordered    product_name
0        32       49683                  7          1  Cucumber Kirby
1        52       49683                  4          1  Cucumber Kirby
2        88       49683                 20          0  Cucumber Kirby
3        95       49683                 12          1  Cucumber Kirby
4       111       49683                  5          1  Cucumber Kirby

Solution: groupby + DataFrameGroupBy.filter + sum(). On this sample the filter keeps only the reordered == 1 rows, because the reordered == 0 group has a single row.

>>> df.groupby('reordered').filter(lambda x: len(x) > 1).groupby(['product_name']).sum().reset_index()
     product_name  order_id  product_id  add_to_cart_order  reordered
0  Cucumber Kirby       290      198732                 28          4

Or, as suggested by @Amit in the comments:

>>> df[df.reordered==1].groupby('product_name').sum().reset_index()
     product_name  order_id  product_id  add_to_cart_order  reordered
0  Cucumber Kirby       290      198732                 28          4

Or, in case you only want to see product_name and reordered:

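>>> # Series.sum(level=...) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
>>> df.set_index('product_name').reordered.ge(1).groupby(level=0).sum().astype(int).reset_index()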
     product_name  reordered
0  Cucumber Kirby          4
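
Since the question asks for per-product info "among other things", one possible extension is to compute several statistics in a single groupby using named aggregation. This is only a sketch on the sample rows above: the output column names (times_ordered, times_reordered, reorder_rate) are hypothetical and chosen for illustration. It also avoids summing order_id and product_id, whose totals are not meaningful.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "order_id": [32, 52, 88, 95, 111],
    "product_id": [49683] * 5,
    "add_to_cart_order": [7, 4, 20, 12, 5],
    "reordered": [1, 1, 0, 1, 1],
    "product_name": ["Cucumber Kirby"] * 5,
})

# Named aggregation: each keyword is an output column,
# given as (input column, aggregation function).
per_product = (
    df.groupby("product_name")
      .agg(
          times_ordered=("order_id", "count"),    # rows per product
          times_reordered=("reordered", "sum"),   # number of 1 flags
          reorder_rate=("reordered", "mean"),     # share of rows that are reorders
      )
      .reset_index()
)

print(per_product)
```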
