Faster way to select different parts of a dataframe than a for loop?

Question

I have a dataframe of Instacart orders:

order_id    product_id  add_to_cart_order   reordered   product_name    
32          49683       7                   1           Cucumber Kirby  
52          49683       4                   1           Cucumber Kirby  
88          49683       20                  0           Cucumber Kirby  
95          49683       12                  1           Cucumber Kirby  
111         49683       5                   1           Cucumber Kirby

reordered is either 1 or 0, indicating whether or not the customer had ordered that product in a previous order.

I want to get information on a per-product basis. For example, I would like to know which products have the most reorders (among other things). The only way I can think of to do this is to iterate through the dataframe, selecting the rows for one product name at a time and summing the values of reordered for each product. The only problem is that there are about 92k different products, so this is breaking my computer and taking forever. Here's my code. I'm saving the results to a dictionary, but I'm open to other approaches. There must be a more efficient way to do this?

reordersums = {}
for product in list(products.product_name):
    # Select the rows whose product name matches the product we are checking,
    # then sum the values in column "reordered"
    reordersum = order_products[order_products.product_name == product].reordered.sum()
    reordersums[product] = reordersum
print(reordersums)

Answer 1

Try using the groupby interface:

# Group up the dataframe by product
# (order_products is the frame that holds the reordered column in the question)
group_products = order_products.groupby('product_name')

# Sum the groups on the reordered column
reordered_sums = group_products['reordered'].agg('sum')
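
To address the "most reorders" part of the question, here is a minimal self-contained sketch of the same groupby idea, sorted so the most-reordered products come first and converted to a dictionary like the one the loop builds. The sample rows are made up for illustration (the "Banana" row and its product_id are hypothetical), and the frame name order_products is taken from the question.

```python
import pandas as pd

# Hypothetical sample data mirroring the columns shown in the question.
# The "Banana" row is invented so that there is more than one product.
order_products = pd.DataFrame({
    "order_id": [32, 52, 88, 95, 111, 120],
    "product_id": [49683, 49683, 49683, 49683, 49683, 24852],
    "add_to_cart_order": [7, 4, 20, 12, 5, 1],
    "reordered": [1, 1, 0, 1, 1, 1],
    "product_name": ["Cucumber Kirby"] * 5 + ["Banana"],
})

# Sum "reordered" per product in one grouped operation, largest total first
reorder_sums = (
    order_products.groupby("product_name")["reordered"]
    .sum()
    .sort_values(ascending=False)
)

# Convert to a plain dict, matching the structure the loop was building
reorder_dict = reorder_sums.to_dict()
print(reorder_dict)
```

The loop in the question rescans the whole frame once per product, whereas groupby computes every per-product total in a single grouped operation, which is why it should scale much better to roughly 92k products.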

Answer 2

Please try the following; however, I'm not sure if this is what you are looking for:

Your illustrated DataFrame structure:

   order_id  product_id  add_to_cart_order  reordered    product_name
0        32       49683                  7          1  Cucumber Kirby
1        52       49683                  4          1  Cucumber Kirby
2        88       49683                 20          0  Cucumber Kirby
3        95       49683                 12          1  Cucumber Kirby
4       111       49683                  5          1  Cucumber Kirby

Solution: groupby + filter + sum()

>>> df.groupby('reordered').filter(lambda x: len(x) > 1).groupby(['product_name']).sum().reset_index()
     product_name  order_id  product_id  add_to_cart_order  reordered
0  Cucumber Kirby       290      198732                 28          4

Or, as suggested by @Amit in the comments section:

>>> df[df.reordered==1].groupby('product_name').sum().reset_index()
     product_name  order_id  product_id  add_to_cart_order  reordered
0  Cucumber Kirby       290      198732                 28          4

Or, in case you only want to see product_name and reordered:

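>>> # sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
>>> df.set_index('product_name').reordered.ge(1).groupby(level=0).sum().astype(int).reset_index()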
     product_name  reordered
0  Cucumber Kirby          4
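
Since the question asks for per-product information "among other things", the sketch below shows one way several statistics could be computed in a single groupby pass using named aggregation. The output column names (times_ordered, times_reordered, reorder_rate) are hypothetical names chosen for illustration, and df is assumed to be the five-row frame shown above.

```python
import pandas as pd

# The five sample rows from the question
df = pd.DataFrame({
    "order_id": [32, 52, 88, 95, 111],
    "product_id": [49683] * 5,
    "add_to_cart_order": [7, 4, 20, 12, 5],
    "reordered": [1, 1, 0, 1, 1],
    "product_name": ["Cucumber Kirby"] * 5,
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregation function)
per_product = (
    df.groupby("product_name")
    .agg(
        times_ordered=("reordered", "count"),   # rows per product
        times_reordered=("reordered", "sum"),   # number of reorders
        reorder_rate=("reordered", "mean"),     # share of orders that were reorders
    )
    .reset_index()
)
print(per_product)
```

For the sample rows this gives 5 orders, 4 reorders, and a reorder rate of 0.8 for Cucumber Kirby.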

Answer 1
0 2019-01-05 16:12:01

Answer 2
0 accepted 2019-01-05 16:35:35
