Faster way to select different sections of a dataframe than a for loop?
I have a dataframe of Instacart orders:
order_id product_id add_to_cart_order reordered product_name
32 49683 7 1 Cucumber Kirby
52 49683 4 1 Cucumber Kirby
88 49683 20 0 Cucumber Kirby
95 49683 12 1 Cucumber Kirby
111 49683 5 1 Cucumber Kirby
reordered is either 1 or 0, indicating whether the customer had ordered that product in a previous order.
I want to get info on a per-product basis; for example, I would like to know which products have the most reorders (among other things). The only way I can think of to do this is to iterate through the dataframe, selecting the rows for one product name at a time and summing the values of reordered for each product. The only problem is that there are about 92k different products, and this is breaking my computer and taking forever. Here's my code. I'm saving the results to a dictionary, but I'm open to other approaches. There must be a more efficient way to do this?
reordersums = {}
for product in list(products.product_name):
    # Select the rows whose product name matches the product we are checking, sum the values in column "reordered"
    reordersum = order_products[order_products.product_name == product].reordered.sum()
    reordersums[product] = reordersum
print(reordersums)
Please try the below; however, I'm not sure if this is what you are looking for:
Your illustrated DataFrame structure:
order_id product_id add_to_cart_order reordered product_name
0 32 49683 7 1 Cucumber Kirby
1 52 49683 4 1 Cucumber Kirby
2 88 49683 20 0 Cucumber Kirby
3 95 49683 12 1 Cucumber Kirby
4 111 49683 5 1 Cucumber Kirby
Solution: groupby + GroupBy.filter + sum()
>>> df.groupby('reordered').filter(lambda x: len(x) > 1).groupby(['product_name']).sum().reset_index()
product_name order_id product_id add_to_cart_order reordered
0 Cucumber Kirby 290 198732 28 4
Or, as suggested by @Amit in the comments:
>>> df[df.reordered==1].groupby('product_name').sum().reset_index()
product_name order_id product_id add_to_cart_order reordered
0 Cucumber Kirby 290 198732 28 4
Or, in case you only want to see product_name and reordered:
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
>>> df.set_index('product_name').reordered.ge(1).groupby(level=0).sum().astype(int).reset_index()
product_name reordered
0 Cucumber Kirby 4
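Since the question asks which products have the most reorders "among other things", here is a minimal sketch of a per-product summary built with a single groupby instead of a loop. It assumes the frame is named order_products as in the question. The two "Banana" rows and their product_id are made up so that there is more than one product to rank, and the names summary, top, reorders, orders and reorder_rate are only illustrative.

```python
import pandas as pd

# Sample rows from the question, plus two hypothetical "Banana" rows
# (made up for illustration) so there is more than one product to rank.
order_products = pd.DataFrame({
    "order_id": [32, 52, 88, 95, 111, 32, 52],
    "product_id": [49683, 49683, 49683, 49683, 49683, 1, 1],
    "add_to_cart_order": [7, 4, 20, 12, 5, 1, 2],
    "reordered": [1, 1, 0, 1, 1, 1, 0],
    "product_name": ["Cucumber Kirby"] * 5 + ["Banana"] * 2,
})

# One pass over the data: group by product and aggregate only the
# "reordered" column, so the id columns are not summed by accident.
summary = (
    order_products
    .groupby("product_name")["reordered"]
    .agg(reorders="sum", orders="count", reorder_rate="mean")
)

# Products with the most reorders (top 10 here).
top = summary.nlargest(10, "reorders")

# Same shape as the dictionary built by the loop in the question.
reordersums = summary["reorders"].to_dict()

print(top)
print(reordersums)
```

Selecting the reordered column before aggregating avoids summing order_id and product_id, which is what produces the 290 and 198732 values in the outputs above. If a dictionary is not required, summary can be used directly for sorting and filtering.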