Fast iterative changes in pandas dataframe groups

I have a large pandas dataframe that consists of users, the products that each user bought and product prices.

The code I am using is shown below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Create Dataframe randomly
product_list = ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10', 'P11', 'P12']
user_list = ['U1', 'U2', 'U3', 'U4', 'U5', 'U6', 'U7', 'U8', 'U9', 'U10']
price_list = [50, 90, 100]

# Create random transactions
transactions = pd.DataFrame(np.random.choice(user_list, 200))
transactions['item'] = pd.DataFrame(np.random.choice(product_list, 200))
transactions['quantity'] = 1
transactions['price'] = np.random.choice([50, 90, 100], 200)
transactions.columns = ['user', 'item', 'quantity', 'price']
transactions['suggested_price'] = 0

# Create groups to apply suggested discount
grouped = transactions.groupby(["user", "item"])

# Apply suggested discount
for key, group in grouped:
    transactions.set_value(
        group.index, 'suggested_discount', np.random.random())

My biggest problem with this code is the performance of the last block of code that applies the suggested discount to each user (customer). The original dataframe has over 6 million rows.

Also, I noticed that the slowest step is changing the values for each group, i.e., the line:

transactions.set_value(
    group.index, 'suggested_discount', np.random.random())

In the original code there are other steps before this line of code.

I was not expecting that changing the values of the group columns would be so slow. Is there a better, faster implementation?

Thanks!

Suppose that instead of np.random.random() you have a function that takes the price and suggested_price columns as arguments. In that case, try using apply, transform, or agg on those columns. Vectorized operations are much quicker than a Python for loop.

For example, first set the user and item fields as indexes, then you can directly set a value from the grouped data to that new dataframe:

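tr = transactions.set_index(["user", "item"])
# Example aggregate: total of price and suggested_price per (user, item) group.
# The result is indexed by (user, item), so pandas aligns it with tr on assignment.
tr["suggested_discount"] = transactions.groupby(["user", "item"])[["price", "suggested_price"]].sum().sum(axis=1)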
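If the set_index step is not wanted, transform is another option, since it returns one value per original row instead of one per group. The following is only a sketch: the small random frame and the discount rule (group mean as the suggested price) are assumptions made for illustration, not something taken from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
transactions = pd.DataFrame({
    "user": rng.choice(["U1", "U2", "U3"], 20),
    "item": rng.choice(["P1", "P2"], 20),
    "price": rng.choice([50, 90, 100], 20),
})

# transform broadcasts the per-group result back to every row of the group,
# so the output has the same index as `transactions`.
group_mean = transactions.groupby(["user", "item"])["price"].transform("mean")

# Hypothetical rule: suggest the group's mean price, and express the
# discount relative to the price actually paid.
transactions["suggested_price"] = group_mean
transactions["suggested_discount"] = 1 - transactions["suggested_price"] / transactions["price"]
```

Because the result is already row-aligned, it can be combined with other columns using ordinary vectorized arithmetic.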

Anyhow, the key is not using a for loop.
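For the exact case in the question, one random value per (user, item) group, a loop-free version might look like the sketch below. It assumes a reasonably recent pandas where groupby(...).ngroup() is available; the sample data and the names group_ids and discounts are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
transactions = pd.DataFrame({
    "user": rng.choice(["U1", "U2", "U3"], 200),
    "item": rng.choice(["P1", "P2", "P3"], 200),
    "price": rng.choice([50, 90, 100], 200),
})

grouped = transactions.groupby(["user", "item"])

# ngroup() labels every row with the integer id (0 .. ngroups-1) of its group.
group_ids = grouped.ngroup().to_numpy()

# Draw one random discount per group, then broadcast the values to the rows
# with a single NumPy indexing operation instead of one assignment per group.
discounts = rng.random(grouped.ngroups)
transactions["suggested_discount"] = discounts[group_ids]
```

Note that DataFrame.set_value was deprecated in pandas 0.21 and removed in pandas 1.0, so on current versions the original loop would need .loc (or .at for a single cell) anyway. The column is written once here, which avoids the repeated per-group assignments that made the loop slow.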
