Fast iterative changes in pandas dataframe groups

Question

I have a large pandas dataframe that consists of users, the products that each user bought and product prices.

The code I am using is shown below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Create Dataframe randomly
product_list = ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10', 'P11', 'P12']
user_list = ['U1', 'U2', 'U3', 'U4', 'U5', 'U6', 'U7', 'U8', 'U9', 'U10']
price_list = [50, 90, 100]

# Create random transactions
transactions = pd.DataFrame(np.random.choice(user_list, 200))
transactions['item'] = pd.DataFrame(np.random.choice(product_list, 200))
transactions['quantity'] = 1
transactions['price'] = np.random.choice([50, 90, 100], 200)
transactions.columns = ['user', 'item', 'quantity', 'price']
transactions['suggested_price'] = 0

# Create groups to apply suggested discount
grouped = transactions.groupby(["user", "item"])

# Apply suggested discount
for key, group in grouped:
    transactions.set_value(
        group.index, 'suggested_discount', np.random.random())

My biggest problem with this code is the performance of the last block, which applies the suggested discount to each (user, item) group. The original dataframe has over 6 million rows.

Also, I noticed that the slowest step is changing the values for each group, i.e., the line:

transactions.set_value(
            group.index, 'suggested_discount', np.random.random())

In the original code there are other steps before this line of code.

I was not expecting that changing the values of the group columns would be so slow. Is there a better, faster implementation?

Thanks!

Answer 1

Suppose that instead of np.random.random() you have a function that takes the price and suggested_price columns as arguments. In that case, try applying it to those columns with apply, transform, or agg. Vectorized operations are much faster than a for loop.
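As a minimal sketch of the transform route, assuming a made-up discount rule (one minus the price divided by the highest price in the same user/item group), the per-group value can be computed without any Python-level loop. The small frame, the seed, and the rule itself are illustrative assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, fixed seed for repeatability)
rng = np.random.default_rng(0)
transactions = pd.DataFrame({
    "user": rng.choice(["U1", "U2", "U3"], 20),
    "item": rng.choice(["P1", "P2"], 20),
    "price": rng.choice([50, 90, 100], 20),
})

# transform returns one value per original row, aligned to the original index,
# so the result can be assigned straight back as a column
group_max = transactions.groupby(["user", "item"])["price"].transform("max")

# Hypothetical rule: discount relative to the highest price seen in the group
transactions["suggested_discount"] = 1 - transactions["price"] / group_max
```

Because transform keeps the original row order and index, there is no need to look up group.index for each group.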

For example, first set the user and item fields as the index; you can then assign a value computed from the grouped data directly to that new dataframe:

tr = transactions.set_index(["user", "item"])
# Sum price and suggested_price within each (user, item) group;
# the resulting Series is aligned to tr on the (user, item) index
tr["suggested_discount"] = transactions.groupby(["user", "item"])[["price", "suggested_price"]].sum().sum(axis=1)

In any case, the key is to avoid the for loop.
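For the exact case in the question, where every (user, item) group gets a single random value, one possible loop-free sketch is to draw one number per group and broadcast it to the rows through the group id. This assumes a pandas version that provides GroupBy.ngroup(); the sample frame and seed are illustrative only.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, fixed seed for repeatability)
rng = np.random.default_rng(42)
transactions = pd.DataFrame({
    "user": rng.choice(["U1", "U2", "U3", "U4"], 200),
    "item": rng.choice(["P1", "P2", "P3"], 200),
    "price": rng.choice([50, 90, 100], 200),
})

grouped = transactions.groupby(["user", "item"])

# ngroup() labels each row with the integer id of its group (0 .. ngroups - 1)
group_ids = grouped.ngroup()

# Draw one random value per group, then broadcast to rows by integer indexing
per_group_value = rng.random(grouped.ngroups)
transactions["suggested_discount"] = per_group_value[group_ids.to_numpy()]
```

The column is written once for the whole frame, rather than once per group as in the original loop.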

solution1
0 2017-07-03 22:39:07
