Pandas groupby and weighted sum for multiple columns

I've seen a dozen Pandas groupby questions involving multiple columns, but I'm at a loss on how to get this to run in a reasonable time. My goal is to group by a few columns and, for each resulting subset, apply np.dot to each remaining column against my weights:

# Example data:
import numpy as np
import pandas as pd

weights = np.array([.20, .60, .20])
data = pd.DataFrame([[0, "TX", 10, 55], [0, "TX", 5, 30], [0, "TX", 2, 75], [1, "TX", 4, 30], [1, "TX", 8, 100], [1, "TX", 2, 30]], columns=["sim", "state", "x1", "x2"])

print(data)
   sim state  x1   x2
0    0    TX  10   55
1    0    TX   5   30
2    0    TX   2   75
3    1    TX   4   30
4    1    TX   8  100
5    1    TX   2   30

I couldn't get np.dot to work out of the box, so I had to break the multiplication and summation into separate steps. Here's what I've tried, but on my dataset of a few million rows it takes ~2 minutes, not to mention being pretty unreadable:

results = data.groupby(["sim", "state"]).apply(lambda sdf: (sdf[["x1", "x2"]] * weights.reshape((3,1))).sum())

print(results.reset_index())
   sim state   x1    x2
0    0    TX  5.4  44.0
1    1    TX  6.0  72.0
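
For reference, np.dot can be used directly inside apply if each group is first converted to a plain NumPy array. The following is only a minimal sketch on the example data above; it assumes every group has exactly len(weights) rows, and it still calls a Python function once per group, so it is unlikely to be much faster than the lambda above on millions of rows.

```python
import numpy as np
import pandas as pd

weights = np.array([0.20, 0.60, 0.20])
data = pd.DataFrame(
    [[0, "TX", 10, 55], [0, "TX", 5, 30], [0, "TX", 2, 75],
     [1, "TX", 4, 30], [1, "TX", 8, 100], [1, "TX", 2, 30]],
    columns=["sim", "state", "x1", "x2"],
)

def weighted_dot(group):
    # group holds only x1/x2 for one (sim, state) pair, shape (3, 2).
    # np.dot of a (3,) vector with a (3, 2) array gives one value per column.
    return pd.Series(np.dot(weights, group.to_numpy()), index=group.columns)

results = data.groupby(["sim", "state"])[["x1", "x2"]].apply(weighted_dot)
print(results.reset_index())
```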

How about...

# df is the question's DataFrame (named data there)
(df.set_index(['sim', 'state'])
   .mul(np.tile(weights, len(df) // len(weights)), axis=0)
   .groupby(level=[0, 1])
   .sum())

            x1    x2
sim state           
0   TX     5.4  44.0
1   TX     6.0  72.0

How this works:

  • set the index to whatever should not be multiplied (df's primary keys, essentially)
  • use mul with axis=0 to multiply each row by its weight, broadcasting across the remaining columns
  • group on the index levels and sum the weighted values.
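
The three steps can also be written out one at a time, which makes the intermediate shapes easier to inspect. This is a sketch on the question's example data; the intermediate names (indexed, tiled, weighted) are only illustrative, and the last step is written as a groupby on the index levels.

```python
import numpy as np
import pandas as pd

weights = np.array([0.20, 0.60, 0.20])
df = pd.DataFrame(
    [[0, "TX", 10, 55], [0, "TX", 5, 30], [0, "TX", 2, 75],
     [1, "TX", 4, 30], [1, "TX", 8, 100], [1, "TX", 2, 30]],
    columns=["sim", "state", "x1", "x2"],
)

# Step 1: move the grouping keys into the index so only x1/x2 get multiplied.
indexed = df.set_index(["sim", "state"])

# Step 2: repeat the weights once per group (6 rows / 3 weights = 2 copies)
# and multiply row-wise; axis=0 lines the array up with the rows.
tiled = np.tile(weights, len(df) // len(weights))
weighted = indexed.mul(tiled, axis=0)

# Step 3: sum the weighted rows within each (sim, state) pair.
result = weighted.groupby(level=["sim", "state"]).sum()
print(result)
```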

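This works under the assumption that len(df) % len(weights) == 0, and more specifically that every group has exactly len(weights) rows, stored contiguously and in the same order as the weights.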
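
If the rows of a group are not guaranteed to sit next to each other, one possible variation is to look up each row's weight from its position within its group instead of tiling. The sketch below assumes the weights apply by row order inside each group (the same assumption the tiling makes) and that each group still has exactly len(weights) rows; the names keys, pos and row_weights are only illustrative.

```python
import numpy as np
import pandas as pd

weights = np.array([0.20, 0.60, 0.20])
# Same rows as the question, but with the two groups interleaved.
df = pd.DataFrame(
    [[1, "TX", 4, 30], [0, "TX", 10, 55], [0, "TX", 5, 30],
     [1, "TX", 8, 100], [0, "TX", 2, 75], [1, "TX", 2, 30]],
    columns=["sim", "state", "x1", "x2"],
)
keys = ["sim", "state"]

# Check the per-group assumption before doing any arithmetic.
sizes = df.groupby(keys).size()
assert (sizes == len(weights)).all(), "each group needs len(weights) rows"

# cumcount numbers the rows of each group 0, 1, 2 in order of appearance;
# indexing weights with it yields one weight per row.
pos = df.groupby(keys).cumcount().to_numpy()
row_weights = weights[pos]

result = (df.set_index(keys)
            .mul(row_weights, axis=0)
            .groupby(level=keys)
            .sum())
print(result)
```

The multiplication and the sum are still fully vectorized; the only extra cost is one cumcount pass over the keys.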
