
Finding subset of dataframe rows that maximize one column sum while limiting sum of another

I'm a beginner with pandas and Python, and I'm trying to select the 10 rows in a dataframe such that the following requirements are fulfilled:

  1. Only 1 of each category in a categorical column
  2. Maximize sum of a column
  3. While keeping sum of another column below a specified threshold

The concept I struggle with is how to do all of this at the same time. In this case, the goal is to select 10 rows resulting in a subset where sum of OPW is maximized, while the sum of salary remains below an integer threshold, and all strings in POS are unique. If it helps understanding the problem, I'm basically trying to come up with the baseball dream team on a budget, with OPW being the metric for how well the player performs and POS being the position I would assign them to. The current dataframe looks like this:

     playerID   OPW        POS  salary
87   bondsba01  62.061290  OF   8541667
439  heltoto01  41.002660  1B   10600000
918  thomafr04  38.107000  1B   7000000
920  thomeji01  37.385272  1B   6337500
68   berkmla01  36.210367  1B   10250000
785  ramirma02  35.785630  OF   13050000
616  martied01  32.906884  3B   3500000
775  pujolal01  32.727629  1B   13870949
966  walkela01  30.644305  OF   6050000
354  giambja01  30.440007  1B   3103333
859  sheffga01  29.090699  OF   9916667
511  jonesch06  28.383418  3B   10833333
357  gilesbr02  28.160054  OF   7666666
31   bagweje01  27.133545  1B   6875000
282  edmonji01  23.486406  CF   4500000
0    abreubo01  23.056375  RF   9000000
392  griffke02  22.965706  OF   8019599
...  ...        ...        ...  ...

If my team had only 3 players (an OF, a 1B, and a 3B) and a total salary threshold of $19,100,000, I would get the following team:

     playerID   OPW        POS  salary
87   bondsba01  62.061290  OF   8541667
918  thomafr04  38.107000  1B   7000000
616  martied01  32.906884  3B   3500000

The output would ideally be another dataframe with just the 10 rows that fulfill the requirements. The only solution I can think of is to bootstrap a bunch of teams (10 rows each) in which every row has a unique POS, remove the teams whose 'salary' sum is above the threshold, and then use sort_values() to rank the remaining teams by df.OPW.sum(). I'm not sure how to implement that, though. Perhaps there is a more elegant way to do this?

Edit: Changed dataframe to provide more information, added more context.
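The "build many teams, drop the ones over budget, keep the best" idea can be made exhaustive rather than random when the candidate pool is small. The following is only a minimal sketch under stated assumptions: it uses plain Python tuples instead of a dataframe, copies a handful of rows from the sample above, and the helper name best_team is hypothetical. It treats the threshold as "at or below"; use < for strictly below.

```python
from itertools import product

# (index label, playerID, OPW, POS, salary), copied from the sample rows above
players = [
    (87, "bondsba01", 62.061290, "OF", 8541667),
    (439, "heltoto01", 41.002660, "1B", 10600000),
    (918, "thomafr04", 38.107000, "1B", 7000000),
    (920, "thomeji01", 37.385272, "1B", 6337500),
    (785, "ramirma02", 35.785630, "OF", 13050000),
    (616, "martied01", 32.906884, "3B", 3500000),
    (354, "giambja01", 30.440007, "1B", 3103333),
    (511, "jonesch06", 28.383418, "3B", 10833333),
]


def best_team(players, positions, budget):
    """Try every one-player-per-position team; keep the affordable one with the highest OPW."""
    # Group the candidates by position so each team takes exactly one per position.
    by_pos = {pos: [p for p in players if p[3] == pos] for pos in positions}
    best, best_opw = None, float("-inf")
    for team in product(*(by_pos[pos] for pos in positions)):
        salary = sum(p[4] for p in team)
        opw = sum(p[2] for p in team)
        if salary <= budget and opw > best_opw:
            best, best_opw = team, opw
    return best  # None if no team fits the budget


team = best_team(players, ["OF", "1B", "3B"], 19_100_000)
print([p[1] for p in team])  # ['bondsba01', 'thomafr04', 'martied01']
```

With the real dataframe, the index labels of the chosen rows (the first tuple field here) could be passed to df.loc to get the result back as a dataframe. The number of teams grows as the product of the candidates per position, so this only scales to small pools.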

IIUC, you can use groupby and aggregate with sum:

df1 = df.groupby('category', as_index=False).sum()
print (df1)
  category  value  cost
0        A     70  2450
1        B     67  1200
2        C     82  1300
3        D     37  4500

Then filter by boolean indexing with a threshold:

tresh = 3000
df1 = df1[df1.cost < tresh]

Finally, get the top 10 values with nlargest:

# the sample uses the top 3; with the real data, set this to 10
print (df1.nlargest(3,columns=['value']))
  category  value  cost
2        C     82  1300
0        A     70  2450
1        B     67  1200

This is a linear programming problem (more precisely an integer one, since each player is either on the team or not). You're trying to maximize the team's total OPW with one player per POS, while the total salary across the entire team is subject to a constraint. You can't solve this with simple pandas operations, but PuLP could be used to formulate and solve it (see the Case Studies there for some examples).
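Written out, the problem could be stated roughly as follows. The notation is an assumption introduced here for illustration, not something from the question: x_i is 1 if player i is selected and 0 otherwise, B is the salary threshold, and n is the team size (10 in the question).

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i} \mathrm{OPW}_i \, x_i \\
\text{s.t.}\quad & \sum_{i} \mathrm{salary}_i \, x_i \le B, \\
& \sum_{i \,:\, \mathrm{POS}_i = p} x_i \le 1 \quad \text{for every position } p, \\
& \sum_{i} x_i = n, \\
& x_i \in \{0, 1\} \quad \text{for every player } i.
\end{aligned}
```

If the set of positions to fill is fixed in advance, as in the three-player example in the question, the per-position constraint becomes an equality (exactly one player for each required position) and the team-size constraint is then redundant.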

However, you could get closer to a manual solution by using pandas to group by (or sort by) POS and then either (1) sort by OPW descending and salary ascending, or (2) add some kind of "return on investment" column (OPW divided by salary, perhaps) and sort on that descending to find the players that give you the biggest bang for the buck in each position.
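Option (2) could look something like the sketch below. It is a heuristic only, written with plain Python tuples and a few rows copied from the question's sample; the helper name roi_team is hypothetical. It picks the best OPW-per-salary player at each position independently, so it never looks at the budget and is not guaranteed to find the best team.

```python
# (playerID, OPW, POS, salary), copied from the sample rows in the question
players = [
    ("bondsba01", 62.061290, "OF", 8541667),
    ("heltoto01", 41.002660, "1B", 10600000),
    ("thomafr04", 38.107000, "1B", 7000000),
    ("thomeji01", 37.385272, "1B", 6337500),
    ("ramirma02", 35.785630, "OF", 13050000),
    ("martied01", 32.906884, "3B", 3500000),
    ("giambja01", 30.440007, "1B", 3103333),
    ("jonesch06", 28.383418, "3B", 10833333),
]


def roi_team(players, positions):
    """For each position, pick the player with the highest OPW per salary dollar."""
    team = []
    for pos in positions:
        candidates = [p for p in players if p[2] == pos]
        team.append(max(candidates, key=lambda p: p[1] / p[3]))
    return team


team = roi_team(players, ["OF", "1B", "3B"])
print([p[0] for p in team])      # ['bondsba01', 'giambja01', 'martied01']
print(sum(p[3] for p in team))   # 15145000
```

On these rows the heuristic spends $15,145,000 for a total OPW of about 125.41, while the three-player team shown in the question reaches about 133.08 within the $19,100,000 threshold. The ratio is a reasonable way to shortlist candidates, but the result still has to be checked against the budget and may leave value unused.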

