As a beginner with pandas and Python, I'm trying to select 10 rows from a dataframe such that several requirements are fulfilled at the same time, and doing all of this at once is the part I struggle with. The goal is to select 10 rows so that the sum of OPW is maximized, while the sum of salary remains below an integer threshold and all strings in POS are unique. If it helps to understand the problem, I'm basically trying to come up with a baseball dream team on a budget, with OPW being the metric for how well a player performs and POS being the position I would assign them to. The current dataframe looks like this:
playerID OPW POS salary
87 bondsba01 62.061290 OF 8541667
439 heltoto01 41.002660 1B 10600000
918 thomafr04 38.107000 1B 7000000
920 thomeji01 37.385272 1B 6337500
68 berkmla01 36.210367 1B 10250000
785 ramirma02 35.785630 OF 13050000
616 martied01 32.906884 3B 3500000
775 pujolal01 32.727629 1B 13870949
966 walkela01 30.644305 OF 6050000
354 giambja01 30.440007 1B 3103333
859 sheffga01 29.090699 OF 9916667
511 jonesch06 28.383418 3B 10833333
357 gilesbr02 28.160054 OF 7666666
31 bagweje01 27.133545 1B 6875000
282 edmonji01 23.486406 CF 4500000
0 abreubo01 23.056375 RF 9000000
392 griffke02 22.965706 OF 8019599
... ... ... ...
If my team had only 3 people, with an OF, a 1B, and a 3B, and I had a total salary threshold of $19,100,000, I would get the following team:
playerID OPW POS salary
87 bondsba01 62.061290 OF 8541667
918 thomafr04 38.107000 1B 7000000
616 martied01 32.906884 3B 3500000
The output would ideally be another dataframe with just the 10 rows that fulfill the requirements. The only solution I can think of is to bootstrap a bunch of teams (10 rows each), with each row having a unique POS, remove the teams above the salary sum threshold, and then use sort_values() to rank the remaining teams by df.OPW.sum(). I'm not sure how to implement that, though. Perhaps there is a more elegant way to do this? Edit: Changed the dataframe to provide more information and added more context.
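The idea described in the question (build candidate teams with unique positions, drop the ones over budget, keep the one with the highest OPW sum) can be sketched with the standard library alone. This is only an illustration under some assumptions: it enumerates every possible team instead of sampling them randomly, it uses a handful of rows copied from the dataframe above as plain tuples, and `best_team` is a made-up helper name.

```python
from itertools import product

# (playerID, OPW, POS, salary): a few rows copied from the dataframe above
players = [
    ("bondsba01", 62.061290, "OF", 8541667),
    ("heltoto01", 41.002660, "1B", 10600000),
    ("thomafr04", 38.107000, "1B", 7000000),
    ("thomeji01", 37.385272, "1B", 6337500),
    ("ramirma02", 35.785630, "OF", 13050000),
    ("martied01", 32.906884, "3B", 3500000),
    ("giambja01", 30.440007, "1B", 3103333),
    ("jonesch06", 28.383418, "3B", 10833333),
]


def best_team(players, positions, salary_cap):
    """Try every one-player-per-position team; keep the best one under the cap."""
    by_pos = {pos: [p for p in players if p[2] == pos] for pos in positions}
    best, best_opw = None, float("-inf")
    for team in product(*(by_pos[pos] for pos in positions)):
        # Skip teams whose total salary is not strictly below the threshold
        if sum(p[3] for p in team) >= salary_cap:
            continue
        opw = sum(p[1] for p in team)
        if opw > best_opw:
            best, best_opw = team, opw
    return best


team = best_team(players, ["OF", "1B", "3B"], 19_100_000)
print([p[0] for p in team])  # ['bondsba01', 'thomafr04', 'martied01']
```

The number of teams to check is the product of the player counts per position, so this is only practical for small inputs; with 10 positions and many players per position it grows very quickly.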
IIUC, you can use groupby and aggregate with sum:
df1 = df.groupby('category', as_index=False).sum()
print (df1)
category value cost
0 A 70 2450
1 B 67 1200
2 C 82 1300
3 D 37 4500
Then filter by boolean indexing with a threshold:
tresh = 3000
df1 = df1[df1.cost < tresh]
And finally, get the top 10 values with nlargest:
# the sample uses the top 3; with real data, set this to 10
print (df1.nlargest(3,columns=['value']))
category value cost
2 C 82 1300
0 A 70 2450
1 B 67 1200
This is an integer linear programming problem: each player is either picked or not, and you're trying to maximize the team's total OPW while the total salary across the entire team is subject to a constraint and each POS is filled only once. You can't solve this with simple pandas operations, but PuLP could be used to formulate and solve it (see the Case Studies in its documentation for some examples).
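As a rough sketch of what such a formulation might look like, the 3-player example from the question can be written with one binary variable per player. The assumptions here: PuLP is installed (`pip install pulp`) together with its bundled CBC solver, the rows are a hand-copied subset of the question's dataframe, and the names `pick`, `positions`, and `salary_cap` are illustrative only.

```python
import pulp  # third-party package, not part of pandas

# (playerID, OPW, POS, salary): a few rows copied from the question
players = [
    ("bondsba01", 62.061290, "OF", 8541667),
    ("heltoto01", 41.002660, "1B", 10600000),
    ("thomafr04", 38.107000, "1B", 7000000),
    ("thomeji01", 37.385272, "1B", 6337500),
    ("ramirma02", 35.785630, "OF", 13050000),
    ("martied01", 32.906884, "3B", 3500000),
    ("giambja01", 30.440007, "1B", 3103333),
    ("jonesch06", 28.383418, "3B", 10833333),
]
positions = ["OF", "1B", "3B"]
salary_cap = 19_100_000

prob = pulp.LpProblem("dream_team", pulp.LpMaximize)

# pick[i] is 1 if player i is on the team, 0 otherwise
pick = pulp.LpVariable.dicts("pick", range(len(players)), cat="Binary")

# Objective: total OPW of the selected players
prob += pulp.lpSum(pick[i] * players[i][1] for i in pick)

# Salaries are integers, so "strictly below the cap" becomes "<= cap - 1"
prob += pulp.lpSum(pick[i] * players[i][3] for i in pick) <= salary_cap - 1

# Exactly one player for each required position
for pos in positions:
    prob += pulp.lpSum(pick[i] for i in pick if players[i][2] == pos) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))

team = [players[i][0] for i in pick if pick[i].value() > 0.5]
print(pulp.LpStatus[prob.status], team)
```

For the full problem, the same pattern would apply with all rows of the dataframe and the 10 required positions in `positions`.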
However, you could get closer to a manual solution by using pandas to group by (or sort by) POS and then either (1) sort by OPW descending and salary ascending, or (2) add some kind of "return on investment" column (OPW divided by salary, perhaps) and sort on that descending to find the players that give you the biggest bang for the buck in each position.
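Option (2) might look roughly like the following in pandas. It is a heuristic sketch, not an optimal solution: the column name `opw_per_million` is made up for illustration, the rows are a subset copied from the question, and the salary threshold is not enforced at all.

```python
import pandas as pd

# A few rows copied from the question's dataframe
df = pd.DataFrame(
    {
        "playerID": ["bondsba01", "heltoto01", "thomafr04", "thomeji01",
                     "ramirma02", "martied01", "giambja01", "jonesch06"],
        "OPW": [62.061290, 41.002660, 38.107000, 37.385272,
                35.785630, 32.906884, 30.440007, 28.383418],
        "POS": ["OF", "1B", "1B", "1B", "OF", "3B", "1B", "3B"],
        "salary": [8541667, 10600000, 7000000, 6337500,
                   13050000, 3500000, 3103333, 10833333],
    }
)

# "Return on investment": OPW bought per million dollars of salary
df["opw_per_million"] = df["OPW"] / (df["salary"] / 1e6)

# For each position, keep the row with the highest ratio
team = df.loc[df.groupby("POS")["opw_per_million"].idxmax()]

print(team[["playerID", "POS", "OPW", "salary"]])
print(team["salary"].sum())
```

On these rows the ratio favours giambja01 at 1B, which gives a cheaper team with a lower total OPW than the three-player team shown in the question, so the result still needs to be checked against the budget and compared with an exact method.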