
How to efficiently sample combinations of rows in a pandas DataFrame

Let's say I have a pandas DataFrame with a certain number of columns and rows. What I want to do is find the combination of 5 rows that, combined, yields the highest score in a particular column, given some threshold. Below is a little toy example to illustrate it better:

[image: toy example table]

Below is a simplified example of my code, and I am wondering if this "brute force" approach is a smart way to tackle this problem. Is there any chance to do it more efficiently, using other Python libraries, or are there tricks to run it faster? (I thought about Cython, but I think itertools is already implemented in C, so there wouldn't be much benefit.) Also, I wouldn't know how to use multiprocessing here, since itertools gives me a generator. I would welcome any discussions and ideas!

Thanks!

EDIT: Sorry, I forgot to mention that there is a second constraint: the combination of rows has to fit certain category criteria, e.g.,

  • 1x category a
  • 2x category b
  • 2x category c

So, to summarize the problem: I want to find the combination of k rows that maximizes a score s, given that the k rows belong to certain categories and their sum in a constraint column doesn't exceed a certain threshold.

from itertools import chain, combinations, product

# based on the suggested answer:
# sort by best score per constraint ratio:
r = df['score_column'] / df['constraint_column']
r = r.sort_values(ascending=False)
df = df.loc[r.index]


df_a = df[df['col1'] == some_criterion] # rows from category a
df_b = df[df['col2'] == some_criterion] # rows from category b
df_c = df[df['col3'] == some_criterion] # rows from category c

score = 0.0

for i in product(
            combinations(df_a.index, r=1), 
            combinations(df_b.index, r=2), 
            combinations(df_c.index, r=2)):

    indexes = set(chain.from_iterable(i))

    df_cur = df.loc[list(indexes)]

    constraint_sum = df_cur['constraint_column'].values.sum()
    if constraint_sum > some_threshold:
        continue

    new_score = df_cur['score_column'].values.sum()
    if new_score > score:
        score = new_score

    # based on the suggested answer:
    # break here, since it can't get any better once the threshold is exactly
    # met, because we sorted by the best score/constraint ratio beforehand.
    if constraint_sum == some_threshold:
        break

I think you can solve this by just taking the best rows based on the "score per constraint" metric:

constraint = 6  # whatever budget value you want here
df['s_per_c'] = df.score / df.constraint
df = df.sort_values('s_per_c', ascending=False)

total = 0
for i, r in df.iterrows():
    if r.constraint > constraint:
        continue
    constraint -= r.constraint
    total += r.score
    if constraint == 0:
        break

My logic here is that every time I take a row I want to make sure that I can afford it ("constraint") and that I'm getting the best bang for my buck ("s_per_c").
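For completeness, the greedy loop above can be run end to end without pandas. One caveat worth hedging: picking by ratio is the fractional-knapsack rule, so when whole rows must be taken it is a heuristic rather than a guaranteed optimum. The (score, constraint) numbers below are made up for illustration:

```python
# toy (score, constraint) rows; the greedy rule picks by score-per-constraint
rows = [(3.0, 2.0), (1.0, 1.0), (4.0, 3.0), (1.5, 1.0), (5.0, 4.0)]
budget = 6.0
total = 0.0
for score, cost in sorted(rows, key=lambda r: r[0] / r[1], reverse=True):
    if cost > budget:
        continue  # can't afford this row, try the next-best ratio
    budget -= cost
    total += score
    if budget == 0:
        break  # budget exactly used up; nothing more fits
print(total)  # 8.5
```

On this toy data the greedy pick happens to match the brute-force optimum, but adversarial inputs can make it miss the best combination, which is why the question's exhaustive search is still worth keeping as a fallback.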
