
Eliminate rows in a dataframe after looping through all rows of column and applying condition

This is my data frame with 5 rows and 3 columns:

import pandas as pd

df = pd.DataFrame({'A': [1,4,4,3,7], 'B': [1,2,2,6,4], 'C': [1,2,2,6,4]})

I need a way to drop a row if some other row has a higher value in column A and, at the same time, a lower value in column B than the row being checked.

For example, in the data frame above, row 4 (counting from 1; A=3, B=6) has to be dropped because other rows have higher values in column A (4, 7) with lower B values (2, 4).

I will rephrase the question from an application perspective for better clarity. Sorry for my poor presentation skills.

Let's say this is our dataframe:

df = pd.DataFrame({'resources': [100,200,300,300,400,400,400,500,1000],
                   'score': [1,2,1,2,3,5,6,8,9]})

I want to find a trade-off between the resources I use and my score. My priority is to get the best score with the fewest resources. I iterate over all combinations and check whether each row is eligible to be kept. Of these 9 rows (counting from 1), rows 3, 4, 5 and 6 should be eliminated: row 3 because row 1 gives the same score with fewer resources, row 4 because row 2 gives the same score with fewer resources, and rows 5 and 6 because row 7 gives a better score with the same resources. I hope this makes my problem clearer.

Starting from:

    A   B   C
0   1   1   1
1   4   2   2
2   4   2   2
3   3   6   6
4   7   4   4

You can do:

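# True where some other row has a strictly larger A
mask_a = df.A.apply(lambda x: (df.A > x).any())
# True where no row has a larger B, i.e. this row's B is the column maximum
mask_b = df.B.apply(lambda x: (df.B <= x).all())

# keep the rows that do not satisfy both conditions
df[~(mask_a & mask_b)]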

# Output
    A   B   C
0   1   1   1
1   4   2   2
2   4   2   2
4   7   4   4
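
Note that the two masks above are computed independently of each other, so they do not check that the higher A and the lower B come from the same other row. If the condition has to hold pairwise, one option is a NumPy broadcast comparison. This is only a minimal sketch, assuming the frame is small enough for an n × n boolean matrix to fit in memory; the names `beaten` and `result` are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 4, 3, 7], 'B': [1, 2, 2, 6, 4], 'C': [1, 2, 2, 6, 4]})

a = df['A'].to_numpy()
b = df['B'].to_numpy()

# beaten[i, j] is True when row j has a higher A and a lower B than row i
beaten = (a[None, :] > a[:, None]) & (b[None, :] < b[:, None])

# drop every row that is beaten by at least one other row
result = df[~beaten.any(axis=1)]
```

On the sample data this keeps the same rows as the mask version (index 0, 1, 2 and 4), but the two approaches can disagree on other inputs.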

"I want to find a trade-off between the resources I use and my score. My priority is to get the best score with the fewest resources."

Ok, then I'd do this:

df["ratio"] = df.score / df.resources
df.sort_values("ratio", ascending=False)

Which gives you:

    resources   score   ratio
7   500         8       0.016000
6   400         6       0.015000
5   400         5       0.012500
0   100         1       0.010000
1   200         2       0.010000
8   1000        9       0.009000
4   400         3       0.007500
3   300         2       0.006667
2   300         1       0.003333

Now you can see the best trade-offs at the top and worst ones at the bottom.

IIUC, you want to get the max score per resource and the min resource per score simultaneously.

You can compute both and get the intersection:

import numpy as np

idx1 = df.groupby('resources')['score'].idxmax().values
# array([0, 1, 3, 6, 7, 8])

idx2 = df.groupby('score')['resources'].idxmin().values
# array([0, 1, 4, 5, 6, 7, 8])

out = df.loc[np.intersect1d(idx1, idx2)]

output:

   resources  score
0        100      1
1        200      2
6        400      6
7        500      8
8       1000      9

used input:

df = pd.DataFrame({'resources': [100,200,300,300,400,400,400,500,1000],
                   'score': [1,2,1,2,3,5,6,8,9]})
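
One caveat: the intersection only compares rows that share the same resources value or the same score, so a row such as (200, 3) would survive next to (100, 5) even though it costs more and scores less. If rows like that should also go, a sort followed by a running maximum might be closer to what is wanted. This is a rough sketch, not a definitive implementation. It assumes a row should be dropped whenever a row with equal or fewer resources reaches an equal or better score, and the names `ordered`, `prev_best` and `pareto` are invented here:

```python
import pandas as pd

df = pd.DataFrame({'resources': [100, 200, 300, 300, 400, 400, 400, 500, 1000],
                   'score': [1, 2, 1, 2, 3, 5, 6, 8, 9]})

# cheapest rows first; within the same resources, best score first
ordered = df.sort_values(['resources', 'score'], ascending=[True, False])

# best score seen among all earlier rows (NaN for the very first row)
prev_best = ordered['score'].cummax().shift()

# keep a row only if it strictly improves on everything before it
keep = prev_best.isna() | (ordered['score'] > prev_best)
pareto = ordered[keep].sort_index()
```

For the input above this gives the same five rows (index 0, 1, 6, 7 and 8). Exact duplicate rows are reduced to their first occurrence, which may or may not be what you want.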
