
Pandas - find and iterate rows with matching values in multiple columns and multiply value in another column

This question builds on my previous one.

I edited the table so that it causes less confusion.

First, suppose we have the dataframe below:

import pandas as pd

# numeric columns are stored as numbers so that D can be multiplied
data = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                     'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'foo', 'bar'],
                     'C': [10, 10, 10, 50, 50, 50, 50, 8, 10, 20],
                     'D': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]})

It looks like this:

     A   C   D  id
0  foo  10  10   1
1  bar  10   9   2
2  foo  10   8   3
3  bar  50   7   4
4  foo  50   6   5
5  bar  50   5   6
6  foo  50   4   7
7  foo   8   3   8
8  foo  10   2   9
9  bar  20   1  10

What I would like to do is find matching rows and then do a calculation on them.

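# compare every unordered pair of rows (x, y)
from itertools import combinations
for (_, x), (_, y) in combinations(data.iterrows(), 2):
    if x.A == y.A and x.C == y.C:
        # int() so this also works when D is stored as strings
        result = int(x.D) * int(y.D)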

and then generate a new dataframe with three columns: 'id', 'A' and 'result'.

@Jon Clements answered my previous question with the very neat code below:

df.merge(
    df.groupby(['A', 'C']).D.agg(['prod', 'count'])
    [lambda r: r['count'] > 1],
    left_on=['A', 'C'],
    right_index=True
)
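For reference, here is a minimal runnable sketch of what that snippet produces on the sample table. It assumes the numeric columns are stored as integers rather than strings, and the names `stats` and `merged` are only illustrative. It shows that the three foo/10 rows (ids 1, 3 and 9) all share one product, 160, with a count of 3.

```python
import pandas as pd

# Sample table from the question, with numeric columns as ints (assumption).
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'foo', 'bar'],
    'C': [10, 10, 10, 50, 50, 50, 50, 8, 10, 20],
    'D': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
})

# Product of D and number of rows for every (A, C) group.
stats = df.groupby(['A', 'C'])['D'].agg(['prod', 'count'])

# Keep groups with more than one row and attach their stats to each row.
merged = df.merge(stats[stats['count'] > 1],
                  left_on=['A', 'C'], right_index=True)

print(merged[['id', 'A', 'prod', 'count']])
```

Every row of a group gets the product over the whole group, which is why a group of three rows cannot yield pairwise results this way.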

New goal:

Now I am wondering whether there is a way to avoid reusing row_a once it has been matched with row_b. In other words, I treat two matching rows as a pair. Once row_a and row_b have become a pair, later iterations ignore row_a (but not row_b, until row_b has been matched with another row).

Taking groupby().agg(['prod', 'count']) as an example, I would like the 'count' of every generated result to be 2 (not just a filter on ['count'] == 2). I don't think this can be done with groupby(), so I am thinking that a mechanism like a for-loop may solve it. Or is there a better method?

So the expected result is now the table below. Because id 1 and id 3 have become a pair, id 1 is not also aggregated with id 9, and in the remaining iterations id 3 is not matched with id 1 again. So the result in row one is 80, not 160, and row two is not 160 either:

     id   A   result   
0    1   foo   80   
1    3   foo   16
2    4   bar   35
3    5   foo   24
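The pairing rule described above can be sketched with a plain loop. This is only an illustration under assumptions: numeric columns are stored as integers, rows are paired in ascending id order within each (A, C) group, and the names `rows` and `paired` are made up for the example.

```python
import pandas as pd

# Sample table from the question, with numeric columns as ints (assumption).
data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'foo', 'bar'],
    'C': [10, 10, 10, 50, 50, 50, 50, 8, 10, 20],
    'D': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
})

rows = []
# Walk each (A, C) group in id order and pair every row with the next one.
for (a, c), grp in data.sort_values('id').groupby(['A', 'C']):
    ids = grp['id'].tolist()
    ds = grp['D'].tolist()
    for i in range(len(ids) - 1):
        # The earlier row of the pair keeps the result; the last row of a
        # group has no partner and produces no result.
        rows.append({'id': ids[i], 'A': a, 'result': ds[i] * ds[i + 1]})

paired = pd.DataFrame(rows).sort_values('id').reset_index(drop=True)
print(paired)
```

With the sample data this yields results for ids 1, 3, 4 and 5 only, since each of the other rows is either the last row of its group or has no match at all.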

My English is not that good, so I am not sure whether I have explained my question clearly. Ask me about anything that is not clear.

Thanks for any help.

This is a bit long-winded and nowhere near as elegant as Jon Clements' solution to your first problem, but it works without a for-loop.

# df is the question's dataframe (named data there)
# sort values by A, C, id so that matching rows become adjacent
df = df.sort_values(['A', 'C', 'id'])
# find rows whose A and C equal those of the next row (shifted up by 1)
s = (df[['A', 'C']] == df[['A', 'C']].shift(-1)).all(axis=1)

# create a new series: for each such row, multiply its D by the D of the
# next row - since it's sorted, that is the next A, C match
d = df['D'].astype(int)  # the question stores D as strings
new_d = (d * d.shift(-1))[s].astype(int)
new_d.name = 'results'

print(new_d)

Output:

3    35
0    80
2    16
4    24
Name: results, dtype: int64

Taking the above, we simply create a new column in df and assign new_d to it:

# create a new column in df holding new_d (values are aligned on the index)
df['results'] = new_d

df.dropna()[['id','A','results']].sort_values('id')

Output:

    id  A   results
0   1   foo 80.0
2   3   foo 16.0
3   4   bar 35.0
4   5   foo 24.0
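A more compact variant of the same idea, offered as a sketch rather than as part of the original answer: `groupby(...).shift(-1)` fetches the next D inside each (A, C) group directly, so no explicit comparison of neighbouring rows is needed. It assumes the numeric columns are integers, and `next_d` and `out` are illustrative names.

```python
import pandas as pd

# Sample table from the question, with numeric columns as ints (assumption).
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'foo', 'bar'],
    'C': [10, 10, 10, 50, 50, 50, 50, 8, 10, 20],
    'D': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
})

# Sort so that, inside each group, rows are in id order.
df = df.sort_values(['A', 'C', 'id'])

# D of the following row in the same (A, C) group; NaN for a group's last row.
next_d = df.groupby(['A', 'C'])['D'].shift(-1)

out = (
    df.assign(results=df['D'] * next_d)
      .dropna(subset=['results'])      # drop rows without a partner
      .sort_values('id')[['id', 'A', 'results']]
)
print(out)
```

Because the shift happens within each group, a row is never multiplied with a row from a different (A, C) combination.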
