
Iterate over the rows of a dataframe and set a conditional flag value in Python

I'm doing a cross-check between 2 dataframes to assign a value to a flag. If a specific key is present in both dataframes with a different value, the flag is set to "change" for that row. If the value is the same, the flag is set to "no change". However, if a specific key is present more than once in only one of the 2 dataframes, the extra rows get the flag "additional change". Let me give an example to make it clearer:

df 1 :

key   value  key value present in df 2
abcd      1  False
wxyz      5  True

df 2:

key   value  key value present in df 1
abcd      2  False
wxyz      5  True

The result for dataframe 1 will then be:

df 1 :

key   value  key value present in df 2  xcheck_flag
abcd      1  False                      change
wxyz      5  True                       no change

To get this result I use the following logic:

def changeType(df1):
    def condition_check(row):
        # The key/value pair is missing from df 2, so the value changed.
        if not row['key value present in df 2']:
            return 'change'
        else:
            return 'no change'
    # Adds the flag column to df1 in place.
    df1['xcheck_flag'] = df1.apply(condition_check, axis=1)

Now this is rather straightforward, right? Well, I have a complication that I haven't been able to solve yet.

Imagine the following use case :

df 1 :

key   value  key value present in df 2
abcd      1  False
wxyz      5  True
abcd      3  False

df 2:

key   value  key value present in df 1
abcd      2  False
wxyz      5  True

In this case, the key abcd appears twice in df 1 and only once in df 2. If this happens, I need to apply the following logic when doing the cross-dataframe check: the first time the key is matched against dataframe 2, set the flag to "change" as in the previous case; every later time the same key is matched, set the flag to "additional change". It doesn't matter which row from df 1 gets "change" and which gets "additional change". The only condition is that, in such a case, exactly one key-value pair is assigned "change" and all the others for that key are assigned "additional change".

This gives us:

df 1 :

key   value  key value present in df 2  xcheck_flag
abcd      1  False                      change
wxyz      5  True                       no change
abcd      3  False                      additional change

I've been trying to adapt my initial function to include this behaviour, but without success.

Any hints would be very welcome.

I would probably do something like this:

import pandas as pd

df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})
df1['key_duplicated'] = df1.duplicated('key', keep='first')
df3 = df1.join(df2.set_index(['key']), rsuffix='_2', on=['key'])

This gives you a dataframe that, I think, contains all the columns you need to calculate the flags you're interested in:

    key    value  key_duplicated  value_2
0  abcd        1           False        2
1  wxyz        5           False        5
2  abcd        3            True        2

Note that if a key is not present in df2, value_2 will be NaN for that row.
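The joined frame still needs the flag column itself. One possible way to derive it is numpy.select, which checks a list of conditions in order. This is only a sketch: it assumes keys are unique in df2 (otherwise the join would duplicate rows), and it lets keys that are missing from df2 fall through to "change", which is an arbitrary choice the question does not cover.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})

# Same preparation as above: mark repeated keys, then bring in df2's value.
df1['key_duplicated'] = df1.duplicated('key', keep='first')
df3 = df1.join(df2.set_index('key'), on='key', rsuffix='_2')

# Conditions are evaluated in order; the first one that holds wins.
conditions = [
    df3['key_duplicated'],           # 2nd, 3rd, ... occurrence of a key
    df3['value'] == df3['value_2'],  # first occurrence, same value in df2
]
choices = ['additional change', 'no change']

# Anything else (different value, or key absent from df2) becomes 'change'.
df3['xcheck_flag'] = np.select(conditions, choices, default='change')

print(df3[['key', 'value', 'xcheck_flag']])
```

Because the duplicate check comes first, a repeated key is always flagged "additional change", whatever its value is.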

One solution could be to use a dictionary that counts how many times each key has already been seen:

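# Number of times each key has already been seen while walking through df1.
seen = {k: 0 for k in df1['key'].tolist()}

def check(key, value, df2):
    if seen[key] > 0:
        # The key was already matched once: every further row is an extra.
        flag = 'additional change'
    else:
        # First occurrence: compare against the value stored in df2.
        # Note: this raises IndexError if the key is missing from df2.
        if value == df2[df2['key'] == key]['value'].tolist()[0]:
            flag = 'no change'
        else:
            flag = 'change'
    seen[key] += 1
    return flag

df1['flag'] = df1.apply(lambda row: check(row['key'], row['value'], df2), axis=1)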
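The same counting idea can probably be written without a row-wise apply by letting pandas number the occurrences of each key. The sketch below assumes keys are unique in df2. The names occurrence and value_in_df2 are only illustrative, and rows whose key is absent from df2 end up as "change" here.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})

# 0 for the first occurrence of each key in df1, 1 for the second, and so on.
occurrence = df1.groupby('key').cumcount()

# Look up df2's value for every row of df1 (NaN when the key is not in df2).
value_in_df2 = df1['key'].map(df2.set_index('key')['value'])

# Start from 'change', then overwrite where a more specific rule applies.
flag = pd.Series('change', index=df1.index)
flag = flag.mask(df1['value'] == value_in_df2, 'no change')
flag = flag.mask(occurrence > 0, 'additional change')

df1['flag'] = flag
print(df1)
```

The second mask is applied last so that repeated keys are flagged "additional change" regardless of their value, which matches the behaviour of the dictionary version.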
