
Apply a function on elements in a Pandas column, grouped on another column

I have a dataset with several columns. What I want is to calculate a similarity score based on a particular column ("fName"), but grouped on the "_id" column.

         _id      fName        lName    age
0       ABCD     Andrew       Schulz    
1       ABCD    Andreww                  23
2       DEFG       John          boy
3       DEFG      Johnn          boy     14
4       CDGH        Bob        TANNA     13
5       ABCD.     Peter        Parker    45
6       DEFGH     Clark          Kent    25

So what I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score value. For example, if I run it for the column "fName", I should be able to reduce this DataFrame, based on a score threshold, to:

         _id      fName        lName   age
0       ABCD     Andrew       Schulz    23
2       DEFG       John          boy    14
4       CDGH        Bob        TANNA    13
5       ABCD      Peter       Parker    45
6       DEFG      Clark         Kent    25

I intend to use pyjarowinkler. If I had two independent columns to check (without all the group-by stuff), this is how I would use it:

    from pyjarowinkler import distance

    df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'], df['name_2'])]
    df = df[df['score'] > 0.87]

Can someone suggest a Pythonic and fast way of doing this?
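
To make the grouping part clearer, here is a rough sketch of the kind of per-group pairwise comparison I have in mind (probably neither fast nor idiomatic, which is why I am asking), using pyjarowinkler as above:

    # Rough sketch: compare every pair of fName values inside each _id group
    # with pyjarowinkler and drop the later row of any pair scoring above the
    # threshold. Column names are the ones from the DataFrame shown above.
    import itertools
    from pyjarowinkler import distance

    def drop_similar(df, col='fName', threshold=0.87):
        to_drop = set()
        for _, group in df.groupby('_id'):
            for (i, a), (j, b) in itertools.combinations(group[col].items(), 2):
                if distance.get_jaro_distance(a, b) > threshold:
                    to_drop.add(j)  # keep the earlier row, drop the later one
        return df.drop(index=list(to_drop))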

UPDATE

So, I have tried using the recordlinkage library for this, and I have ended up with a DataFrame called 'matches' containing pairs of indexes that are similar. Now I just want to combine the data from those rows.

This is how matches looks:

index1   index2          fName
0           1             1.0
2           3             1.0

I need someone to suggest a way to combine the similar rows, so that the merged row takes the data from both of them.
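
To make "combine" concrete, something along these lines is what I am after (just a sketch; it assumes matches really has the 'index1'/'index2' columns shown above and that empty strings mark missing values):

    # Sketch of "combining": for each matched pair, fill the gaps of the first
    # row with values from the second row, then drop the second row.
    import numpy as np

    df = df.replace('', np.nan)
    for i, j in zip(matches['index1'], matches['index2']):
        # combine_first takes the missing values of row i from row j
        df.loc[i] = df.loc[i].combine_first(df.loc[j])
    df = df.drop(index=matches['index2'].unique())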

I just wanted to clear some doubts regarding your question; I couldn't ask them in the comments due to low reputation.

Regarding this part of your question: "if I run it for the column 'fName', I should be able to reduce this DataFrame based on a score threshold".

So basically, would your function return the DataFrame containing the first row in each group (grouped by _id)? That would result in the DataFrame listed below:

         _id      fName        lName   age
0       ABCD     Andrew       Schulz    23
2       DEFG       John          boy    14
4       CDGH        Bob        TANNA    13

I hope the code below answers your question:

# Each row: [_id, fName, lName, age]
r0 = ['ABCD', 'Andrew', 'Schulz', ''  ]
r1 = ['ABCD', 'Andrew', ''      , '23']
r2 = ['DEFG', 'John'  , 'boy'   , ''  ]
r3 = ['DEFG', 'John'  , 'boy'   , '14']
r4 = ['CDGH', 'Bob'   , 'TANNA' , '13']

Rx = [r0, r1, r2, r3, r4]

print(Rx)
print()

# Keep one row per _id, filling in missing lName/age values from later rows
merged = {}

for row in Rx:
    if row[0] in merged:
        if row[2] != '':
            merged[row[0]][2] = row[2]
        if row[3] != '':
            merged[row[0]][3] = row[3]
    else:
        merged[row[0]] = row

Rx[:] = merged.values()

print(Rx)
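
If I traced the logic right, the second print should show one merged row per _id, something like:

    [['ABCD', 'Andrew', 'Schulz', '23'], ['DEFG', 'John', 'boy', '14'], ['CDGH', 'Bob', 'TANNA', '13']]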

I am lost with the 'score' part of your question, but if what you need is to fill the gaps in the data with values from other rows and then drop the duplicates by id, maybe this can help:

df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')

First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data, and finally drop duplicates, keeping the first occurrence of each Id. fillna fills each gap with the next value found in the column, which may belong to another Id, but since you will discard the duplicated rows anyway, I believe drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id.)
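
As a side note, if the possibility of values bleeding in from a different Id is a concern, a grouped variant (just an alternative sketch, not what I tested below) would be to take the first non-null value per column within each Id group:

    # Alternative sketch: first non-null value per column within each Id group,
    # so values never leak from one Id to another
    df_filled = df.replace('', np.nan).groupby('Id', as_index=False).first()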

I've tested the replace/bfill approach with this dataset and code:

import numpy as np
import pandas as pd

data = [
    ['AABBCC', 'Andrew', '',],
    ['AABBCC', 'Andrew', 'Schulz'],
    ['AABBCC', 'Andrew', '', 23],
    ['AABBCC', 'Andrew', '',],
    ['AABBCC', 'Andrew', '',],
    ['DDEEFF', 'Karl', 'boy'],
    ['DDEEFF', 'Karl', ''],
    ['DDEEFF', 'Karl', '', 14],
    ['GGHHHH', 'John', 'TANNA', 13],
    ['HLHLHL', 'Bob', ''],
    ['HLHLHL', 'Bob', ''],
    ['HLHLHL', 'Bob', 'Blob'],
    ['HLHLHL', 'Bob', 'Blob', 15],
    ['HLHLHL', 'Bob','', 15],
    ['JLJLJL', 'Nick', 'Best', 20],
    ['JLJLJL', 'Nick', '']
]

df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])

df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')

Output:

    Id      fName   lName   Age
0   AABBCC  Andrew  Schulz  23.0
5   DDEEFF  Karl    boy     14.0
8   GGHHHH  John    TANNA   13.0
9   HLHLHL  Bob     Blob    15.0
14  JLJLJL  Nick    Best    20.0

Hope this helps and apologies if I misunderstood the question.
