

Apply a function on elements in a Pandas column, grouped on another column

I have a dataset with several columns. What I want is to calculate a score based on a particular column ("name"), but grouped on the "id" column.

         _id      fName        lName    age
0       ABCD     Andrew       Schulz    
1       ABCD    Andreww                  23
2       DEFG       John          boy
3       DEFG      Johnn          boy     14
4       CDGH        Bob        TANNA     13
5       ABCD.     Peter        Parker    45
6       DEFGH     Clark          Kent    25

So what I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score value. For example, if I run it for the column "fName", I should be able to reduce this dataframe, based on a score threshold, to:

         _id      fName        lName   age
0       ABCD     Andrew       Schulz    23
2       DEFG       John          boy    14
4       CDGH        Bob        TANNA    13
5       ABCD      Peter       Parker    45
6       DEFG      Clark         Kent    25

I intend to use pyjarowinkler. If I had two independent columns to check (without all the group-by logic), this is how I would use it:

    from pyjarowinkler import distance

    df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'], df['name_2'])]
    df = df[df['score'] > 0.87]

Can someone suggest a Pythonic and fast way of doing this?
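For reference, this is a rough, naive sketch of the per-group pairwise comparison I could write myself (the function name pairwise_scores is just a placeholder, and it assumes empty strings have already been replaced with NaN). I am hoping for something more Pythonic and faster than this:

    from itertools import combinations

    import pandas as pd
    from pyjarowinkler import distance

    # Score every pair of fName values within the same _id group.
    def pairwise_scores(group, col='fName'):
        rows = []
        for (i, a), (j, b) in combinations(group[col].dropna().items(), 2):
            rows.append({'index1': i, 'index2': j,
                         'score': distance.get_jaro_distance(str(a), str(b))})
        return pd.DataFrame(rows)

    scores = df.groupby('_id', group_keys=False).apply(pairwise_scores)
    matches = scores[scores['score'] > 0.87]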

UPDATE

So, I have tried using the recordlinkage library for this, and I have ended up with a dataframe called 'matches' that contains pairs of indexes that are similar. Now I just want to combine the data.

This is how matches looks:

index1   index2          fName
0           1             1.0
2           3             1.0

I need someone to suggest a way to combine the similar rows so that the combined row takes data from all of the similar rows.
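Something along these lines is what I have in mind, but I am not sure it is the right (or a fast) way to do it. This is only a sketch, assuming 'index1'/'index2' are accessible as columns of matches (use matches.reset_index() first if they are the MultiIndex) and that empty strings have already been replaced with NaN:

    merged = df.copy()
    drop_idx = set()

    # For each matched pair, fill the gaps in the first row with values from
    # the second row, then mark the second row for removal.
    for i1, i2 in zip(matches['index1'], matches['index2']):
        merged.loc[i1] = merged.loc[i1].combine_first(merged.loc[i2])
        drop_idx.add(i2)

    merged = merged.drop(index=list(drop_idx))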

Just wanted to clear some doubts regarding your question. I couldn't ask them in comments due to low reputation.

Like here, if I run it for col "fName", I should be able to reduce this dataframe based on a score threshold:

So basically your function would return the DataFrame containing the first row in each group (by ID)? That would result in the resultant DataFrame listed below (see the short sketch after the table):

         _id      fName        lName   age
0       ABCD     Andrew       Schulz    23
2       DEFG       John          boy    14
4       CDGH        Bob        TANNA    13
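If that is indeed what you want, then, once empty strings are replaced with NaN, a hedged sketch could be as simple as a groupby plus first(), since GroupBy.first() takes the first non-null value per column within each group:

import numpy as np

# Sketch only: first() picks the first non-null value per column in each _id group.
result = df.replace('', np.nan).groupby('_id', as_index=False).first()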

I hope this code answers your question.

r0 =['ABCD','Andrew','Schulz',  ''  ]
r1 =['ABCD','Andrew',   ''   , '23' ]
r2 =['DEFG','John'  ,'boy'   , ''   ]
r3 =['DEFG','John'  ,'boy'   , '14' ]
r4 =['CDGH','Bob'   ,'TANNA' , '13' ]

Rx =[r0,r1,r2,r3,r4]

print(Rx)
print()

Dict = dict()

# Merge rows by id: keep the first row seen for each id and fill in any
# missing lName/age values from later rows with the same id.
for i in Rx:
    if i[0] in Dict:
        if i[2] != '':
            Dict[i[0]][2] = i[2]
        if i[3] != '':
            Dict[i[0]][3] = i[3]
    else:
        Dict[i[0]] = i

Rx[:] = Dict.values()

print(Rx)

I am lost with the 'score' part of your question, but if what you need is to fill the gaps in the data with values from other rows and then drop the duplicates by id, maybe this can help:

df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')

First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data. Then drop duplicates, keeping the first occurrence of each Id. fillna will fill values from the next value found in the column, which may correspond to another Id, but since you will discard the duplicated rows, drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id.)

I've tested with this dataset and code:

import pandas as pd
import numpy as np

data = [
    ['AABBCC', 'Andrew', '',],
    ['AABBCC', 'Andrew', 'Schulz'],
    ['AABBCC', 'Andrew', '', 23],
    ['AABBCC', 'Andrew', '',],
    ['AABBCC', 'Andrew', '',],
    ['DDEEFF', 'Karl', 'boy'],
    ['DDEEFF', 'Karl', ''],
    ['DDEEFF', 'Karl', '', 14],
    ['GGHHHH', 'John', 'TANNA', 13],
    ['HLHLHL', 'Bob', ''],
    ['HLHLHL', 'Bob', ''],
    ['HLHLHL', 'Bob', 'Blob'],
    ['HLHLHL', 'Bob', 'Blob', 15],
    ['HLHLHL', 'Bob','', 15],
    ['JLJLJL', 'Nick', 'Best', 20],
    ['JLJLJL', 'Nick', '']
]

df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])

df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')

Output:

    Id      fName   lName   Age
0   AABBCC  Andrew  Schulz  23.0
5   DDEEFF  Karl    boy     14.0
8   GGHHHH  John    TANNA   13.0
9   HLHLHL  Bob     Blob    15.0
14  JLJLJL  Nick    Best    20.0

Hope this helps, and apologies if I misunderstood the question.
