Apply a function on elements in a Pandas column, grouped on another column
I have a dataset with several columns. What I want is essentially to calculate a score based on a particular column ("fName"), but grouped on the "_id" column.
  _id    fName    lName   age
0 ABCD   Andrew   Schulz
1 ABCD   Andreww          23
2 DEFG   John     boy
3 DEFG   Johnn    boy     14
4 CDGH   Bob      TANNA   13
5 ABCD.  Peter    Parker  45
6 DEFGH  Clark    Kent    25
So what I am looking for is whether, for the same id, there are similar entries, so that I can remove those entries based on a threshold score. For example, if I run it for the column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
  _id   fName   lName   age
0 ABCD  Andrew  Schulz  23
2 DEFG  John    boy     14
4 CDGH  Bob     TANNA   13
5 ABCD  Peter   Parker  45
6 DEFG  Clark   Kent    25
I intend to use pyjarowinkler. If I had two independent columns to check (without all the group-by logic), this is how I would use it:
from pyjarowinkler import distance

df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'], df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
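For what it's worth, here is one way the group-wise thresholding might be sketched. Since pyjarowinkler may not be installed, `difflib.SequenceMatcher.ratio` is used as a stand-in similarity function; `similarity`, `dedupe_group`, and the example frame are illustrative names I made up, not part of any library:

```python
from difflib import SequenceMatcher

import pandas as pd

def similarity(a, b):
    # Stand-in for pyjarowinkler's distance.get_jaro_distance.
    return SequenceMatcher(None, a, b).ratio()

def dedupe_group(group, col='fName', threshold=0.87):
    # Keep one representative row per cluster of similar names.
    keep = []
    for idx, name in group[col].items():
        # Drop this row if it is too similar to a row already kept.
        if all(similarity(name, group.at[k, col]) <= threshold for k in keep):
            keep.append(idx)
    return group.loc[keep]

df = pd.DataFrame({
    '_id':   ['ABCD', 'ABCD', 'DEFG', 'DEFG', 'CDGH'],
    'fName': ['Andrew', 'Andreww', 'John', 'Johnn', 'Bob'],
})

result = (df.groupby('_id', group_keys=False)
            .apply(dedupe_group)
            .sort_index())
print(result)
```

With the example data, 'Andreww' and 'Johnn' score above 0.87 against the first name in their group and are dropped, leaving rows 0, 2, and 4.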
UPDATE: I have tried using the record linkage library for this, and I have ended up with a dataframe called 'matches' containing pairs of indexes that are similar. Now I just want to combine the data.
This is how matches looks:

index1  index2  fName
0       1       1.0
2       3       1.0
I need someone to suggest a way to combine the similar rows, taking data from both rows of each matched pair.
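Assuming `matches` holds pairs of row labels, one hedged sketch of the combining step uses `Series.combine_first` to keep each pair's first row and fill its gaps from the second (the example frame is made up to mirror the question, not taken from the record linkage output):

```python
import pandas as pd

df = pd.DataFrame({
    '_id':   ['ABCD', 'ABCD', 'DEFG', 'DEFG'],
    'fName': ['Andrew', 'Andreww', 'John', 'Johnn'],
    'lName': ['Schulz', None, 'boy', 'boy'],
    'age':   [None, 23, None, 14],
})
matches = pd.DataFrame({'index1': [0, 2], 'index2': [1, 3]})

# For each matched pair keep the first row, filling its
# missing values from the second row of the pair.
combined = pd.DataFrame(
    [df.loc[i].combine_first(df.loc[j])
     for i, j in zip(matches['index1'], matches['index2'])]
)
print(combined)
```

Each matched pair collapses into one row that carries the non-null values from both, e.g. row 0 picks up the age 23 from row 1.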
Just wanted to clear some doubts regarding your question; I couldn't ask them in the comments due to low reputation.
"Like here if I run it for col "fName". I should be able to reduce this dataframe based on a score threshold:"
So basically your function would return the DataFrame containing the first row in each group (by ID)? That would give the resultant DataFrame listed below:
  _id   fName   lName   age
0 ABCD  Andrew  Schulz  23
2 DEFG  John    boy     14
4 CDGH  Bob     TANNA   13
I hope this code answers your question:
r0 = ['ABCD', 'Andrew', 'Schulz', '']
r1 = ['ABCD', 'Andrew', '', '23']
r2 = ['DEFG', 'John', 'boy', '']
r3 = ['DEFG', 'John', 'boy', '14']
r4 = ['CDGH', 'Bob', 'TANNA', '13']
Rx = [r0, r1, r2, r3, r4]
print(Rx)
print()

# Merge rows sharing the same id, filling in the empty
# lName/age fields from later rows of the same group.
merged = dict()
for row in Rx:
    if row[0] in merged:
        if row[2] != '':
            merged[row[0]][2] = row[2]
        if row[3] != '':
            merged[row[0]][3] = row[3]
    else:
        merged[row[0]] = row
Rx[:] = merged.values()
print(Rx)
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in the data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data, and drop duplicates keeping the first occurrence of each Id. fillna will fill values from the next value found in the column, which may correspond to another Id, but since you will discard the duplicated rows, I believe drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
import numpy as np
import pandas as pd

data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
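If leaking a value across two different Ids with the plain backfill is a concern, one possible refinement (a sketch on a smaller made-up frame, not tested against the full dataset above) restricts the fill to each Id group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['AABBCC', 'Andrew', '', None],
     ['AABBCC', 'Andrew', 'Schulz', None],
     ['AABBCC', 'Andrew', '', 23],
     ['DDEEFF', 'Karl', 'boy', None],
     ['DDEEFF', 'Karl', '', 14]],
    columns=['Id', 'fName', 'lName', 'Age'])
df = df.replace('', np.nan)

# Back-fill within each Id group only, so values never leak
# in from a row belonging to a different Id, then keep the
# first row per Id.
cols = ['fName', 'lName', 'Age']
df[cols] = df.groupby('Id')[cols].bfill()
df_filled = df.drop_duplicates('Id', keep='first')
print(df_filled)
```

This drops the assumption that every Id provides at least one value per column: an Id with a column that is entirely empty simply stays null instead of borrowing from its neighbour.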
Hope this helps, and apologies if I misunderstood the question.