Join two larger dataframes into a smaller MultiIndex dataframe according to matching row numbers?
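I am using recordlinkage to find matches between two dataframes of unequal size. It outputs a MultiIndex dataframe (features) whose first index level (left) holds the row labels of the matches in df_a, and whose second level (right) does the same for df_b. I just want to merge the matching rows of df_a and df_b at the correct indices, like this: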
df_a

       col_a  col_b  col_c
    a
    b
    c
    d
    e
df_b

        col_1  col_2  col_3
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
features

                 match
    left  right
    a     2
          3
    b     7
          8
          9
ending up with merge2:

                 match  col_a  col_b  col_c  col_1  col_2  col_3
    left  right
    a     2
          3
    b     7
          8
          9
Here is the relevant snippet:
    for i in range(0, in_a_lines, chunks):
        if i < in_a_lines - chunks:
            df_a_subset = df_a.iloc[i:i+chunks]
        else:
            df_a_subset = df_a.iloc[i:in_a_lines]
        indexer = recordlinkage.Index()
        indexer.block(left_on=[comp_left], right_on=[comp_right])
        pairs_subset = indexer.index(df_a_subset, df_b)
        comp = recordlinkage.Compare()
        comp.string(left_on=comp_left, right_on=comp_right, method='jarowinkler', threshold=0.85)
        features = comp.compute(pairs_subset, df_a_subset, df_b).rename_axis(['left', 'right'])
        print(str(i+chunks) + "/" + str(in_a_lines) + "\nPotential matches: " + str(len(features)))
        merge1 = df_b.join(features, on=['right'])
        merge2 = df_a_subset.join(merge1, on=['left'])
        merge2.to_csv(out_csv,
                      header=None,
                      index=None,
                      mode='a',
                      chunksize=chunks)
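As a side note on the chunking loop above: iloc slices are clamped to the length of the frame, so the last, shorter chunk probably does not need its own branch. A minimal sketch, assuming a hypothetical helper named iter_chunks and a toy df_a (neither is part of the original code):

```python
import pandas as pd


def iter_chunks(df, chunk_size):
    """Yield consecutive row slices of df, each at most chunk_size rows long."""
    for start in range(0, len(df), chunk_size):
        # Slicing past the end is safe: iloc simply stops at the last row.
        yield df.iloc[start:start + chunk_size]


# Toy stand-in for df_a, used only for illustration.
df_a = pd.DataFrame({"col_a": range(7)})
sizes = [len(chunk) for chunk in iter_chunks(df_a, 3)]
print(sizes)  # [3, 3, 1]
```

Each yielded slice keeps the original index labels of df_a, which matters because the pairs returned by recordlinkage refer to those labels.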
I simply had the order mixed up. The left dataframe (features) is the one that needs to call join:
merge1 = features.join(df_a_subset, on='left', how = 'inner')
merge2 = merge1.join(df_b, on='right', how = 'inner')
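A minimal, self-contained sketch of this join order, assuming a hand-built features frame in place of the recordlinkage output (all column values and match scores here are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the two input frames.
df_a = pd.DataFrame(
    {"col_a": ["A0", "A1", "A2", "A3", "A4"]},
    index=list("abcde"),
)
df_b = pd.DataFrame(
    {"col_1": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]},
    index=range(1, 11),
)

# Hand-built replacement for the recordlinkage result:
# a MultiIndex (left, right) of matched labels plus one score column.
pairs = pd.MultiIndex.from_tuples(
    [("a", 2), ("a", 3), ("b", 7), ("b", 8), ("b", 9)],
    names=["left", "right"],
)
features = pd.DataFrame({"match": [1.0] * len(pairs)}, index=pairs)

# features is the caller; `on` names an index level of features,
# which is matched against the index of the other frame.
merge1 = features.join(df_a, on="left", how="inner")
merge2 = merge1.join(df_b, on="right", how="inner")
print(merge2)
```

Because features drives the join, the result has one row per matched pair rather than one row per row of df_a or df_b.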
If you want to drop the columns that hold the record linkage match criteria, add:
merge2 = merge2.drop(merge2.columns[[0,1]], axis=1)
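Note that merge2.columns[[0, 1]] selects by position, so it removes whichever two columns happen to come first. A small sketch on a toy frame (the column names are assumptions for illustration only):

```python
import pandas as pd

# Toy frame standing in for merge2; the column names are illustrative.
merge2 = pd.DataFrame(
    {"match": [1.0], "col_a": ["x"], "col_b": ["y"], "col_1": [1]}
)

# Positional drop: removes the first and second columns, whatever they are.
trimmed = merge2.drop(merge2.columns[[0, 1]], axis=1)
print(list(trimmed.columns))  # ['col_b', 'col_1']
```

It is worth checking the column order of your own merge2 before dropping by position.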